Introduction

Here you can find lab notes and resources for Econ 401/511 Fall 2024. These will be updated after our in-class lab sessions. These notes are not a substitute for attending lab but serve as an additional resource.

Much of the lab content will be drawn from R for Data Science and from “Introduction to Mathematics for Economics with R” by Massimiliano Porto.

Useful R Resources

Getting Started with R – A collection of resources for those getting started with R

TidyR Cheatsheet – A useful cheatsheet for data cleaning and tidy data using the tidyverse functions

ggplot2 Cheatsheet – A useful cheatsheet for various ggplot geoms

Tidy Tuesday Repo – A weekly data science project in R to test your tidyverse skills!

R Setup and Workflow Basics

Before R: File Management

Effective file folder management is crucial for maintaining an organized and efficient digital workspace. Setting up organized folders will make your life significantly easier in the future!

  • Use your computer’s file management tools or RStudio’s “Files” tab.
  • Create a main folder for your project or course.
  • Go to the desired folder and create a new folder, e.g., “Lab1”
  • Save all downloaded data files and R Scripts in this folder

Installation

R can be installed on Windows, Mac, and Linux systems. You can download R here by selecting “CRAN” from the sidebar, choosing a mirror close to your location, and choosing the operating system you have: Download R

You will also need to install RStudio. RStudio is an integrated development enviroment (IDE) that makes working with R much more user friendly. You can download RStudio here.

When you want to use R, you will open RStudio to do your coding. You do not need to open R itself.

A Look Around RStudio

In the bottom left of RStudio, you will see the console. The console executes code. You can type code and execute it using the console but the code is not saved when you close R Studio. It is recommended that you do not use the console in your regular workflow.

To save your work, you should code in an R Script. Open a script using the button that looks like a piece of paper with a green plus sign in the top left corner of R Studio.

R scripts will open here. You can code, comment, and run the code from your script. To run the code, either click the “Run” button or by pressing CMD+Enter (Mac) or Ctrl+Enter (Windows). R scripts will be saved to the folder you are currently working in.

In the top left corner, we have the workspace/environment panes.

The workspace/enviroment tab tells you what objects are stored in R (i.e. what is loaded or stored in memory). The History tab which shows previous commands you have run.

Last, on the bottom right, we have several tabs including:

  • Files - shows the files on your computer in the directory you are working in
  • Viewer - can vew data or R objects
  • Help - shows help documentations for R functions and datasets
  • Plots - can see current and previous plots generated in your R session, save, and export them to png/pdf formats.
  • Packages - list of R packages you have installed. You can also install packages directly from this tab.

R Basics

Packages

When using R, you will almost certainly work with packages. A package is a collection of functions, data, and compiled code that is bundled together for easy sharing and reuse. Packages extend the functionality of R by providing tools for specific tasks, such as data manipulation, statistical modeling, or visualization. Users can install packages from repositories like CRAN (Comprehensive R Archive Network) or GitHub, and once installed, they can be loaded into an R session using the library() function to access the package’s features.

For example, let’s install the ggplot2 package. There are two ways to do this. First, we can write a line of code:

#install.packages("ggplot2")

Alternatively, you can install packages in RStudio by clicking on the packages tab

and then typing the name of the package in the pop-up window.

Once a package is installed, you need to load the package using library(). You will need to load the package you want to use anytime you start a new R session.

library(ggplot2)

Sometimes, it can be burdensome to use the library function when you have many packages to load. The pacman package helps with this by allowing you to load multiple packages in two lines of code! Install and load the pacman package, then use p_load() to load packages–you can load multiple at a time by separating the names with a comma.

library(pacman)
p_load(ggplot2)

Key Things to Know about R

Objects and the Assignment Operator

R is an object-oriented programming language, meaning that R organizes its code and data into objects. These objects can represent anything from simple data structures like numbers and strings to more complex things like models, plots, or datasets. The assignment operator <- is used to assign values to objects.

x <- 5 #assign the value of 5 to a variable called x
# notice that this x is now in your global environment
x # print x
## [1] 5
y = 10
y
## [1] 10

You can combine objects together as well which lets us do some basic math operations.

# create a new object called z that is equal to x*y
z <- x * y
#print z
z
## [1] 50

If you do not create an object, R will not save it to the global environment. If an object is not in the global environment and you try to reference it later, R will not know what you are referring to.

Classes

In R, there are different types of objects. We can check the type of object with the class() function.

class(x)
## [1] "numeric"

There are other classes too!

# create a string
my_string <- "Economics is my favorite subject!"
class(my_string)
## [1] "character"
# logical class
class(2>3)
## [1] "logical"

What happens if we have a vector of characters and numbers?

char_vector <- c(1:5, "banana", "apple")
char_vector
## [1] "1"      "2"      "3"      "4"      "5"      "banana" "apple"
#cant use mathematical operations on characters
# why?? because the entire vector is a character class!
class(char_vector)
## [1] "character"

We can coerce objects into other classes. For example, let’s say we have an objects below:

a <- 5
b <- "2"

If we try to combine these objects into a new one called ab that is equal to a*b, we will get an error. Why? Because a is a numeric object and b is a character object.

class(a)
## [1] "numeric"
class(b)
## [1] "character"

We can coerce b into numeric using the as.numeric() function and then we can perform the operation.

b <- as.numeric(b)

ab <- a*b
ab
## [1] 10

Other functions to coerce a class change are: - as.integer() - as.character() - as.data.frame() - etc.

Note that R does not know how to convert a string of letters into numeric and will give us a warning message. For example,

my_string <- as.numeric(my_string)
## Warning: NAs introduced by coercion

The c() Function

The c() function is used to concatenate items separated by a comma.

d <- c(1, 2, 3, 4, 5)
d
## [1] 1 2 3 4 5
e <- c("a", "b", "c", "d", "e")
e
## [1] "a" "b" "c" "d" "e"

You can also concatenate previously generated objects. Note that the c() function cannot store items with different classes and R will coerce the different types to one type.

dab <- c(d, a, b)
dab
## [1] 1 2 3 4 5 5 2
de <- c(d, e)
de
##  [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"

The list() function can be used to store objects in a single object keeping their original class characterisitics.

de_list <- list(d,e)
de_list
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a" "b" "c" "d" "e"

The Square Bracket Operator

The square bracket [ ] allows us to subset, extract, or replace a part of an object. For example, we can select the first entry in our object e

e[1]
## [1] "a"

Running e again will show that no modification has been made to the object. We can use the square bracket to replace parts of the object.

e[1] <- "m"
e
## [1] "m" "b" "c" "d" "e"

We can also combine the c() operator and the square bracket to subset more than one value.

e[c(1,3)]
## [1] "m" "c"

Now, let’s apply the square bracket operator to a two-dimensional object: a data frame.

df <- data.frame(numbers = d,
                 letters = e)
df
##   numbers letters
## 1       1       m
## 2       2       b
## 3       3       c
## 4       4       d
## 5       5       e

The structure of our object df is rows per columns. We can use the square bracket operator but need to specify the row and the column.

df[4, 2]
## [1] "d"

This gives us the item in row 4 and column 2. If we want to select all of the rows for the first column, we can leave the row blank. Likewise, we can do the same for columns in row 1.

df[, 1]
## [1] 1 2 3 4 5
df[1, ]
##   numbers letters
## 1       1       m

Last, we can replace elements in the data frame with the square bracket operator.

df[1,1] <- 10
df
##   numbers letters
## 1      10       m
## 2       2       b
## 3       3       c
## 4       4       d
## 5       5       e

Code Style

Coding style is the punctuation and grammer of the coding world. Making sure that your code is formatting in a readable, standard format is helpful for yourself and others to understand the code. We will follow guidelines from the tidyverse style guide

Spaces: Put spaces on either side of mathematical operators (i.e. +, -, ==, <, …), and around the assignment operator (<-). The exception to this is the ^ symbol.

# Strive for
z <- (a + b)^2 / ab

# Avoid
z<-( a + b ) ^ 2/ab

Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma.

# Strive for
mean(x, na.rm = TRUE)
## [1] 5
# Avoid
mean (x ,na.rm=TRUE)
## [1] 5

Adding extra spaces is fine if it helps with alignment. For example:

example_data_frame <- 
  data.frame(
    variable1      = c(1:10),
    variable_name2 = c(2:11),
    var_name       = c(3:12)
  )

Naming Conventions: Object names must start with a letter and can only contain letters, numbers, _, and .. The names should be descriptive–snake case is the recommended naming convention (separating lowercase words with _).

really_long_variable_name <- 1

Commenting: You can comment your code with #. It is strongly recommended to leave comments in your code so that others, and future you, can keep track of your thought process.

# Good code is well-commented code!!

You can also create section comments that will be collapsable. This is incredibly helpful when you have a really long R script! Any comment line which includes at least four trailing dashes (-) will create a section. Sections will appear in the table of contents in your R script (above the Console) and allows for quick navigation.

# This is section 1 ----

# ---- This is section 2 ----

Graphing with Ggplot

The basic setup of making a ggplot requires three things: the data, the aesthetic mapping, and a geom. The aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms, like which variables are on the axes, the variable to color or fill by, etc. The geoms tell R how to draw the data like points, lines, columns, etc.

In general, we can make a ggplot by typing the following: ggplot(data = ) + (mapping = aes())

The way ggplot works is by adding layers. We can add a new layer with the + sign. Let’s build a ggplot step by step. First, start with ggplot() and tell R what data we are using.

df <- data.frame(xpoints = seq(-10, 10, 1), ypoints = seq(-10, 10, 1))
ggplot(data = df)

Why did this make a blank graph? Well, we haven’t given R the aesthetic mapping yet so it doesn’t know what to put on top of the base layer. Let’s add the x and y variables.

ggplot(data = df, mapping = aes(x = xpoints, y = ypoints)) # note you can put the mapping here or in the geom

Now we have a graph with axes and gridlines but no information on the graph. To get data on the graph, we need to tell R how we want to draw the data with a geom. To make a scatterplot, we use geom_point().

ggplot(data = df, mapping = aes(x = xpoints, y = ypoints)) + geom_point()

Ggplot allows even more layering. We can add a line through these points with geom_line().

ggplot(data = df, mapping = aes(x = xpoints, y = ypoints)) + geom_point() + geom_line()

Linear Functions

First, we need some data to start with. Let’s generate an x variable from -10 to 10. Then, graph the equation \(y = 4 + 3x\).

df <- data.frame(x = seq(-10, 10, 1))

ggplot(df, aes(x = x)) +
  stat_function(fun = function(x) 4 + 3*x, color = "blue")

Now, let’s make the graph look a bit nicer and add the functions \(y = 3x\) and \(y = -4 + 3x\).

ggplot(df, aes(x = x)) +
  stat_function(fun = function(x) 4 + 3*x, color = "blue") +
  stat_function(fun = function(x) 3*x, color = "green") +
  stat_function(fun = function(x) -4 + 3*x, color = "red") +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  theme_minimal() +
  ggtitle("Graph of Linear Functions")

Linear Application to Economics

Let’s start by graphing a firm’s cost functions. We can decompose a firm’s total cost into fixed cost (FC) and variable cost (VC). Assume the firm has a fixed cost of $5000 and a variable cost of $125 per output. Let’s graph the total cost.

x <- seq(0, 50, 1)
FC <- 5000
VC <- 125
TC <- FC + VC*x

df <- data.frame(output = x, 
                 total_cost = TC)

head(df)
##   output total_cost
## 1      0       5000
## 2      1       5125
## 3      2       5250
## 4      3       5375
## 5      4       5500
## 6      5       5625
ggplot(df, aes(x = output, y = total_cost)) + 
  geom_line(color = "blue") +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(x = "Output",
       y = "Total Cost") +
  theme_minimal()

Next, let’s graph profit, revenue, and cost equations to find the break-even profit. Assume the firm sells the product for $250 each and has the total cost function from above.

p <- 250
R <- p*x
pi <- R - TC

df <- cbind(df, revenue = R, profit = pi)
head(df)
##   output total_cost revenue profit
## 1      0       5000       0  -5000
## 2      1       5125     250  -4875
## 3      2       5250     500  -4750
## 4      3       5375     750  -4625
## 5      4       5500    1000  -4500
## 6      5       5625    1250  -4375
ggplot(df) +
  geom_line(aes(x = output, y = total_cost, color = "total_cost")) +
  geom_line(aes(x = output, y = revenue, color = "revenue")) +
  geom_line(aes(x = output, y = profit, color = "profit")) +
    geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(x = "Output", y = "") +
  theme_minimal() + 
  theme(legend.title = element_blank())

The breakeven profit is where the firm sells 40 units of output.

Quadratic Functions

We can use similar code to graph other types of functions as well. For example, the stat_function can be used to graph quadratic functions:

df <- data.frame(x = seq(-10, 10, by = 0.1))

ggplot(df, aes(x = x)) +
  stat_function(fun = function(x) x^2, color = "blue") +
  stat_function(fun = function(x) x^2 + 5, color = "orange") +
  stat_function(fun = function(x) 4*x^2, color = "red") +
  stat_function(fun = function(x) 4*x^2 + 10, color = "purple") +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  theme_minimal() +
  ggtitle("Graph of Quadratic Functions")

Quadratic Application to Economics

Suppose we have the cost function \(C(x) = 0.01x^2 + x + 10\). Let’s plot the total cost, fixed cost, variable cost, and the average cost.

x <- seq(0, 50, 1)
FC <- 10
VC <- 0.01*x^2 + x
TC <- FC + VC

df <- data.frame(output = x,
                 total_cost = TC,
                 fixed_cost = FC,
                 variable_cost = VC
                )

head(df)
##   output total_cost fixed_cost variable_cost
## 1      0      10.00         10          0.00
## 2      1      11.01         10          1.01
## 3      2      12.04         10          2.04
## 4      3      13.09         10          3.09
## 5      4      14.16         10          4.16
## 6      5      15.25         10          5.25

Next, we need to compute the average cost and add it to our data frame. Recall, average cost is the total cost divided by output.

df$average_cost <- df$total_cost / df$output
head(df)
##   output total_cost fixed_cost variable_cost average_cost
## 1      0      10.00         10          0.00          Inf
## 2      1      11.01         10          1.01    11.010000
## 3      2      12.04         10          2.04     6.020000
## 4      3      13.09         10          3.09     4.363333
## 5      4      14.16         10          4.16     3.540000
## 6      5      15.25         10          5.25     3.050000

Oh no! the first value of average cost is undefined… we should remove that.

df <- df[-1,]
head(df)
##   output total_cost fixed_cost variable_cost average_cost
## 2      1      11.01         10          1.01    11.010000
## 3      2      12.04         10          2.04     6.020000
## 4      3      13.09         10          3.09     4.363333
## 5      4      14.16         10          4.16     3.540000
## 6      5      15.25         10          5.25     3.050000
## 7      6      16.36         10          6.36     2.726667

Much better! Let’s also do one more cool trick to make graphing a bit easier for us. What if instead of specifying separate groups and writing functions three or more times (like we did before!), we could just do one line of code in our ggplot? To do this, we need to reshape the data from wide to long.

df_long <- reshape(
  df,
  varying = c("total_cost", "fixed_cost", "variable_cost", "average_cost"),
  v.names = "cost", #new column for the cost value
  timevar = "cost_type",#new column for the type of cost 
  times = c("total_cost", "fixed_cost", "variable_cost",  "average_cost"), #names of the types of cost
  idvar = "output",
  direction = "long"
)

head(df_long)
##              output  cost_type  cost
## 1.total_cost      1 total_cost 11.01
## 2.total_cost      2 total_cost 12.04
## 3.total_cost      3 total_cost 13.09
## 4.total_cost      4 total_cost 14.16
## 5.total_cost      5 total_cost 15.25
## 6.total_cost      6 total_cost 16.36
tail(df_long)
##                 output    cost_type     cost
## 45.average_cost     45 average_cost 1.672222
## 46.average_cost     46 average_cost 1.677391
## 47.average_cost     47 average_cost 1.682766
## 48.average_cost     48 average_cost 1.688333
## 49.average_cost     49 average_cost 1.694082
## 50.average_cost     50 average_cost 1.700000

Notice we have a long-form data frame with a column that specifies the type of cost. This was not necessary to do but will make graphing MUCH easier!

ggplot(df_long, aes(x = output, y = cost, group = cost_type, color = cost_type)) +
    geom_line() +
    geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) +
    theme_minimal() +
    ggtitle("Firm Cost Functions")

Writing Functions in R

Before we talk about functions in R, we should first make sure we understand what a function is in general. You all are probably used to thinking of something like:

     `f(x)= x^2`

But what really does this even mean? Why is this a useful tool? - It takes some input (a number), and gives us an output (in this case, the squared number, which is indeed also a number). We can represent the function \(f\) above as: Number -> ADifferentNumber where that arrow is something we call a mapping. A mapping links one object to another. This doesn’t have to be two numbers, it could be a word and a dataframe, or two words. For instance, I could come up with a function called color labeler and then pass it some object, which it will map to a color.

color_labeler(banana) = yellow

Where color_labeler is the name of the function, that maps fruits(inputs) to colors(outputs)

Knowing all of this, we need to go back to lecture 1 to remind ourselves about R-facts before we make a function.

  • Everything in R has a name, and everything is an object
  • Functions have inputs/arguments and outputs

With these in mind, let’s write a function that will take some x, and spit out x^2 + 10. That is, let’s write the code for f(x)=x^2+10.

Everything in R needs a name That includes our function.
We do this by starting with a name, setting it equal to a ‘function()’ function.

In general, this function looks like function([some_set_of_arguments]){your function code}. function() is a special operator that takes any arguments you want in the parentheses, and then lets you manipulate them in any way you see fit. Think of the parentheses here as the toys you’re giving your computer to play with inthe sandbox.

squarePlusten = function(x){
  #tell squarePlusten what to do. x is an input here, we can tell our function to transform our variable into
  #something else.
  x_squaredten = x^2+10
  #Now, in order to make use of this value, we need our function to spit something out. 
  #We do this with another special function, 'return()'. 
  return(x_squaredten)
}

notice however: x_squaredten isn’t equal to anything now that our function has been run. That’s because x_squaredten is only defined in the context of your written function.

x_squaredten

Now, we can use the function by calling the function name and giving it an input:

#check for 10. should be 110
squarePlusten(10)
## [1] 110

A couple things to note:

  • Brackets around functions lets R know where the function starts and ends.
  • Technically we do not need them for one line functions
  • Return: this tells the function what it should return for us. We don’t hold onto any temporary objects created while your function runs, though the function can modify objects outside of itself.

Remember, functions don’t just change numbers to numbers. A more accurate way to think of functions is as a map from one object to another. The starting point is the argument, and the end point is the output. The functions take an object as an input.

Now, let’s write a function to graph tangent lines (we will use this later). We will need the packages tidyr and ggplot2, so be sure to install and load those. The function rearranges and plots data. We use pivot_longer() to reshape all the data except column x. The %>% operator is from tidyr that pipes an object forward into a function.

library(tidyr)
library(ggplot2)

tangent_line <- function(df_fn, XLIM = NULL,  YLIM = NULL,  XLAB = "x", YLAB = "y"){
  df1 <- df_fn %>% pivot_longer(!x)
  
  g <- ggplot() +
    geom_line(data = df1, aes(x = x, y = value, group = name, color = name)) +
    geom_hline(yintercept = 0) + 
    geom_vline(xintercept = 0) +
    coord_cartesian(xlim = XLIM, ylim = YLIM) +
    labs(x = XLAB, y = YLAB) + 
    theme_minimal()
  
  return(g)
}

Derivatives

We can compute derivatives with R by using the Deriv() function in the Deriv package. Be sure to install and load the package.

library(Deriv)

Let’s find the derivative of the function \(y = x^2\). First, we generate an expression containing the derivative we want to compute, stored in the object y.

y <- expression(x^2)

This expression will be the the first entry of the Deriv() function; the second entry is a character vector that tells the function which variable to take the derivative with respect to.

dydx <- Deriv(y, "x")
dydx
## expression(2 * x)

Let’s now compute the derivative of \(y = 2x^2 + 3x^3\).

y <- expression(2*x^2 + 3*x^3)
dydx <- Deriv(y, "x")
dydx
## expression(x * (4 + 9 * x))

Let’s try a log and an exponential function.

y <- expression(log(x))
dydx <- Deriv(y, "x")
dydx
## expression(1/x)
y <- expression(exp(x))
dydx <- Deriv(y, "x")
dydx
## expression(exp(x))

We can compute the second derivative by adding the argument nderiv.

y <- expression(x^2)
d2ydx2 <- Deriv(y, "x",  nderiv = 2)
d2ydx2 
## expression(2)

Let’s use our tangent_line() function from earlier. Suppose we have the function \(y = x^3 - 4x^2 + x + 6\) and want to graph the tangent lines at the points (5, 36), and (-2, -20). First, find the derivative and plug in the values of x.

y <- expression(x^3 - 4*x^2 + x + 6)
dydx <- Deriv(y, "x")
dydx
## expression(1 + x * (3 * x - 8))

The derivative is equal to \(3x^2 - 8x + 1\). At (5, 36) the equation of the tangent line is \(y = 36x - 144\) and at (-2, -20) the tangent line is \(y = 29x + 38\).

x <- seq(-10, 10, 0.1)
y <- x^3 - 4*x^2 + x + 6
tg1 <- 36*x - 144
tg2  <- 29*x + 38

df <- data.frame(x, y, tg1, tg2)

tangent_line(df, XLIM = c(-10, 10),  YLIM = c(-40, 40))

Derivative Applications in Economics

Marginal Cost

Given the following total cost function \(TC = VC_3 Q^3 - VC_2 Q^2 + VC_1 Q + FC\), the marginal cost is the derivative.

TC <- expression(0.009*Q^3 - 0.5*Q^2 + 15*Q  + 35)
MC <- Deriv(TC,  "Q")
MC
## expression(15 + Q * (0.027 * Q - 1))

To find the marginal cost at specific values of Q, we need to specify a sequence of output values and use the eval function.

Q <- seq(0, 50 , by = 1)
MC <- eval(parse(text = MC))
head(MC)
## [1] 15.000 14.027 13.108 12.243 11.432 10.675

Now, let’s code functions to compute the total cost, marignal cost, and the y-intercept of a linear function.

# total cost function
total_cost <- function(Q, VC1, VC2, VC3, FC){
  TC <- VC3*Q^3 + VC2*Q^2 + VC1*Q + FC
  return(TC)
}

# marginal cost function
marginal_cost <- function(Q, VC1, VC2, VC3, FC){
  TC <- expression(VC3*Q^3 + VC2*Q^2 + VC1*Q + FC)
  MC <- Deriv(TC, "Q")
  return(eval(parse(text = MC)))
}

# y-intercept function
y_intercept <- function(TC, MC, Q){
  a <- TC - MC*Q
  return(a)
  } 

Now, we can find the tangent lines to the cost function. Let’s set \(Q = 10\) and \(Q = 45\).

# total costs
TC10 <- total_cost(10, 15, -0.5, 0.009, 35)
TC10
## [1] 144
TC45 <- total_cost(45, 15, -0.5, 0.009, 35)
TC45
## [1] 517.625
# marginal costs
MC10 <- marginal_cost(10, 15, -0.5, 0.009, 35)
MC10
## [1] 7.7
MC45 <- marginal_cost(45, 15, -0.5, 0.009, 35)
MC45
## [1] 24.675
# y-intercepts
a10 <- y_intercept(TC10, MC10, 10)
a45 <- y_intercept(TC45, MC45, 45)

# tangent lines
tg10 <- a10 + MC10*Q
tg45 <- a45 + MC45*Q

We want all of the total cost and marginal cost values for our sequence of output. We can use the functions to do this!

Q <- seq(0, 50 , by = 1)
tc <- total_cost(Q, 15, -0.5, 0.009, 35)
mc <- marginal_cost(Q, 15, -0.5, 0.009, 35)

df <- data.frame(x = Q, 
                 total_cost = tc,
                 maringal_cost = mc,
                 tangent10 = tg10,
                 tangent45 = tg45)
head(df)
##   x total_cost maringal_cost tangent10 tangent45
## 1 0     35.000        15.000      67.0  -592.750
## 2 1     49.509        14.027      74.7  -568.075
## 3 2     63.072        13.108      82.4  -543.400
## 4 3     75.743        12.243      90.1  -518.725
## 5 4     87.576        11.432      97.8  -494.050
## 6 5     98.625        10.675     105.5  -469.375
tangent_line(df, XLIM = NULL, YLIM = c(0, 600), XLAB = "Output", YLAB = "Cost")

Marginal and Average Cost

Let’s add the average cost for this firm. Adding the average cost to the data frame is the same produre as we have seen previously.

df2 <- data.frame(x = Q, 
                 marginal_cost = mc)

average_cost <- tc/Q

df2 <- cbind(df2, average_cost)
df2 <- df2[-1,]
head(df2)
##   x marginal_cost average_cost
## 2 1        14.027     49.50900
## 3 2        13.108     31.53600
## 4 3        12.243     25.24767
## 5 4        11.432     21.89400
## 6 5        10.675     19.72500
## 7 6         9.972     18.15733

Now, let’s make the plot.

names(df2) <- c("output", "MC", "AC")
ggplot(df2) + 
  geom_line(aes(x = output, y = AC), color = "blue") +
  geom_line(aes(x = output, y = MC), color = "red") +
  theme_minimal()

Matrices

In R, we build a matrix using the matrix() function. The first entry is the data, nrow is the number of rows, and ncol is the number of columns. the argument byrow can be set to TRUE, which fills the matrix by columns, or FALSE (the default), which fills the matrix by rows.

For example,

A <- matrix(c(1, 2, 3, 
              4, 5, 6),
            nrow = 2,
            ncol = 3,
            byrow = FALSE)

A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

gives the matrix \[ \mathbf{A} = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \\ \end{bmatrix} \] and

A <- matrix(c(1, 2, 3, 
              4, 5, 6),
            nrow = 2,
            ncol = 3,
            byrow = TRUE)

A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

gives the matrix \[ \mathbf{A} = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \\ \end{bmatrix} \]

Matrix Operations

Recall from lecture, addition of matrices requires the matrices to have the same size. We define two matrices: \[ \mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ \end{bmatrix} \] \[ \mathbf{B} = \begin{bmatrix} -2 & 3 \\ 5 & -1 \\ 2 & 2 \\ \end{bmatrix} \] To add these in R:

A <- matrix(c(1, 2,
              3, 4,
              5, 6),
              nrow = 3,
              ncol = 2,
              byrow = T)

B <- matrix(c(-2,  3,
              5, -1,
              2, 2),
              nrow = 3,
              ncol = 2,
              byrow = T)

A + B
##      [,1] [,2]
## [1,]   -1    5
## [2,]    8    3
## [3,]    7    8

For scalar multiplication, we can specify a constant and multiply it with a matrix object using *.

6*A
##      [,1] [,2]
## [1,]    6   12
## [2,]   18   24
## [3,]   30   36

Recall, from lecture, that to multiply two matrices the column dimension of the first matrix must be equal to the row dimension of the second matrix. In R, we compute matrix multiplication with %*%

For example, suppose we have \[ \mathbf{A} = \begin{bmatrix} 2 & 5\\ \end{bmatrix} \] and \[ \mathbf{B} = \begin{bmatrix} -2 & 3 &6 \\ 9 & 0 & 2 \\ \end{bmatrix} \] Can we multiply these matrices? Yes, A is a a 1x2 matrix and B is a 2x3. We should get a 1x3 matrix after multiplying.

A <- matrix(c(2, 5),
            nrow = 1,
            ncol = 2,
            byrow = T)

B <- matrix(c(-2, 3, 6,
              9, 0, 2),
            nrow = 2,
            ncol = 3,
            byrow = T)
A %*% B
##      [,1] [,2] [,3]
## [1,]   41    6   22

Now suppose we have the matrices \[ \mathbf{A} = \begin{bmatrix} -2 & 3 & 4 & 6\\ 4 & -4 & 3 & 0 \\ 1 & 8 & 5 & 3\\ \end{bmatrix} \] \[ \mathbf{B} = \begin{bmatrix} 1 & -2 \\ 5 & 3 \\ 5 & 4 \\ 7 & 8 \\ \end{bmatrix} \] We can multiply these in R:

A <- matrix(c(-2, 3, 4, 6,
              4, -4, 3, 0,
              1, 8, 5, 3),
            nrow = 3,
            ncol = 4,
            byrow = T)

B <- matrix(c(-1, 2,
              5, 3,
              5, 4,
              7, 8),
            nrow = 4,
            ncol = 2,
            byrow = T)

A %*% B
##      [,1] [,2]
## [1,]   79   69
## [2,]   -9    8
## [3,]   85   70

Systems of Linear Equations

Suppose we have the system of equations: \[ x + y = 4 \]

\[ 2x + y = 7 \] Recall, we can use matrices to write systems of linear equations in \(Ax = d\) format. For this system, \[ \begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ \end{bmatrix} = \begin{bmatrix} 4 \\ 7 \\ \end{bmatrix} \]

To solve the system in R, we use the solve() function.

A <- matrix(c(1, 1,
              2, 1),
              nrow = 2,
              ncol = 2,
              byrow = T)

d <- c(4, 7)

solve(A, d)
## [1] 3 1

For this system \(x^* = 3\) and \(y^* = 1\).

Next, let’s solve the system \[ x_1 + 2x_2 + 3x_3 + 5x_4 = 5 \]

\[ 2x_1 + 3x_2 + 5x_3 + 9x_4 = 4 \]

\[ 3x_1 + 4x_2 + 7x_3 + x_4 = 0 \] \[ 7x_1 + 6x_2 + 5x_3 + 4x_4 = 3 \]

In R, we solve:

A <- matrix(c(1, 2, 3, 5,
              2, 3, 5, 9,
              3, 4, 7, 1,
              7, 6, 5, 4),
            nrow = 4,
            ncol = 4,
            byrow = T)

d <- c(5, 4, 0, 3)

solve(A, d)
## [1] -5.03125  8.46875 -2.71875  0.25000

Determinants

Determinants can be found using the det() function. Recall, matrices must be square.

For a 2x2 matrix:

A <- matrix(c(3, 2,
              2, 6),
            nrow = 2,
            ncol = 2,
            byrow = T)

det(A)
## [1] 14

For a 3x3 matrix:

A <- matrix(c(2, 4, 3,
              -1, 3, 0,
              0, 2, 1),
            nrow = 3,
            ncol = 3,
            byrow = T)

det(A)
## [1] 4

For a 4x4 matrix:

A <- matrix(c(-2, 3, 4, 1,
              4, -4, 3, 0,
              1, 2, 5, 3,
              -1, -2, 5, 3),
            nrow = 4,
            ncol = 4,
            byrow = T)

det(A)
## [1] 294

Cramer’s Rule

Suppose we want to solve the following system using Cramer’s Rule: \[ 2x + y - z = 4\]

\[x - 2y + z = 1\]

\[3x - y - 2z = 3\]

In matrix form, this is: \[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & -2 & 1 \\ 3 & -1 & -2 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 3 \\ \end{bmatrix} \]

There is no built-in function in R for Cramer’s Rule, so we will have to do the steps ourselves. First, create the \(A\) matrix and the vector of constants \(d\).

A <-  matrix(c(2, 1, -1,
               1, -2, 1,
               3, -1, -2),
             nrow = 3,
             ncol = 3,
             byrow = T)

d <- c(4, 1, 3)

Next, we will create new matrices for the \(A\) matrix with the \(d\) column inserted.

Ax <- A
Ax[,1] <- d

Ay <- A
Ay[,2] <- d

Az <- A
Az[,3] <- d

Now we can use these matrices to solve with Cramer’s Rule.

x <- det(Ax)/det(A)
x
## [1] 2
y <- det(Ay)/det(A)
y
## [1] 1
z <- det(Az)/det(A)
z
## [1] 1

Of course, you can always write your own function to do Cramer’s Rule!

cramers_rule <- function(A, d) {
  n <- nrow(A)
  det_A <- det(A)  
  if (det_A == 0) {
    stop("The determinant of A is 0, so the system has no unique solution.")
  }
  
  # Initialize solution vector
  X <- numeric(n)
  
  # Loop through each variable
  for (i in 1:n) {
    A_i <- A  # Copy of matrix A
    A_i[, i] <- d  # Replace i-th column with vector d
    X[i] <- det(A_i) / det_A  
  }
  
  return(X)
}

Let’s use the function:

cramers_rule(A, d)
## [1] 2 1 1

Super fast and easy! :D

Jacobian Matrix

Suppose that the demand functions for good 1, \(Q_1\), and good 2, \(Q_2\), are the following: \[ Q_1 = 4P_1^{3/2}P_2^{1/2}Y\]

\[ Q_2 = 2P_1^{1/2}P_2^{1/2}Y\] Suppose that \(P_1^* = 4\), \(P_2^* = 6\) and \(Y^* = 2000\).

First, write the function:

f <- function(x){
  c(4*x[1]^(3/2)*x[2]^(1/2)*x[3],
    2*x[1]^(1/2)*x[2]^(1/2)*x[3])
}

Next, we can find the Jacobian matrix (plugging in the values for \(P_1^* = 4\), \(P_2^* = 6\) and \(Y^* = 2000\)) by using the jacobian() function from the pracma package. Make sure to install and load the package before using.

library(pracma)

J <- jacobian(f, c(4, 6, 2000))
J
##          [,1]      [,2]      [,3]
## [1,] 58787.75 13063.945 78.383672
## [2,]  2449.49  1632.993  9.797959

Optimization with R

Determining Critical Values

[Example 1:] Find the minimum of the function: \(z = x^3 + 8y^3 - 12xy\)

First, write your function. Then, use the optim() function to find the critical values. Note that the optim() function will minimize a function by default. The function also requires initial values–these are given as 0,0 in the function.

f <- function(x){
  x[1]^3 + 8*x[2]^3 - 12*x[1]*x[2]
}

optim(c(0,0), f)
## $par
## [1] 2 1
## 
## $value
## [1] -8
## 
## $counts
## function gradient 
##      143       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

par returns the best set of parameters (i.e., the critical values) and value returns the value of the function at the critical values.

[Example 2:] Find the critical values of the following function: \(z = -2x^2 - y^2 + 2xy + 4x\)

To find the maximums, use control=list(fnscale=-1) as an argument in the optim() function.

f <- function(x){
  -2*x[1]^2 - x[2]^2 + 2*x[1]*x[2] + 4*x[1]
}

optim(c(0,0), f, control=list(fnscale=-1))
## $par
## [1] 2 2
## 
## $value
## [1] 4
## 
## $counts
## function gradient 
##      133       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Multi-Product Profit Maximization Example

Suppose that for our problem, we have (inverse) demand equations for two products and a cost function. \[ P_1 = 38 - Q_1 - 2Q_2\] \[ P_2 = 90 - 2Q_1 - 4Q_2\] \[ C = 3Q_1^2 - 2Q_1Q_2 + 2Q_2^2 + 100\]

Consequently then, the profit function is: \[\pi = -4Q_1^2 - 6Q_2^2 - 2Q_1Q_2 + 38Q_1 + 90Q_2 - 100 \]

profit <- function(Q){
  -4*Q[1]^2 - 6*Q[2]^2 - 2*Q[1]*Q[2] + 38*Q[1] + 90*Q[2] - 100
}

optim(c(0,0), profit, control = list(fnscale = -1))
## $par
## [1] 2.999819 6.999740
## 
## $value
## [1] 272
## 
## $counts
## function gradient 
##       83       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Constrained Optimization

Let’s optimize the function \(z = xy + 2x\) subject to \(2x + 5y = 90\) using the nloptr package.

# Load the nloptr package
library(nloptr)

First, we need to define the objective function and the constraint. Note that just like the optim() function, nloptr minimizes by default, so if we want to maximize, we need to negate the objective function.

objective_fn <- function(x) {
  # x[1] represents x and x[2] represents y
  return(- (x[1] * x[2] + 2 * x[1]))  
}

constraint_fn <- function(x) {
  # Write the constraint as 2x + 5y - 90
  return(2 * x[1] + 5 * x[2] - 90)
}

Then, we need to set some initial values. x0 is the initial guess or the “starting point”. We also need to set lower and upper bounds for x and y. Here, x and y will be non-negative so we set the lower bounds at 0,0.

x0 <- c(1, 1)           
lower_bounds <- c(0, 0)  
upper_bounds <- c(100, 100)  

Next, we define the options for the nloptr function. For optimization problems with equality constraints, local_opts is required. The main algorithm, specified by opts, is NLOPT_LN_AUGLAG_EQ, an augmented Lagrangian method designed to handle equality constraints in optimization. The option xtol_rel sets a relative tolerance for the solution, meaning the algorithm will stop when the improvement between iterations falls below this threshold, ensuring a reasonably precise solution. The option maxeval limits the algorithm to a set amount of evaluations, preventing excessive computation. The tol_constraints_eq option sets a high precision for satisfying equality constraints, so the algorithm will try to ensure that constraints are met within a very small tolerance. Last, local_opts specifies a local optimizer to use within AUGLAG_EQ called NLOPT_LN_SBPLX, a derivative-free algorithm known for stability in constrained optimization.

local_opts <- list("algorithm" = "NLOPT_LN_SBPLX", "xtol_rel" = 1e-6)

opts <- list("algorithm" = "NLOPT_LN_AUGLAG_EQ",
             "xtol_rel" = 1e-6,
             "maxeval" = 10000,
             "tol_constraints_eq" = 1e-8,
             "local_opts" = local_opts)

Finally, we can run the optimization algorithm.

result <- nloptr(x0 = x0,
                 eval_f = objective_fn,
                 lb = lower_bounds,
                 ub = upper_bounds,
                 eval_g_eq = constraint_fn, # Equality constraint
                 opts = opts)


result
## 
## Call:
## 
## nloptr(x0 = x0, eval_f = objective_fn, lb = lower_bounds, ub = upper_bounds, 
##     eval_g_eq = constraint_fn, opts = opts)
## 
## 
## Minimization using NLopt version 2.7.1 
## 
## NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because 
## maxeval (above) was reached. )
## 
## Number of Iterations....: 10001 
## Termination conditions:  xtol_rel: 1e-06 maxeval: 10000 
## Number of inequality constraints:  0 
## Number of equality constraints:    1 
## Current value of objective function:  -249.997560653075 
## Current value of controls: 24.99981 7.99998

The optimal values are \(x^* = 25\) and \(y^* = 8\). The value of the objective function is \(250\) (remember to multiply the result by -1 since we are maximizing). Note that there is a bit of randomness in the algorithm so answers may vary slightly.

Utility Maximization

Consider the following maximization problem: \[ \text{max } xy \text{ subject to } 10x + 5y = 100\]

Solve with nloptr as before. We will keep the same x0, bounds, and opts, so we just need to change the objective function and constraint.

objective_fn <- function(x) {
  # x[1] represents x and x[2] represents y
  return(- (x[1] * x[2]) ) 
}

constraint_fn <- function(x) {
  return(10 * x[1] + 5 * x[2] - 100)
}

result <- nloptr(x0 = x0,
                 eval_f = objective_fn,
                 lb = lower_bounds,
                 ub = upper_bounds,
                 eval_g_eq = constraint_fn, # Equality constraint
                 opts = opts)


result
## 
## Call:
## 
## nloptr(x0 = x0, eval_f = objective_fn, lb = lower_bounds, ub = upper_bounds, 
##     eval_g_eq = constraint_fn, opts = opts)
## 
## 
## Minimization using NLopt version 2.7.1 
## 
## NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because 
## maxeval (above) was reached. )
## 
## Number of Iterations....: 10001 
## Termination conditions:  xtol_rel: 1e-06 maxeval: 10000 
## Number of inequality constraints:  0 
## Number of equality constraints:    1 
## Current value of objective function:  -49.8315822207714 
## Current value of controls: 4.709713 10.5806

Let’s graph the representation of the problem.

library(ggplot2)

U <- 50
x <- seq(0, 25, 0.1)
y <- U/x # since U = xy
const <- 20 - 2*x # solving the constraint for y

ggplot() + 
  geom_line(aes(x = x, y = y)) + 
  geom_line(aes(x = x, y = const)) + 
  geom_point(aes(x = 5,  y = 10)) +
  coord_fixed(xlim = c(0, 25),
              ylim = c(0, 25)) +
  theme_minimal()

Cost Minimization

Let’s suppose that the firm has to produce 90 units of output \(Q\), with \(w = 21\) and \(r = 3\). Assume that output is produced according to the production function \(Q(L,K) = L^{0.7}K^{0.3}\). Thus, the cost minimization problem is \[ \text{min } 21L + 3K \text{ subject to } 90 = L^{0.7}K^{0.3}\] We will need to change our lower and upper bounds for this problem, so let’s just set them to reasonably large bounds of -500, 500 for both x and y.

lower_bounds <- c(-500, -500)  
upper_bounds <- c(500, 500) 

objective_fn <- function(x) {
  # x[1] represents L and x[2] represents K
  return( 21*x[1] + 3*x[2]) 
}

constraint_fn <- function(x) {
  return(90 - x[1]^(0.7) * x[2]^(0.3))
}

result <- nloptr(x0 = x0,
                 eval_f = objective_fn,
                 lb = lower_bounds,
                 ub = upper_bounds,
                 eval_g_eq = constraint_fn, # Equality constraint
                 opts = opts)


result
## 
## Call:
## 
## nloptr(x0 = x0, eval_f = objective_fn, lb = lower_bounds, ub = upper_bounds, 
##     eval_g_eq = constraint_fn, opts = opts)
## 
## 
## Minimization using NLopt version 2.7.1 
## 
## NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because 
## maxeval (above) was reached. )
## 
## Number of Iterations....: 10000 
## Termination conditions:  xtol_rel: 1e-06 maxeval: 10000 
## Number of inequality constraints:  0 
## Number of equality constraints:    1 
## Current value of objective function:  1941.86237061024 
## Current value of controls: 64.72944 194.1814

Let’s also graph this cost minimization problem. Labor (L) will be on the x-axis and capital (K) will be on the y-axis.

L <- seq(0, 300, 1)
isoquant <- (90/L^(0.7))^(1/0.3)
isocost <- 1941/3 - 7*L

ggplot() + 
  geom_line(aes(x = L, y = isoquant)) + 
  geom_line(aes(x = L, y = isocost)) +
    geom_point(aes(x = 64.73,  y = 194.18)) +
  coord_fixed(xlim = c(0, 300),
              ylim = c(0, 500)) +
  labs(x = "L", y = "K") +
  theme_minimal()

Tidyverse Basics

Tidyverse is used for data wrangling. It allows you to manipulate data frames in a rather intuitive way. Tidyverse is a huge package so today we will be focusing on functions from the dplyr package (comes with tidyverse).

  • select(): subset columns
  • filter(): subset rows on conditions
  • arrange(): sort results
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • count(): count discrete values

First, install and load the tidyverse package. Also install the gapminder package; this contains the dataset we will use.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.2.1     ✔ dplyr   1.1.4
## ✔ readr   2.1.2     ✔ stringr 1.5.1
## ✔ purrr   1.0.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::cross()  masks pracma::cross()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(gapminder)

# look at the dataset
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Select and Filter

The select() function allows you to alter your data frame by selecting specific columns. For example, if we want to select the country name:

select(gapminder, country)
## # A tibble: 1,704 × 1
##    country    
##    <fct>      
##  1 Afghanistan
##  2 Afghanistan
##  3 Afghanistan
##  4 Afghanistan
##  5 Afghanistan
##  6 Afghanistan
##  7 Afghanistan
##  8 Afghanistan
##  9 Afghanistan
## 10 Afghanistan
## # ℹ 1,694 more rows

If we want to select multiple columns, we can use the c() function:

select(gapminder, c(country, continent))
## # A tibble: 1,704 × 2
##    country     continent
##    <fct>       <fct>    
##  1 Afghanistan Asia     
##  2 Afghanistan Asia     
##  3 Afghanistan Asia     
##  4 Afghanistan Asia     
##  5 Afghanistan Asia     
##  6 Afghanistan Asia     
##  7 Afghanistan Asia     
##  8 Afghanistan Asia     
##  9 Afghanistan Asia     
## 10 Afghanistan Asia     
## # ℹ 1,694 more rows

A quick aside: The tidyverse allows us to chain functions together with %>%. This is called a pipe and the pipe connects the LHS to the RHS (like reading a book). This makes it much easier to use multiple functions in one line of code, and makes it much easier to read your code!

For example, let’s rewrite the above line of code using pipes.

gapminder %>% select(country, continent)
## # A tibble: 1,704 × 2
##    country     continent
##    <fct>       <fct>    
##  1 Afghanistan Asia     
##  2 Afghanistan Asia     
##  3 Afghanistan Asia     
##  4 Afghanistan Asia     
##  5 Afghanistan Asia     
##  6 Afghanistan Asia     
##  7 Afghanistan Asia     
##  8 Afghanistan Asia     
##  9 Afghanistan Asia     
## 10 Afghanistan Asia     
## # ℹ 1,694 more rows

The filter() function extracts particular observations based on a condition. Let’s filter the data frame to contain only rows that correspond to the year 1957.

gapminder %>% filter(year == 1957)
## # A tibble: 142 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934      821.
##  2 Albania     Europe     1957    59.3  1476505     1942.
##  3 Algeria     Africa     1957    45.7 10270856     3014.
##  4 Angola      Africa     1957    32.0  4561361     3828.
##  5 Argentina   Americas   1957    64.4 19610538     6857.
##  6 Australia   Oceania    1957    70.3  9712569    10950.
##  7 Austria     Europe     1957    67.5  6965860     8843.
##  8 Bahrain     Asia       1957    53.8   138655    11636.
##  9 Bangladesh  Asia       1957    39.3 51365468      662.
## 10 Belgium     Europe     1957    69.2  8989111     9715.
## # ℹ 132 more rows

We can also make multiple conditions inside the filter. For an and condition we use & and for an or condition we use |. Try both. Filter the data frame to contain only rows for China in 2007. Then, filter for continent being Asia or Europe.

gapminder %>% filter(country == "China" & year == 2007)
## # A tibble: 1 × 6
##   country continent  year lifeExp        pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       2007    73.0 1318683096     4959.
gapminder %>% filter(continent == "Asia" | continent == "Europe")
## # A tibble: 756 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 746 more rows

Filter does not have to be an equality either! We can filter based on >, <, >=, and <=. Filter the data frame to contain all years from 2000 onward.

gapminder %>% filter(year >= 2000)
## # A tibble: 284 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       2002    42.1 25268405      727.
##  2 Afghanistan Asia       2007    43.8 31889923      975.
##  3 Albania     Europe     2002    75.7  3508512     4604.
##  4 Albania     Europe     2007    76.4  3600523     5937.
##  5 Algeria     Africa     2002    71.0 31287142     5288.
##  6 Algeria     Africa     2007    72.3 33333216     6223.
##  7 Angola      Africa     2002    41.0 10866106     2773.
##  8 Angola      Africa     2007    42.7 12420476     4797.
##  9 Argentina   Americas   2002    74.3 38331121     8798.
## 10 Argentina   Americas   2007    75.3 40301927    12779.
## # ℹ 274 more rows

Arrange

You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the data frame based on the lifeExp variable. By default, it sounds from lowest to greatest value (ascending order). To do descending order, wrap in the desc() function

# Ascending
gapminder %>% arrange(lifeExp)
## # A tibble: 1,704 × 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # ℹ 1,694 more rows
# Descending
gapminder %>% arrange(desc(lifeExp))
## # A tibble: 1,704 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # ℹ 1,694 more rows

Now that we know multiple functions, this is where the pipe operator really comes in handy. We can chain multiple operations together. For example, a filter into an arrange:

gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))
## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 China          Asia       1957    50.5 637408000      576.
##  2 India          Asia       1957    40.2 409000000      590.
##  3 United States  Americas   1957    69.5 171984000    14847.
##  4 Japan          Asia       1957    65.5  91563009     4318.
##  5 Indonesia      Asia       1957    39.9  90124000      859.
##  6 Germany        Europe     1957    69.1  71019069    10188.
##  7 Brazil         Americas   1957    53.3  65551171     2487.
##  8 United Kingdom Europe     1957    70.4  51430000    11283.
##  9 Bangladesh     Asia       1957    39.3  51365468      662.
## 10 Italy          Europe     1957    67.8  49182000     6249.
## # ℹ 132 more rows

Mutate

We can create new variables with mutate(). Suppose we want total GDP, instead of GDP per capita.

gapminder %>% mutate(gdp = gdpPercap * pop)
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ℹ 1,694 more rows

Next, let’s combine what we know so far. Find the countries with the highest life expectancy, in months, in the year 2007.

gapminder %>% filter(year == 2007) %>% 
  mutate(lifeExpMonths = lifeExp * 12) %>% 
  arrange(desc(lifeExpMonths))
## # A tibble: 142 × 7
##    country          continent  year lifeExp       pop gdpPercap lifeExpMonths
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>         <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.          991.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.          986.
##  3 Iceland          Europe     2007    81.8    301931    36181.          981.
##  4 Switzerland      Europe     2007    81.7   7554661    37506.          980.
##  5 Australia        Oceania    2007    81.2  20434176    34435.          975.
##  6 Spain            Europe     2007    80.9  40448191    28821.          971.
##  7 Sweden           Europe     2007    80.9   9031088    33860.          971.
##  8 Israel           Asia       2007    80.7   6426679    25523.          969.
##  9 France           Europe     2007    80.7  61083916    30470.          968.
## 10 Canada           Americas   2007    80.7  33390141    36319.          968.
## # ℹ 132 more rows

Group By and Summarize

This will group data together and can make summary statistics. You can use mean, sum, median, etc. Let’s find the median life expectancy in the data set.

gapminder %>% summarize(median(lifeExp))
## # A tibble: 1 × 1
##   `median(lifeExp)`
##               <dbl>
## 1              60.7

Note that this gives the median for the entire dataset. We can use group_by() to group the data first, and then summarize. For example, if we want the median life expectancy per year:

gapminder %>% group_by(year) %>% summarize(median(lifeExp))
## # A tibble: 12 × 2
##     year `median(lifeExp)`
##    <int>             <dbl>
##  1  1952              45.1
##  2  1957              48.4
##  3  1962              50.9
##  4  1967              53.8
##  5  1972              56.5
##  6  1977              59.7
##  7  1982              62.4
##  8  1987              65.8
##  9  1992              67.7
## 10  1997              69.4
## 11  2002              70.8
## 12  2007              71.9

You can also group by multiple items. If we want the median by year and by continent:

gapminder %>% group_by(year, continent) %>% summarize(median(lifeExp))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 3
## # Groups:   year [12]
##     year continent `median(lifeExp)`
##    <int> <fct>                 <dbl>
##  1  1952 Africa                 38.8
##  2  1952 Americas               54.7
##  3  1952 Asia                   44.9
##  4  1952 Europe                 65.9
##  5  1952 Oceania                69.3
##  6  1957 Africa                 40.6
##  7  1957 Americas               56.1
##  8  1957 Asia                   48.3
##  9  1957 Europe                 67.6
## 10  1957 Oceania                70.3
## # ℹ 50 more rows

You can also summarize multiple variables at once. Let’s summarize the mean gdp per capita as well.

gapminder %>% group_by(year, continent) %>% summarize(median(lifeExp), mean(gdpPercap))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 4
## # Groups:   year [12]
##     year continent `median(lifeExp)` `mean(gdpPercap)`
##    <int> <fct>                 <dbl>             <dbl>
##  1  1952 Africa                 38.8             1253.
##  2  1952 Americas               54.7             4079.
##  3  1952 Asia                   44.9             5195.
##  4  1952 Europe                 65.9             5661.
##  5  1952 Oceania                69.3            10298.
##  6  1957 Africa                 40.6             1385.
##  7  1957 Americas               56.1             4616.
##  8  1957 Asia                   48.3             5788.
##  9  1957 Europe                 67.6             6963.
## 10  1957 Oceania                70.3            11599.
## # ℹ 50 more rows

Graphing data with GGplot

We have already used ggplot to make several graphs earlier in this course. However, earlier we were graphing equations and functions. Here, we will learn how to visualize our data from the data frame.

Let’s use the year 2007 to make a scatterplot of the GDP per capita and life expectancy by country.

gapminder_2000 <- gapminder %>%
  filter(year == 2007)

ggplot(gapminder_2000, aes(x = gdpPercap, y = lifeExp)) + geom_point() 

Since some countries have a very high GDP per capita, we might want to change our x-axis to a log scale.

ggplot(gapminder_2000, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +   
  scale_x_log10()

Facet Wrapping

Faceting divides a graph into subplots based on one of its variables, such as the continent.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~continent)

Other Graphs

You can also visualize summary stats. Let’s make a line plot of the median life expectancy over time.

medianlife <- gapminder %>% group_by(year) %>% summarize(med = median(lifeExp))

ggplot(medianlife, aes(x = year, y = med)) + geom_line()

Next, let’s use a bar plot to show the gdp per capita in 2007 by continent.

gapminder_07 <- gapminder %>% filter(year == 2007)

ggplot(gapminder_07, aes(x = continent, y = gdpPercap)) + geom_col()

Similarly, we can make histograms of a variable using geom_histogram(). Here is a histogram of the GDP per capita in 2007.

ggplot(gapminder_07, aes(x = gdpPercap)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Box plots show the distribution as well.

ggplot(gapminder_07, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()

Integrals

Definite Integrals

We use the base function integrate() to compute definite integrals.

  • Example 1: \(\int_1^4 x^2 dx\)
f <- function(x){x^2}
integrate(f, lower = 1, upper = 4)
## 21 with absolute error < 2.3e-13
  • Example 2: \(\int_1^3 e^x - x^2 dx\)
f <- function(x){exp(x) - x^2}

integrate(f, lower = 1, upper = 3)
## 8.700588 with absolute error < 9.7e-14
  • Example 3: \(\int_1^{\infty} \frac{1}{x^2} dx\)
f <- function(x){1/x^2}

integrate(f, lower = 1, upper = Inf)
## 1 with absolute error < 1.1e-14

Indefinite Integrals

We can compute indefinite integrals with the antiD() function from the mosaicCalc package.

  • Example 1: \(\int 4x^3 dx\)
library(mosaicCalc)

antiD(4*x^3 ~ x)
## function (x, C = 0) 
## x^4 + C
  • Example 2: \(\int 6x^{\frac{1}{2}}dx\)
antiD(6*x^(1/2) ~ x)
## function (x, C = 0) 
## 4 * sqrt(x)^3 + C
  • Example 3: \(\int 4(3x-5)^3 dx\)
antiD(4*(3*x - 5)^3 ~ x)
## function (x, C = 0) 
## 27 * x^4 - 180 * x^3 + 450 * x^2 - 500 * x + C

Application: MC and TC

Suppose we want to find the change in total cost for a firm with \(MC = 0.027Q^2 - Q + 15\) from Q = 10 to Q = 20.

mc <- function(Q){0.027*Q^2 - Q + 15}

integrate(mc, 10, 20)
## 63 with absolute error < 7e-13

The change in total cost is 63.

Application: Consumer and Producer Suprlus

The conumer surplus is given by: \[ \int_0^{Q_0} P(Q)dQ - P_0Q_0\]

The producer suprlus is given by: \[P_0Q_0 - \int_0^{Q_0} MC(Q)dQ\]

Assume that the demand and supply equations are \(P(Q) = -2Q + 21\) and \(MC(Q) = Q + 3\). We then need to find \(P_0\) and \(Q_0\), the equilibrium price and quantity.

# create demand and supply functions

demand <- function(Q){-2*Q + 21}
supply <- function(Q){Q + 3}

# find equilibrium value for Q

Q0 <- uniroot(function(Q){demand(Q) - supply(Q)},
              c(0, 25))$root
Q0
## [1] 6
# plug into either demand or supply to get price
P0 <- Q0 + 3
P0
## [1] 9

Now we can integrate.

cs <- integrate(demand, 0, Q0)$value - P0*Q0
cs
## [1] 36
ps <- P0*Q0 - integrate(supply, 0, Q0)$value
ps
## [1] 18

Let’s make a graph illustrating the consumer and producer suprlus. First, put values into a data frame.

Q <- seq(0, 25, by = .1)
D <- -2*Q + 21
S <- Q + 3

df <- data.frame(Q, D, S, Q0, P0)

Then, we can make the graph.

ggplot(df) + geom_line(aes(Q, D)) + geom_line(aes(Q, S)) + 
  geom_ribbon(data = subset(df, 0 <= Q & Q <= 6), aes(Q, ymin = P0, ymax = D), fill = "yellow") +
  geom_ribbon(data = subset(df, 0 <= Q & Q <= 6), aes(Q, ymin = S, ymax = P0), fill = "green") +
  theme_minimal() +
  labs(x = "Quantity", y = "Price") +
  ylim(0,25) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)