Here you can find lab notes and resources for Econ 401/511 Fall 2024. These will be updated after our in-class lab sessions. These notes are not a substitute for attending lab but serve as an additional resource.
Much of the lab content will be drawn from R for Data Science and from “Introduction to Mathematics for Economics with R” by Massimiliano Porto.
Getting Started with R – A collection of resources for those getting started with R
TidyR Cheatsheet – A useful cheatsheet for data cleaning and tidy data using the tidyverse functions
ggplot2 Cheatsheet – A useful cheatsheet for various ggplot geoms
Tidy Tuesday Repo – A weekly data science project in R to test your tidyverse skills!
Effective file folder management is crucial for maintaining an organized and efficient digital workspace. Setting up organized folders will make your life significantly easier in the future!
R can be installed on Windows, Mac, and Linux systems. You can download R here by selecting “CRAN” from the sidebar, choosing a mirror close to your location, and choosing the operating system you have: Download R
You will also need to install RStudio. RStudio is an integrated development enviroment (IDE) that makes working with R much more user friendly. You can download RStudio here.
When you want to use R, you will open RStudio to do your coding. You do not need to open R itself.
In the bottom left of RStudio, you will see the console. The console executes code. You can type code and execute it using the console but the code is not saved when you close R Studio. It is recommended that you do not use the console in your regular workflow.
To save your work, you should code in an R Script. Open a script using the button that looks like a piece of paper with a green plus sign in the top left corner of R Studio.
R scripts will open here. You can code, comment, and run the code from your script. To run the code, either click the “Run” button or by pressing CMD+Enter (Mac) or Ctrl+Enter (Windows). R scripts will be saved to the folder you are currently working in.
In the top left corner, we have the workspace/environment panes.
The workspace/enviroment tab tells you what objects are stored in R (i.e. what is loaded or stored in memory). The History tab which shows previous commands you have run.
Last, on the bottom right, we have several tabs including:
When using R, you will almost certainly work with packages. A package is a collection of functions, data, and compiled code that is bundled together for easy sharing and reuse. Packages extend the functionality of R by providing tools for specific tasks, such as data manipulation, statistical modeling, or visualization. Users can install packages from repositories like CRAN (Comprehensive R Archive Network) or GitHub, and once installed, they can be loaded into an R session using the library()
function to access the package’s features.
For example, let’s install the ggplot2
package. There are two ways to do this. First, we can write a line of code:
#install.packages("ggplot2")
Alternatively, you can install packages in RStudio by clicking on the packages tab
and then typing the name of the package in the pop-up window.
Once a package is installed, you need to load the package using library()
. You will need to load the package you want to use anytime you start a new R session.
library(ggplot2)
Sometimes, it can be burdensome to use the library function when you have many packages to load. The pacman
package helps with this by allowing you to load multiple packages in two lines of code! Install and load the pacman
package, then use p_load()
to load packages–you can load multiple at a time by separating the names with a comma.
library(pacman)
p_load(ggplot2)
R is an object-oriented programming language, meaning that R organizes its code and data into objects. These objects can represent anything from simple data structures like numbers and strings to more complex things like models, plots, or datasets. The assignment operator <- is used to assign values to objects.
x <- 5 #assign the value of 5 to a variable called x
# notice that this x is now in your global environment
x # print x
## [1] 5
y = 10
y
## [1] 10
You can combine objects together as well which lets us do some basic math operations.
# create a new object called z that is equal to x*y
z <- x * y
#print z
z
## [1] 50
If you do not create an object, R will not save it to the global environment. If an object is not in the global environment and you try to reference it later, R will not know what you are referring to.
In R, there are different types of objects. We can check the type of object with the class()
function.
class(x)
## [1] "numeric"
There are other classes too!
# create a string
my_string <- "Economics is my favorite subject!"
class(my_string)
## [1] "character"
# logical class
class(2>3)
## [1] "logical"
What happens if we have a vector of characters and numbers?
char_vector <- c(1:5, "banana", "apple")
char_vector
## [1] "1" "2" "3" "4" "5" "banana" "apple"
#cant use mathematical operations on characters
# why?? because the entire vector is a character class!
class(char_vector)
## [1] "character"
We can coerce objects into other classes. For example, let’s say we have an objects below:
a <- 5
b <- "2"
If we try to combine these objects into a new one called ab
that is equal to a*b, we will get an error. Why? Because a
is a numeric object and b
is a character object.
class(a)
## [1] "numeric"
class(b)
## [1] "character"
We can coerce b
into numeric using the as.numeric()
function and then we can perform the operation.
b <- as.numeric(b)
ab <- a*b
ab
## [1] 10
Other functions to coerce a class change are: - as.integer()
- as.character()
- as.data.frame()
- etc.
Note that R does not know how to convert a string of letters into numeric and will give us a warning message. For example,
my_string <- as.numeric(my_string)
## Warning: NAs introduced by coercion
The c()
function is used to concatenate items separated by a comma.
d <- c(1, 2, 3, 4, 5)
d
## [1] 1 2 3 4 5
e <- c("a", "b", "c", "d", "e")
e
## [1] "a" "b" "c" "d" "e"
You can also concatenate previously generated objects. Note that the c()
function cannot store items with different classes and R will coerce the different types to one type.
dab <- c(d, a, b)
dab
## [1] 1 2 3 4 5 5 2
de <- c(d, e)
de
## [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"
The list()
function can be used to store objects in a single object keeping their original class characterisitics.
de_list <- list(d,e)
de_list
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c" "d" "e"
The square bracket [ ]
allows us to subset, extract, or replace a part of an object. For example, we can select the first entry in our object e
e[1]
## [1] "a"
Running e
again will show that no modification has been made to the object. We can use the square bracket to replace parts of the object.
e[1] <- "m"
e
## [1] "m" "b" "c" "d" "e"
We can also combine the c()
operator and the square bracket to subset more than one value.
e[c(1,3)]
## [1] "m" "c"
Now, let’s apply the square bracket operator to a two-dimensional object: a data frame.
df <- data.frame(numbers = d,
letters = e)
df
## numbers letters
## 1 1 m
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
The structure of our object df
is rows per columns. We can use the square bracket operator but need to specify the row and the column.
df[4, 2]
## [1] "d"
This gives us the item in row 4 and column 2. If we want to select all of the rows for the first column, we can leave the row blank. Likewise, we can do the same for columns in row 1.
df[, 1]
## [1] 1 2 3 4 5
df[1, ]
## numbers letters
## 1 1 m
Last, we can replace elements in the data frame with the square bracket operator.
df[1,1] <- 10
df
## numbers letters
## 1 10 m
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
Coding style is the punctuation and grammer of the coding world. Making sure that your code is formatting in a readable, standard format is helpful for yourself and others to understand the code. We will follow guidelines from the tidyverse style guide
Spaces: Put spaces on either side of mathematical operators (i.e. +, -, ==, <, …), and around the assignment operator (<-). The exception to this is the ^ symbol.
# Strive for
z <- (a + b)^2 / ab
# Avoid
z<-( a + b ) ^ 2/ab
Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma.
# Strive for
mean(x, na.rm = TRUE)
## [1] 5
# Avoid
mean (x ,na.rm=TRUE)
## [1] 5
Adding extra spaces is fine if it helps with alignment. For example:
example_data_frame <-
data.frame(
variable1 = c(1:10),
variable_name2 = c(2:11),
var_name = c(3:12)
)
Naming Conventions: Object names must start with a letter and can only contain letters, numbers, _
, and .
. The names should be descriptive–snake case is the recommended naming convention (separating lowercase words with _
).
really_long_variable_name <- 1
Commenting: You can comment your code with #
. It is strongly recommended to leave comments in your code so that others, and future you, can keep track of your thought process.
# Good code is well-commented code!!
You can also create section comments that will be collapsable. This is incredibly helpful when you have a really long R script! Any comment line which includes at least four trailing dashes (-) will create a section. Sections will appear in the table of contents in your R script (above the Console) and allows for quick navigation.
# This is section 1 ----
# ---- This is section 2 ----
The basic setup of making a ggplot requires three things: the data, the aesthetic mapping, and a geom. The aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms, like which variables are on the axes, the variable to color or fill by, etc. The geoms tell R how to draw the data like points, lines, columns, etc.
In general, we can make a ggplot by typing the following: ggplot(data = ) +
The way ggplot works is by adding layers. We can add a new layer with the + sign. Let’s build a ggplot step by step. First, start with ggplot() and tell R what data we are using.
df <- data.frame(xpoints = seq(-10, 10, 1), ypoints = seq(-10, 10, 1))
ggplot(data = df)
Why did this make a blank graph? Well, we haven’t given R the aesthetic mapping yet so it doesn’t know what to put on top of the base layer. Let’s add the x and y variables.
ggplot(data = df, mapping = aes(x = xpoints, y = ypoints)) # note you can put the mapping here or in the geom
Now we have a graph with axes and gridlines but no information on the graph. To get data on the graph, we need to tell R how we want to draw the data with a geom. To make a scatterplot, we use geom_point().
ggplot(data = df, mapping = aes(x = xpoints, y = ypoints)) + geom_point()
Ggplot allows even more layering. We can add a line through these points with geom_line()
.
ggplot(data = df, mapping = aes(x = xpoints, y = ypoints)) + geom_point() + geom_line()
First, we need some data to start with. Let’s generate an x
variable from -10 to 10. Then, graph the equation \(y = 4 + 3x\).
df <- data.frame(x = seq(-10, 10, 1))
ggplot(df, aes(x = x)) +
stat_function(fun = function(x) 4 + 3*x, color = "blue")
Now, let’s make the graph look a bit nicer and add the functions \(y = 3x\) and \(y = -4 + 3x\).
ggplot(df, aes(x = x)) +
stat_function(fun = function(x) 4 + 3*x, color = "blue") +
stat_function(fun = function(x) 3*x, color = "green") +
stat_function(fun = function(x) -4 + 3*x, color = "red") +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
theme_minimal() +
ggtitle("Graph of Linear Functions")
Let’s start by graphing a firm’s cost functions. We can decompose a firm’s total cost into fixed cost (FC) and variable cost (VC). Assume the firm has a fixed cost of $5000 and a variable cost of $125 per output. Let’s graph the total cost.
x <- seq(0, 50, 1)
FC <- 5000
VC <- 125
TC <- FC + VC*x
df <- data.frame(output = x,
total_cost = TC)
head(df)
## output total_cost
## 1 0 5000
## 2 1 5125
## 3 2 5250
## 4 3 5375
## 5 4 5500
## 6 5 5625
ggplot(df, aes(x = output, y = total_cost)) +
geom_line(color = "blue") +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
labs(x = "Output",
y = "Total Cost") +
theme_minimal()
Next, let’s graph profit, revenue, and cost equations to find the break-even profit. Assume the firm sells the product for $250 each and has the total cost function from above.
p <- 250
R <- p*x
pi <- R - TC
df <- cbind(df, revenue = R, profit = pi)
head(df)
## output total_cost revenue profit
## 1 0 5000 0 -5000
## 2 1 5125 250 -4875
## 3 2 5250 500 -4750
## 4 3 5375 750 -4625
## 5 4 5500 1000 -4500
## 6 5 5625 1250 -4375
ggplot(df) +
geom_line(aes(x = output, y = total_cost, color = "total_cost")) +
geom_line(aes(x = output, y = revenue, color = "revenue")) +
geom_line(aes(x = output, y = profit, color = "profit")) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
labs(x = "Output", y = "") +
theme_minimal() +
theme(legend.title = element_blank())
The breakeven profit is where the firm sells 40 units of output.
We can use similar code to graph other types of functions as well. For example, the stat_function
can be used to graph quadratic functions:
df <- data.frame(x = seq(-10, 10, by = 0.1))
ggplot(df, aes(x = x)) +
stat_function(fun = function(x) x^2, color = "blue") +
stat_function(fun = function(x) x^2 + 5, color = "orange") +
stat_function(fun = function(x) 4*x^2, color = "red") +
stat_function(fun = function(x) 4*x^2 + 10, color = "purple") +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
theme_minimal() +
ggtitle("Graph of Quadratic Functions")
Suppose we have the cost function \(C(x) = 0.01x^2 + x + 10\). Let’s plot the total cost, fixed cost, variable cost, and the average cost.
x <- seq(0, 50, 1)
FC <- 10
VC <- 0.01*x^2 + x
TC <- FC + VC
df <- data.frame(output = x,
total_cost = TC,
fixed_cost = FC,
variable_cost = VC
)
head(df)
## output total_cost fixed_cost variable_cost
## 1 0 10.00 10 0.00
## 2 1 11.01 10 1.01
## 3 2 12.04 10 2.04
## 4 3 13.09 10 3.09
## 5 4 14.16 10 4.16
## 6 5 15.25 10 5.25
Next, we need to compute the average cost and add it to our data frame. Recall, average cost is the total cost divided by output.
df$average_cost <- df$total_cost / df$output
head(df)
## output total_cost fixed_cost variable_cost average_cost
## 1 0 10.00 10 0.00 Inf
## 2 1 11.01 10 1.01 11.010000
## 3 2 12.04 10 2.04 6.020000
## 4 3 13.09 10 3.09 4.363333
## 5 4 14.16 10 4.16 3.540000
## 6 5 15.25 10 5.25 3.050000
Oh no! the first value of average cost is undefined… we should remove that.
df <- df[-1,]
head(df)
## output total_cost fixed_cost variable_cost average_cost
## 2 1 11.01 10 1.01 11.010000
## 3 2 12.04 10 2.04 6.020000
## 4 3 13.09 10 3.09 4.363333
## 5 4 14.16 10 4.16 3.540000
## 6 5 15.25 10 5.25 3.050000
## 7 6 16.36 10 6.36 2.726667
Much better! Let’s also do one more cool trick to make graphing a bit easier for us. What if instead of specifying separate groups and writing functions three or more times (like we did before!), we could just do one line of code in our ggplot? To do this, we need to reshape the data from wide to long.
df_long <- reshape(
df,
varying = c("total_cost", "fixed_cost", "variable_cost", "average_cost"),
v.names = "cost", #new column for the cost value
timevar = "cost_type",#new column for the type of cost
times = c("total_cost", "fixed_cost", "variable_cost", "average_cost"), #names of the types of cost
idvar = "output",
direction = "long"
)
head(df_long)
## output cost_type cost
## 1.total_cost 1 total_cost 11.01
## 2.total_cost 2 total_cost 12.04
## 3.total_cost 3 total_cost 13.09
## 4.total_cost 4 total_cost 14.16
## 5.total_cost 5 total_cost 15.25
## 6.total_cost 6 total_cost 16.36
tail(df_long)
## output cost_type cost
## 45.average_cost 45 average_cost 1.672222
## 46.average_cost 46 average_cost 1.677391
## 47.average_cost 47 average_cost 1.682766
## 48.average_cost 48 average_cost 1.688333
## 49.average_cost 49 average_cost 1.694082
## 50.average_cost 50 average_cost 1.700000
Notice we have a long-form data frame with a column that specifies the type of cost. This was not necessary to do but will make graphing MUCH easier!
ggplot(df_long, aes(x = output, y = cost, group = cost_type, color = cost_type)) +
geom_line() +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
theme_minimal() +
ggtitle("Firm Cost Functions")
Before we talk about functions in R, we should first make sure we understand what a function is in general. You all are probably used to thinking of something like:
`f(x)= x^2`
But what really does this even mean? Why is this a useful tool? - It takes some input (a number), and gives us an output (in this case, the squared number, which is indeed also a number). We can represent the function \(f\) above as: Number -> ADifferentNumber
where that arrow is something we call a mapping. A mapping links one object to another. This doesn’t have to be two numbers, it could be a word and a dataframe, or two words. For instance, I could come up with a function called color labeler
and then pass it some object, which it will map to a color.
color_labeler(banana) = yellow
Where color_labeler is the name of the function, that maps fruits(inputs) to colors(outputs)
Knowing all of this, we need to go back to lecture 1 to remind ourselves about R-facts before we make a function.
name
, and everything is an object
inputs
/arguments
and outputs
With these in mind, let’s write a function that will take some x, and spit out x^2 + 10. That is, let’s write the code for f(x)=x^2+10.
Everything in R needs a name That includes our function.
We do this by starting with a name, setting it equal to a ‘function()’ function.
In general, this function looks like function([some_set_of_arguments]){your function code}
. function()
is a special operator that takes any arguments you want in the parentheses, and then lets you manipulate them in any way you see fit. Think of the parentheses here as the toys you’re giving your computer to play with inthe sandbox.
squarePlusten = function(x){
#tell squarePlusten what to do. x is an input here, we can tell our function to transform our variable into
#something else.
x_squaredten = x^2+10
#Now, in order to make use of this value, we need our function to spit something out.
#We do this with another special function, 'return()'.
return(x_squaredten)
}
notice however: x_squaredten isn’t equal to anything now that our function has been run. That’s because x_squaredten is only defined in the context of your written function.
x_squaredten
Now, we can use the function by calling the function name and giving it an input:
#check for 10. should be 110
squarePlusten(10)
## [1] 110
A couple things to note:
Remember, functions don’t just change numbers to numbers. A more accurate way to think of functions is as a map from one object to another. The starting point is the argument, and the end point is the output. The functions take an object as an input.
Now, let’s write a function to graph tangent lines (we will use this later). We will need the packages tidyr
and ggplot2
, so be sure to install and load those. The function rearranges and plots data. We use pivot_longer()
to reshape all the data except column x. The %>%
operator is from tidyr that pipes an object forward into a function.
library(tidyr)
library(ggplot2)
tangent_line <- function(df_fn, XLIM = NULL, YLIM = NULL, XLAB = "x", YLAB = "y"){
df1 <- df_fn %>% pivot_longer(!x)
g <- ggplot() +
geom_line(data = df1, aes(x = x, y = value, group = name, color = name)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
coord_cartesian(xlim = XLIM, ylim = YLIM) +
labs(x = XLAB, y = YLAB) +
theme_minimal()
return(g)
}
We can compute derivatives with R by using the Deriv()
function in the Deriv
package. Be sure to install and load the package.
library(Deriv)
Let’s find the derivative of the function \(y = x^2\). First, we generate an expression containing the derivative we want to compute, stored in the object y
.
y <- expression(x^2)
This expression will be the the first entry of the Deriv()
function; the second entry is a character vector that tells the function which variable to take the derivative with respect to.
dydx <- Deriv(y, "x")
dydx
## expression(2 * x)
Let’s now compute the derivative of \(y = 2x^2 + 3x^3\).
y <- expression(2*x^2 + 3*x^3)
dydx <- Deriv(y, "x")
dydx
## expression(x * (4 + 9 * x))
Let’s try a log and an exponential function.
y <- expression(log(x))
dydx <- Deriv(y, "x")
dydx
## expression(1/x)
y <- expression(exp(x))
dydx <- Deriv(y, "x")
dydx
## expression(exp(x))
We can compute the second derivative by adding the argument nderiv
.
y <- expression(x^2)
d2ydx2 <- Deriv(y, "x", nderiv = 2)
d2ydx2
## expression(2)
Let’s use our tangent_line()
function from earlier. Suppose we have the function \(y = x^3 - 4x^2 + x + 6\) and want to graph the tangent lines at the points (5, 36), and (-2, -20). First, find the derivative and plug in the values of x.
y <- expression(x^3 - 4*x^2 + x + 6)
dydx <- Deriv(y, "x")
dydx
## expression(1 + x * (3 * x - 8))
The derivative is equal to \(3x^2 - 8x + 1\). At (5, 36) the equation of the tangent line is \(y = 36x - 144\) and at (-2, -20) the tangent line is \(y = 29x + 38\).
x <- seq(-10, 10, 0.1)
y <- x^3 - 4*x^2 + x + 6
tg1 <- 36*x - 144
tg2 <- 29*x + 38
df <- data.frame(x, y, tg1, tg2)
tangent_line(df, XLIM = c(-10, 10), YLIM = c(-40, 40))
Given the following total cost function \(TC = VC_3 Q^3 - VC_2 Q^2 + VC_1 Q + FC\), the marginal cost is the derivative.
TC <- expression(0.009*Q^3 - 0.5*Q^2 + 15*Q + 35)
MC <- Deriv(TC, "Q")
MC
## expression(15 + Q * (0.027 * Q - 1))
To find the marginal cost at specific values of Q, we need to specify a sequence of output values and use the eval
function.
Q <- seq(0, 50 , by = 1)
MC <- eval(parse(text = MC))
head(MC)
## [1] 15.000 14.027 13.108 12.243 11.432 10.675
Now, let’s code functions to compute the total cost, marignal cost, and the y-intercept of a linear function.
# total cost function
total_cost <- function(Q, VC1, VC2, VC3, FC){
TC <- VC3*Q^3 + VC2*Q^2 + VC1*Q + FC
return(TC)
}
# marginal cost function
marginal_cost <- function(Q, VC1, VC2, VC3, FC){
TC <- expression(VC3*Q^3 + VC2*Q^2 + VC1*Q + FC)
MC <- Deriv(TC, "Q")
return(eval(parse(text = MC)))
}
# y-intercept function
y_intercept <- function(TC, MC, Q){
a <- TC - MC*Q
return(a)
}
Now, we can find the tangent lines to the cost function. Let’s set \(Q = 10\) and \(Q = 45\).
# total costs
TC10 <- total_cost(10, 15, -0.5, 0.009, 35)
TC10
## [1] 144
TC45 <- total_cost(45, 15, -0.5, 0.009, 35)
TC45
## [1] 517.625
# marginal costs
MC10 <- marginal_cost(10, 15, -0.5, 0.009, 35)
MC10
## [1] 7.7
MC45 <- marginal_cost(45, 15, -0.5, 0.009, 35)
MC45
## [1] 24.675
# y-intercepts
a10 <- y_intercept(TC10, MC10, 10)
a45 <- y_intercept(TC45, MC45, 45)
# tangent lines
tg10 <- a10 + MC10*Q
tg45 <- a45 + MC45*Q
We want all of the total cost and marginal cost values for our sequence of output. We can use the functions to do this!
Q <- seq(0, 50 , by = 1)
tc <- total_cost(Q, 15, -0.5, 0.009, 35)
mc <- marginal_cost(Q, 15, -0.5, 0.009, 35)
df <- data.frame(x = Q,
total_cost = tc,
maringal_cost = mc,
tangent10 = tg10,
tangent45 = tg45)
head(df)
## x total_cost maringal_cost tangent10 tangent45
## 1 0 35.000 15.000 67.0 -592.750
## 2 1 49.509 14.027 74.7 -568.075
## 3 2 63.072 13.108 82.4 -543.400
## 4 3 75.743 12.243 90.1 -518.725
## 5 4 87.576 11.432 97.8 -494.050
## 6 5 98.625 10.675 105.5 -469.375
tangent_line(df, XLIM = NULL, YLIM = c(0, 600), XLAB = "Output", YLAB = "Cost")
Let’s add the average cost for this firm. Adding the average cost to the data frame is the same produre as we have seen previously.
df2 <- data.frame(x = Q,
marginal_cost = mc)
average_cost <- tc/Q
df2 <- cbind(df2, average_cost)
df2 <- df2[-1,]
head(df2)
## x marginal_cost average_cost
## 2 1 14.027 49.50900
## 3 2 13.108 31.53600
## 4 3 12.243 25.24767
## 5 4 11.432 21.89400
## 6 5 10.675 19.72500
## 7 6 9.972 18.15733
Now, let’s make the plot.
names(df2) <- c("output", "MC", "AC")
ggplot(df2) +
geom_line(aes(x = output, y = AC), color = "blue") +
geom_line(aes(x = output, y = MC), color = "red") +
theme_minimal()
In R, we build a matrix using the matrix()
function. The first entry is the data, nrow
is the number of rows, and ncol
is the number of columns. the argument byrow
can be set to TRUE, which fills the matrix by columns, or FALSE (the default), which fills the matrix by rows.
For example,
A <- matrix(c(1, 2, 3,
4, 5, 6),
nrow = 2,
ncol = 3,
byrow = FALSE)
A
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
gives the matrix \[ \mathbf{A} = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \\ \end{bmatrix} \] and
A <- matrix(c(1, 2, 3,
4, 5, 6),
nrow = 2,
ncol = 3,
byrow = TRUE)
A
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
gives the matrix \[ \mathbf{A} = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \\ \end{bmatrix} \]
Recall from lecture, addition of matrices requires the matrices to have the same size. We define two matrices: \[ \mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ \end{bmatrix} \] \[ \mathbf{B} = \begin{bmatrix} -2 & 3 \\ 5 & -1 \\ 2 & 2 \\ \end{bmatrix} \] To add these in R:
A <- matrix(c(1, 2,
3, 4,
5, 6),
nrow = 3,
ncol = 2,
byrow = T)
B <- matrix(c(-2, 3,
5, -1,
2, 2),
nrow = 3,
ncol = 2,
byrow = T)
A + B
## [,1] [,2]
## [1,] -1 5
## [2,] 8 3
## [3,] 7 8
For scalar multiplication, we can specify a constant and multiply it with a matrix object using *
.
6*A
## [,1] [,2]
## [1,] 6 12
## [2,] 18 24
## [3,] 30 36
Recall, from lecture, that to multiply two matrices the column dimension of the first matrix must be equal to the row dimension of the second matrix. In R, we compute matrix multiplication with %*%
For example, suppose we have \[ \mathbf{A} = \begin{bmatrix} 2 & 5\\ \end{bmatrix} \] and \[ \mathbf{B} = \begin{bmatrix} -2 & 3 &6 \\ 9 & 0 & 2 \\ \end{bmatrix} \] Can we multiply these matrices? Yes, A is a a 1x2 matrix and B is a 2x3. We should get a 1x3 matrix after multiplying.
A <- matrix(c(2, 5),
nrow = 1,
ncol = 2,
byrow = T)
B <- matrix(c(-2, 3, 6,
9, 0, 2),
nrow = 2,
ncol = 3,
byrow = T)
A %*% B
## [,1] [,2] [,3]
## [1,] 41 6 22
Now suppose we have the matrices \[ \mathbf{A} = \begin{bmatrix} -2 & 3 & 4 & 6\\ 4 & -4 & 3 & 0 \\ 1 & 8 & 5 & 3\\ \end{bmatrix} \] \[ \mathbf{B} = \begin{bmatrix} 1 & -2 \\ 5 & 3 \\ 5 & 4 \\ 7 & 8 \\ \end{bmatrix} \] We can multiply these in R:
A <- matrix(c(-2, 3, 4, 6,
4, -4, 3, 0,
1, 8, 5, 3),
nrow = 3,
ncol = 4,
byrow = T)
B <- matrix(c(-1, 2,
5, 3,
5, 4,
7, 8),
nrow = 4,
ncol = 2,
byrow = T)
A %*% B
## [,1] [,2]
## [1,] 79 69
## [2,] -9 8
## [3,] 85 70
Suppose we have the system of equations: \[ x + y = 4 \]
\[ 2x + y = 7 \] Recall, we can use matrices to write systems of linear equations in \(Ax = d\) format. For this system, \[ \begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ \end{bmatrix} = \begin{bmatrix} 4 \\ 7 \\ \end{bmatrix} \]
To solve the system in R, we use the solve()
function.
A <- matrix(c(1, 1,
2, 1),
nrow = 2,
ncol = 2,
byrow = T)
d <- c(4, 7)
solve(A, d)
## [1] 3 1
For this system \(x^* = 3\) and \(y^* = 1\).
Next, let’s solve the system \[ x_1 + 2x_2 + 3x_3 + 5x_4 = 5 \]
\[ 2x_1 + 3x_2 + 5x_3 + 9x_4 = 4 \]
\[ 3x_1 + 4x_2 + 7x_3 + x_4 = 0 \] \[ 7x_1 + 6x_2 + 5x_3 + 4x_4 = 3 \]
In R, we solve:
A <- matrix(c(1, 2, 3, 5,
2, 3, 5, 9,
3, 4, 7, 1,
7, 6, 5, 4),
nrow = 4,
ncol = 4,
byrow = T)
d <- c(5, 4, 0, 3)
solve(A, d)
## [1] -5.03125 8.46875 -2.71875 0.25000
Determinants can be found using the det()
function. Recall, matrices must be square.
For a 2x2 matrix:
A <- matrix(c(3, 2,
2, 6),
nrow = 2,
ncol = 2,
byrow = T)
det(A)
## [1] 14
For a 3x3 matrix:
A <- matrix(c(2, 4, 3,
-1, 3, 0,
0, 2, 1),
nrow = 3,
ncol = 3,
byrow = T)
det(A)
## [1] 4
For a 4x4 matrix:
A <- matrix(c(-2, 3, 4, 1,
4, -4, 3, 0,
1, 2, 5, 3,
-1, -2, 5, 3),
nrow = 4,
ncol = 4,
byrow = T)
det(A)
## [1] 294
Suppose we want to solve the following system using Cramer’s Rule: \[ 2x + y - z = 4\]
\[x - 2y + z = 1\]
\[3x - y - 2z = 3\]
In matrix form, this is: \[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & -2 & 1 \\ 3 & -1 & -2 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 3 \\ \end{bmatrix} \]
There is no built-in function in R for Cramer’s Rule, so we will have to do the steps ourselves. First, create the \(A\) matrix and the vector of constants \(d\).
A <- matrix(c(2, 1, -1,
1, -2, 1,
3, -1, -2),
nrow = 3,
ncol = 3,
byrow = T)
d <- c(4, 1, 3)
Next, we will create new matrices for the \(A\) matrix with the \(d\) column inserted.
Ax <- A
Ax[,1] <- d
Ay <- A
Ay[,2] <- d
Az <- A
Az[,3] <- d
Now we can use these matrices to solve with Cramer’s Rule.
x <- det(Ax)/det(A)
x
## [1] 2
y <- det(Ay)/det(A)
y
## [1] 1
z <- det(Az)/det(A)
z
## [1] 1
Of course, you can always write your own function to do Cramer’s Rule!
cramers_rule <- function(A, d) {
n <- nrow(A)
det_A <- det(A)
if (det_A == 0) {
stop("The determinant of A is 0, so the system has no unique solution.")
}
# Initialize solution vector
X <- numeric(n)
# Loop through each variable
for (i in 1:n) {
A_i <- A # Copy of matrix A
A_i[, i] <- d # Replace i-th column with vector d
X[i] <- det(A_i) / det_A
}
return(X)
}
Let’s use the function:
cramers_rule(A, d)
## [1] 2 1 1
Super fast and easy! :D
Suppose that the demand functions for good 1, \(Q_1\), and good 2, \(Q_2\), are the following: \[ Q_1 = 4P_1^{3/2}P_2^{1/2}Y\]
\[ Q_2 = 2P_1^{1/2}P_2^{1/2}Y\] Suppose that \(P_1^* = 4\), \(P_2^* = 6\) and \(Y^* = 2000\).
First, write the function:
f <- function(x){
c(4*x[1]^(3/2)*x[2]^(1/2)*x[3],
2*x[1]^(1/2)*x[2]^(1/2)*x[3])
}
Next, we can find the Jacobian matrix (plugging in the values for \(P_1^* = 4\), \(P_2^* = 6\) and \(Y^* = 2000\)) by using the jacobian()
function from the pracma
package. Make sure to install and load the package before using.
library(pracma)
J <- jacobian(f, c(4, 6, 2000))
J
## [,1] [,2] [,3]
## [1,] 58787.75 13063.945 78.383672
## [2,] 2449.49 1632.993 9.797959
[Example 1:] Find the minimum of the function: \(z = x^3 + 8y^3 - 12xy\)
First, write your function. Then, use the optim()
function to find the critical values. Note that the optim()
function will minimize a function by default. The function also requires initial values–these are given as 0,0 in the function.
f <- function(x){
x[1]^3 + 8*x[2]^3 - 12*x[1]*x[2]
}
optim(c(0,0), f)
## $par
## [1] 2 1
##
## $value
## [1] -8
##
## $counts
## function gradient
## 143 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
par
returns the best set of parameters (i.e., the critical values) and value
returns the value of the function at the critical values.
[Example 2:] Find the critical values of the following function: \(z = -2x^2 - y^2 + 2xy + 4x\)
To find the maximums, use control=list(fnscale=-1)
as an argument in the optim()
function.
f <- function(x){
-2*x[1]^2 - x[2]^2 + 2*x[1]*x[2] + 4*x[1]
}
optim(c(0,0), f, control=list(fnscale=-1))
## $par
## [1] 2 2
##
## $value
## [1] 4
##
## $counts
## function gradient
## 133 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Suppose that for our problem, we have (inverse) demand equations for two products and a cost function. \[ P_1 = 38 - Q_1 - 2Q_2\] \[ P_2 = 90 - 2Q_1 - 4Q_2\] \[ C = 3Q_1^2 - 2Q_1Q_2 + 2Q_2^2 + 100\]
Consequently then, the profit function is: \[\pi = -4Q_1^2 - 6Q_2^2 - 2Q_1Q_2 + 38Q_1 + 90Q_2 - 100 \]
profit <- function(Q){
-4*Q[1]^2 - 6*Q[2]^2 - 2*Q[1]*Q[2] + 38*Q[1] + 90*Q[2] - 100
}
optim(c(0,0), profit, control = list(fnscale = -1))
## $par
## [1] 2.999819 6.999740
##
## $value
## [1] 272
##
## $counts
## function gradient
## 83 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Let’s optimize the function \(z = xy + 2x\) subject to \(2x + 5y = 90\) using the nloptr
package.
# Load the nloptr package
library(nloptr)
First, we need to define the objective function and the constraint. Note that just like the optim()
function, nloptr
minimizes by default, so if we want to maximize, we need to negate the objective function.
objective_fn <- function(x) {
# x[1] represents x and x[2] represents y
return(- (x[1] * x[2] + 2 * x[1]))
}
constraint_fn <- function(x) {
# Write the constraint as 2x + 5y - 90
return(2 * x[1] + 5 * x[2] - 90)
}
Then, we need to set some initial values. x0
is the initial guess or the “starting point”. We also need to set lower and upper bounds for x and y. Here, x and y will be non-negative so we set the lower bounds at 0,0.
x0 <- c(1, 1)
lower_bounds <- c(0, 0)
upper_bounds <- c(100, 100)
Next, we define the options for the nloptr
function. For optimization problems with equality constraints, local_opts
is required. The main algorithm, specified by opts
, is NLOPT_LN_AUGLAG_EQ
, an augmented Lagrangian method designed to handle equality constraints in optimization. The option xtol_rel
sets a relative tolerance for the solution, meaning the algorithm will stop when the improvement between iterations falls below this threshold, ensuring a reasonably precise solution. The option maxeval
limits the algorithm to a set amount of evaluations, preventing excessive computation. The tol_constraints_eq
option sets a high precision for satisfying equality constraints, so the algorithm will try to ensure that constraints are met within a very small tolerance. Last, local_opts specifies a local optimizer to use within AUGLAG_EQ
called NLOPT_LN_SBPLX
, a derivative-free algorithm known for stability in constrained optimization.
local_opts <- list("algorithm" = "NLOPT_LN_SBPLX", "xtol_rel" = 1e-6)
opts <- list("algorithm" = "NLOPT_LN_AUGLAG_EQ",
"xtol_rel" = 1e-6,
"maxeval" = 10000,
"tol_constraints_eq" = 1e-8,
"local_opts" = local_opts)
Finally, we can run the optimization algorithm.
result <- nloptr(x0 = x0,
eval_f = objective_fn,
lb = lower_bounds,
ub = upper_bounds,
eval_g_eq = constraint_fn, # Equality constraint
opts = opts)
result
##
## Call:
##
## nloptr(x0 = x0, eval_f = objective_fn, lb = lower_bounds, ub = upper_bounds,
## eval_g_eq = constraint_fn, opts = opts)
##
##
## Minimization using NLopt version 2.7.1
##
## NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because
## maxeval (above) was reached. )
##
## Number of Iterations....: 10001
## Termination conditions: xtol_rel: 1e-06 maxeval: 10000
## Number of inequality constraints: 0
## Number of equality constraints: 1
## Current value of objective function: -249.997560653075
## Current value of controls: 24.99981 7.99998
The optimal values are \(x^* = 25\) and \(y^* = 8\). The value of the objective function is \(250\) (remember to multiply the result by -1 since we are maximizing). Note that there is a bit of randomness in the algorithm so answers may vary slightly.
Consider the following maximization problem: \[ \text{max } xy \text{ subject to } 10x + 5y = 100\]
Solve with nloptr
as before. We will keep the same x0, bounds, and opts, so we just need to change the objective function and constraint.
objective_fn <- function(x) {
# x[1] represents x and x[2] represents y
return(- (x[1] * x[2]) )
}
constraint_fn <- function(x) {
return(10 * x[1] + 5 * x[2] - 100)
}
result <- nloptr(x0 = x0,
eval_f = objective_fn,
lb = lower_bounds,
ub = upper_bounds,
eval_g_eq = constraint_fn, # Equality constraint
opts = opts)
result
##
## Call:
##
## nloptr(x0 = x0, eval_f = objective_fn, lb = lower_bounds, ub = upper_bounds,
## eval_g_eq = constraint_fn, opts = opts)
##
##
## Minimization using NLopt version 2.7.1
##
## NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because
## maxeval (above) was reached. )
##
## Number of Iterations....: 10001
## Termination conditions: xtol_rel: 1e-06 maxeval: 10000
## Number of inequality constraints: 0
## Number of equality constraints: 1
## Current value of objective function: -49.8315822207714
## Current value of controls: 4.709713 10.5806
Let’s graph the representation of the problem.
library(ggplot2)
U <- 50
x <- seq(0, 25, 0.1)
y <- U/x # since U = xy
const <- 20 - 2*x # solving the constraint for y
ggplot() +
geom_line(aes(x = x, y = y)) +
geom_line(aes(x = x, y = const)) +
geom_point(aes(x = 5, y = 10)) +
coord_fixed(xlim = c(0, 25),
ylim = c(0, 25)) +
theme_minimal()
Let’s suppose that the firm has to produce 90 units of output \(Q\), with \(w = 21\) and \(r = 3\). Assume that output is produced according to the production function \(Q(L,K) = L^{0.7}K^{0.3}\). Thus, the cost minimization problem is \[ \text{min } 21L + 3K \text{ subject to } 90 = L^{0.7}K^{0.3}\] We will need to change our lower and upper bounds for this problem, so let’s just set them to reasonably large bounds of -500, 500 for both x and y.
lower_bounds <- c(-500, -500)
upper_bounds <- c(500, 500)
objective_fn <- function(x) {
# x[1] represents L and x[2] represents K
return( 21*x[1] + 3*x[2])
}
constraint_fn <- function(x) {
return(90 - x[1]^(0.7) * x[2]^(0.3))
}
result <- nloptr(x0 = x0,
eval_f = objective_fn,
lb = lower_bounds,
ub = upper_bounds,
eval_g_eq = constraint_fn, # Equality constraint
opts = opts)
result
##
## Call:
##
## nloptr(x0 = x0, eval_f = objective_fn, lb = lower_bounds, ub = upper_bounds,
## eval_g_eq = constraint_fn, opts = opts)
##
##
## Minimization using NLopt version 2.7.1
##
## NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because
## maxeval (above) was reached. )
##
## Number of Iterations....: 10000
## Termination conditions: xtol_rel: 1e-06 maxeval: 10000
## Number of inequality constraints: 0
## Number of equality constraints: 1
## Current value of objective function: 1941.86237061024
## Current value of controls: 64.72944 194.1814
Let’s also graph this cost minimization problem. Labor (L) will be on the x-axis and capital (K) will be on the y-axis.
L <- seq(0, 300, 1)
isoquant <- (90/L^(0.7))^(1/0.3)
isocost <- 1941/3 - 7*L
ggplot() +
geom_line(aes(x = L, y = isoquant)) +
geom_line(aes(x = L, y = isocost)) +
geom_point(aes(x = 64.73, y = 194.18)) +
coord_fixed(xlim = c(0, 300),
ylim = c(0, 500)) +
labs(x = "L", y = "K") +
theme_minimal()
Tidyverse is used for data wrangling. It allows you to manipulate data frames in a rather intuitive way. Tidyverse is a huge package so today we will be focusing on functions from the dplyr package (comes with tidyverse).
First, install and load the tidyverse
package. Also install the gapminder
package; this contains the dataset we will use.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.2.1 ✔ dplyr 1.1.4
## ✔ readr 2.1.2 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::cross() masks pracma::cross()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(gapminder)
# look at the dataset
head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
The select()
function allows you to alter your data frame by selecting specific columns. For example, if we want to select the country name:
select(gapminder, country)
## # A tibble: 1,704 × 1
## country
## <fct>
## 1 Afghanistan
## 2 Afghanistan
## 3 Afghanistan
## 4 Afghanistan
## 5 Afghanistan
## 6 Afghanistan
## 7 Afghanistan
## 8 Afghanistan
## 9 Afghanistan
## 10 Afghanistan
## # ℹ 1,694 more rows
If we want to select multiple columns, we can use the c() function:
select(gapminder, c(country, continent))
## # A tibble: 1,704 × 2
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Afghanistan Asia
## 3 Afghanistan Asia
## 4 Afghanistan Asia
## 5 Afghanistan Asia
## 6 Afghanistan Asia
## 7 Afghanistan Asia
## 8 Afghanistan Asia
## 9 Afghanistan Asia
## 10 Afghanistan Asia
## # ℹ 1,694 more rows
A quick aside: The tidyverse allows us to chain functions together with %>%
. This is called a pipe and the pipe connects the LHS to the RHS (like reading a book). This makes it much easier to use multiple functions in one line of code, and makes it much easier to read your code!
For example, let’s rewrite the above line of code using pipes.
gapminder %>% select(country, continent)
## # A tibble: 1,704 × 2
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Afghanistan Asia
## 3 Afghanistan Asia
## 4 Afghanistan Asia
## 5 Afghanistan Asia
## 6 Afghanistan Asia
## 7 Afghanistan Asia
## 8 Afghanistan Asia
## 9 Afghanistan Asia
## 10 Afghanistan Asia
## # ℹ 1,694 more rows
The filter()
function extracts particular observations based on a condition. Let’s filter the data frame to contain only rows that correspond to the year 1957.
gapminder %>% filter(year == 1957)
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1957 30.3 9240934 821.
## 2 Albania Europe 1957 59.3 1476505 1942.
## 3 Algeria Africa 1957 45.7 10270856 3014.
## 4 Angola Africa 1957 32.0 4561361 3828.
## 5 Argentina Americas 1957 64.4 19610538 6857.
## 6 Australia Oceania 1957 70.3 9712569 10950.
## 7 Austria Europe 1957 67.5 6965860 8843.
## 8 Bahrain Asia 1957 53.8 138655 11636.
## 9 Bangladesh Asia 1957 39.3 51365468 662.
## 10 Belgium Europe 1957 69.2 8989111 9715.
## # ℹ 132 more rows
We can also make multiple conditions inside the filter. For an and
condition we use &
and for an or
condition we use |
. Try both. Filter the data frame to contain only rows for China in 2007. Then, filter for continent being Asia or Europe.
gapminder %>% filter(country == "China" & year == 2007)
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 China Asia 2007 73.0 1318683096 4959.
gapminder %>% filter(continent == "Asia" | continent == "Europe")
## # A tibble: 756 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 746 more rows
Filter does not have to be an equality either! We can filter based on >, <, >=, and <=. Filter the data frame to contain all years from 2000 onward.
gapminder %>% filter(year >= 2000)
## # A tibble: 284 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Afghanistan Asia 2007 43.8 31889923 975.
## 3 Albania Europe 2002 75.7 3508512 4604.
## 4 Albania Europe 2007 76.4 3600523 5937.
## 5 Algeria Africa 2002 71.0 31287142 5288.
## 6 Algeria Africa 2007 72.3 33333216 6223.
## 7 Angola Africa 2002 41.0 10866106 2773.
## 8 Angola Africa 2007 42.7 12420476 4797.
## 9 Argentina Americas 2002 74.3 38331121 8798.
## 10 Argentina Americas 2007 75.3 40301927 12779.
## # ℹ 274 more rows
You use arrange()
to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the data frame based on the lifeExp
variable. By default, it sounds from lowest to greatest value (ascending order). To do descending order, wrap in the desc()
function
# Ascending
gapminder %>% arrange(lifeExp)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Rwanda Africa 1992 23.6 7290203 737.
## 2 Afghanistan Asia 1952 28.8 8425333 779.
## 3 Gambia Africa 1952 30 284320 485.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Sierra Leone Africa 1952 30.3 2143249 880.
## 6 Afghanistan Asia 1957 30.3 9240934 821.
## 7 Cambodia Asia 1977 31.2 6978607 525.
## 8 Mozambique Africa 1952 31.3 6446316 469.
## 9 Sierra Leone Africa 1957 31.6 2295678 1004.
## 10 Burkina Faso Africa 1952 32.0 4469979 543.
## # ℹ 1,694 more rows
# Descending
gapminder %>% arrange(desc(lifeExp))
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Japan Asia 2002 82 127065841 28605.
## 4 Iceland Europe 2007 81.8 301931 36181.
## 5 Switzerland Europe 2007 81.7 7554661 37506.
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 7 Australia Oceania 2007 81.2 20434176 34435.
## 8 Spain Europe 2007 80.9 40448191 28821.
## 9 Sweden Europe 2007 80.9 9031088 33860.
## 10 Israel Asia 2007 80.7 6426679 25523.
## # ℹ 1,694 more rows
Now that we know multiple functions, this is where the pipe operator really comes in handy. We can chain multiple operations together. For example, a filter into an arrange:
gapminder %>%
filter(year == 1957) %>%
arrange(desc(pop))
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 China Asia 1957 50.5 637408000 576.
## 2 India Asia 1957 40.2 409000000 590.
## 3 United States Americas 1957 69.5 171984000 14847.
## 4 Japan Asia 1957 65.5 91563009 4318.
## 5 Indonesia Asia 1957 39.9 90124000 859.
## 6 Germany Europe 1957 69.1 71019069 10188.
## 7 Brazil Americas 1957 53.3 65551171 2487.
## 8 United Kingdom Europe 1957 70.4 51430000 11283.
## 9 Bangladesh Asia 1957 39.3 51365468 662.
## 10 Italy Europe 1957 67.8 49182000 6249.
## # ℹ 132 more rows
We can create new variables with mutate()
. Suppose we want total GDP, instead of GDP per capita.
gapminder %>% mutate(gdp = gdpPercap * pop)
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ℹ 1,694 more rows
Next, let’s combine what we know so far. Find the countries with the highest life expectancy, in months, in the year 2007.
gapminder %>% filter(year == 2007) %>%
mutate(lifeExpMonths = lifeExp * 12) %>%
arrange(desc(lifeExpMonths))
## # A tibble: 142 × 7
## country continent year lifeExp pop gdpPercap lifeExpMonths
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656. 991.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725. 986.
## 3 Iceland Europe 2007 81.8 301931 36181. 981.
## 4 Switzerland Europe 2007 81.7 7554661 37506. 980.
## 5 Australia Oceania 2007 81.2 20434176 34435. 975.
## 6 Spain Europe 2007 80.9 40448191 28821. 971.
## 7 Sweden Europe 2007 80.9 9031088 33860. 971.
## 8 Israel Asia 2007 80.7 6426679 25523. 969.
## 9 France Europe 2007 80.7 61083916 30470. 968.
## 10 Canada Americas 2007 80.7 33390141 36319. 968.
## # ℹ 132 more rows
This will group data together and can make summary statistics. You can use mean, sum, median, etc. Let’s find the median life expectancy in the data set.
gapminder %>% summarize(median(lifeExp))
## # A tibble: 1 × 1
## `median(lifeExp)`
## <dbl>
## 1 60.7
Note that this gives the median for the entire dataset. We can use group_by()
to group the data first, and then summarize. For example, if we want the median life expectancy per year:
gapminder %>% group_by(year) %>% summarize(median(lifeExp))
## # A tibble: 12 × 2
## year `median(lifeExp)`
## <int> <dbl>
## 1 1952 45.1
## 2 1957 48.4
## 3 1962 50.9
## 4 1967 53.8
## 5 1972 56.5
## 6 1977 59.7
## 7 1982 62.4
## 8 1987 65.8
## 9 1992 67.7
## 10 1997 69.4
## 11 2002 70.8
## 12 2007 71.9
You can also group by multiple items. If we want the median by year and by continent:
gapminder %>% group_by(year, continent) %>% summarize(median(lifeExp))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 3
## # Groups: year [12]
## year continent `median(lifeExp)`
## <int> <fct> <dbl>
## 1 1952 Africa 38.8
## 2 1952 Americas 54.7
## 3 1952 Asia 44.9
## 4 1952 Europe 65.9
## 5 1952 Oceania 69.3
## 6 1957 Africa 40.6
## 7 1957 Americas 56.1
## 8 1957 Asia 48.3
## 9 1957 Europe 67.6
## 10 1957 Oceania 70.3
## # ℹ 50 more rows
You can also summarize multiple variables at once. Let’s summarize the mean gdp per capita as well.
gapminder %>% group_by(year, continent) %>% summarize(median(lifeExp), mean(gdpPercap))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 4
## # Groups: year [12]
## year continent `median(lifeExp)` `mean(gdpPercap)`
## <int> <fct> <dbl> <dbl>
## 1 1952 Africa 38.8 1253.
## 2 1952 Americas 54.7 4079.
## 3 1952 Asia 44.9 5195.
## 4 1952 Europe 65.9 5661.
## 5 1952 Oceania 69.3 10298.
## 6 1957 Africa 40.6 1385.
## 7 1957 Americas 56.1 4616.
## 8 1957 Asia 48.3 5788.
## 9 1957 Europe 67.6 6963.
## 10 1957 Oceania 70.3 11599.
## # ℹ 50 more rows
We have already used ggplot to make several graphs earlier in this course. However, earlier we were graphing equations and functions. Here, we will learn how to visualize our data from the data frame.
Let’s use the year 2007 to make a scatterplot of the GDP per capita and life expectancy by country.
gapminder_2000 <- gapminder %>%
filter(year == 2007)
ggplot(gapminder_2000, aes(x = gdpPercap, y = lifeExp)) + geom_point()
Since some countries have a very high GDP per capita, we might want to change our x-axis to a log scale.
ggplot(gapminder_2000, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10()
Faceting divides a graph into subplots based on one of its variables, such as the continent.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10() +
facet_wrap(~continent)
You can also visualize summary stats. Let’s make a line plot of the median life expectancy over time.
medianlife <- gapminder %>% group_by(year) %>% summarize(med = median(lifeExp))
ggplot(medianlife, aes(x = year, y = med)) + geom_line()
Next, let’s use a bar plot to show the gdp per capita in 2007 by continent.
gapminder_07 <- gapminder %>% filter(year == 2007)
ggplot(gapminder_07, aes(x = continent, y = gdpPercap)) + geom_col()
Similarly, we can make histograms of a variable using geom_histogram()
. Here is a histogram of the GDP per capita in 2007.
ggplot(gapminder_07, aes(x = gdpPercap)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Box plots show the distribution as well.
ggplot(gapminder_07, aes(x = continent, y = gdpPercap)) +
geom_boxplot() +
scale_y_log10()
We use the base function integrate()
to compute definite integrals.
f <- function(x){x^2}
integrate(f, lower = 1, upper = 4)
## 21 with absolute error < 2.3e-13
f <- function(x){exp(x) - x^2}
integrate(f, lower = 1, upper = 3)
## 8.700588 with absolute error < 9.7e-14
f <- function(x){1/x^2}
integrate(f, lower = 1, upper = Inf)
## 1 with absolute error < 1.1e-14
We can compute indefinite integrals with the antiD()
function from the mosaicCalc
package.
library(mosaicCalc)
antiD(4*x^3 ~ x)
## function (x, C = 0)
## x^4 + C
antiD(6*x^(1/2) ~ x)
## function (x, C = 0)
## 4 * sqrt(x)^3 + C
antiD(4*(3*x - 5)^3 ~ x)
## function (x, C = 0)
## 27 * x^4 - 180 * x^3 + 450 * x^2 - 500 * x + C
Suppose we want to find the change in total cost for a firm with \(MC = 0.027Q^2 - Q + 15\) from Q = 10 to Q = 20.
mc <- function(Q){0.027*Q^2 - Q + 15}
integrate(mc, 10, 20)
## 63 with absolute error < 7e-13
The change in total cost is 63.
The conumer surplus is given by: \[ \int_0^{Q_0} P(Q)dQ - P_0Q_0\]
The producer suprlus is given by: \[P_0Q_0 - \int_0^{Q_0} MC(Q)dQ\]
Assume that the demand and supply equations are \(P(Q) = -2Q + 21\) and \(MC(Q) = Q + 3\). We then need to find \(P_0\) and \(Q_0\), the equilibrium price and quantity.
# create demand and supply functions
demand <- function(Q){-2*Q + 21}
supply <- function(Q){Q + 3}
# find equilibrium value for Q
Q0 <- uniroot(function(Q){demand(Q) - supply(Q)},
c(0, 25))$root
Q0
## [1] 6
# plug into either demand or supply to get price
P0 <- Q0 + 3
P0
## [1] 9
Now we can integrate.
cs <- integrate(demand, 0, Q0)$value - P0*Q0
cs
## [1] 36
ps <- P0*Q0 - integrate(supply, 0, Q0)$value
ps
## [1] 18
Let’s make a graph illustrating the consumer and producer suprlus. First, put values into a data frame.
Q <- seq(0, 25, by = .1)
D <- -2*Q + 21
S <- Q + 3
df <- data.frame(Q, D, S, Q0, P0)
Then, we can make the graph.
ggplot(df) + geom_line(aes(Q, D)) + geom_line(aes(Q, S)) +
geom_ribbon(data = subset(df, 0 <= Q & Q <= 6), aes(Q, ymin = P0, ymax = D), fill = "yellow") +
geom_ribbon(data = subset(df, 0 <= Q & Q <= 6), aes(Q, ymin = S, ymax = P0), fill = "green") +
theme_minimal() +
labs(x = "Quantity", y = "Price") +
ylim(0,25) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0)