R is a programming language and free software environment for statistical computing and graphics. R provides a set of operators for calculations on numbers, arrays, and matrices. For example, to calculate 10 + 2, use the following syntax. (Click the green arrow to see the result.)
10 + 2
## [1] 12
The result can be saved to a new variable ‘outcome’ using either one of the following two lines:
outcome = 10 + 2
outcome <- 10 + 2
This newly created variable can be found under the Environment panel.
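As a quick check that the two assignment forms behave the same at the top level, the sketch below stores the sum under two hypothetical names (`outcome_eq` and `outcome_arrow`, which are not part of the exercise) and compares them.

```r
# Both assignment operators create a variable at the top level.
outcome_eq = 10 + 2      # assignment with =
outcome_arrow <- 10 + 2  # assignment with <-

# Calling print() (or typing the variable's name) displays the stored value.
print(outcome_arrow)

# identical() checks whether two objects are exactly the same.
same_value <- identical(outcome_eq, outcome_arrow)
print(same_value)
```

The arrow form `<-` is the more common convention in R code, which is why the rest of this tutorial uses it.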
The other basic math operations can be easily conducted in R as well.
10 - 2 # 10 minus 2
## [1] 8
10 * 2 # 10 times 2
## [1] 20
10 / 2 # 10 divided by 2
## [1] 5
sqrt(4) # Returns the square root of 4
## [1] 2
10^2 # 10 to the power of 2
## [1] 100
exp(0) # The exponential of 0
## [1] 1
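The same operators also work on vectors and matrices, not only on single numbers. The following is a minimal sketch; the names `x` and `m` and their values are made up for illustration.

```r
# c() combines numbers into a vector; arithmetic is applied element by element.
x <- c(1, 2, 3)
x_doubled <- x * 2   # each element times 2
x_squared <- x^2     # each element squared

# matrix() fills its entries column by column by default,
# so the columns of m are (1, 2) and (3, 4).
m <- matrix(1:4, nrow = 2)
m_elementwise <- m * m   # element-wise product
m_product <- m %*% m     # matrix product

print(x_doubled)
print(m_product)
```

Note the difference between `*`, which multiplies matching entries, and `%*%`, which carries out matrix multiplication.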
Calculate 1+2 and save the value to the new variable, ‘q1’.
# Your answers:
q1 <- 1 + 2
R is more than a calculator. Much of its power comes from external libraries (also called packages). Here, we install one library used throughout this semester: “wooldridge”. This library provides all the datasets used in the textbook exercises.
# install.packages("wooldridge") # Quote the library name when installing the library.
Note that you only need to install a package once on your computer.
However, every time you start RStudio, you need to load the library so that its functions are ready for use. You can use the following syntax:
library(wooldridge) # No need to quote the library name when calling the library.
## Warning: package 'wooldridge' was built under R version 4.2.1
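If you prefer not to keep track of which packages are already installed, the two steps can be combined. The sketch below defines a hypothetical helper, `load_or_install()`, which is not part of R or of the course materials; it relies only on the standard functions `requireNamespace()`, `install.packages()`, and `library()`.

```r
# Hypothetical helper (an illustration, not a built-in function):
# install a package only when it is not already available, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is missing; needs internet access
  }
  # character.only = TRUE lets library() accept the name as a quoted string.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Demonstrated with "stats", which ships with R, so nothing is downloaded.
load_or_install("stats")
```

For this course you could call `load_or_install("wooldridge")` in the same way.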
Install the following libraries: car, lmtest, plm, and sandwich.
# Your answers:
# install.packages("car")
# install.packages("lmtest")
# install.packages("plm")
# install.packages("sandwich")
Load the following libraries: car, lmtest, plm, and sandwich.
# Your answers:
library(car)
## Warning: package 'car' was built under R version 4.2.1
## Warning: package 'carData' was built under R version 4.2.1
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.1
## Warning: package 'zoo' was built under R version 4.2.1
library(plm)
## Warning: package 'plm' was built under R version 4.2.1
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.2.1
Use the following syntax to load the dataset ‘hprice1’ from the library ‘wooldridge’.
data('hprice1') # data() is a built-in R function; 'hprice1' is a dataset supplied by the library 'wooldridge'.
To find the list of variable names in this dataset, use the following syntax:
names(hprice1)
## [1] "price" "assess" "bdrms" "lotsize" "sqrft" "colonial"
## [7] "lprice" "lassess" "llotsize" "lsqrft"
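Besides the variable names, it is often useful to check how many observations a dataset has and what type each variable is. The sketch below uses `mtcars`, a dataset that ships with R, purely so that the example runs without any extra package; the same functions can be applied to `hprice1`.

```r
# mtcars is a built-in dataset, used here only for illustration.
data("mtcars")

n_obs <- nrow(mtcars)    # number of observations (rows)
n_vars <- ncol(mtcars)   # number of variables (columns)
print(dim(mtcars))       # rows and columns reported together

# str() lists each variable with its type and first few values.
str(mtcars)
```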
Many variable names are abbreviated. For a detailed explanation of each variable, refer to ‘Wooldridge_Data Manual.pdf’, which is available on Quercus.
We can also display the first few rows of this dataset using the function ‘head()’:
head(hprice1)
To report summary statistics of all the variables in the dataset, we can do the following:
summary(hprice1)
## price assess bdrms lotsize sqrft
## Min. :111.0 Min. :198.7 Min. :2.000 Min. : 1000 Min. :1171
## 1st Qu.:230.0 1st Qu.:253.9 1st Qu.:3.000 1st Qu.: 5733 1st Qu.:1660
## Median :265.5 Median :290.2 Median :3.000 Median : 6430 Median :1845
## Mean :293.5 Mean :315.7 Mean :3.568 Mean : 9020 Mean :2014
## 3rd Qu.:326.2 3rd Qu.:352.1 3rd Qu.:4.000 3rd Qu.: 8583 3rd Qu.:2227
## Max. :725.0 Max. :708.6 Max. :7.000 Max. :92681 Max. :3880
## colonial lprice lassess llotsize
## Min. :0.0000 Min. :4.710 Min. :5.292 Min. : 6.908
## 1st Qu.:0.0000 1st Qu.:5.438 1st Qu.:5.537 1st Qu.: 8.654
## Median :1.0000 Median :5.582 Median :5.671 Median : 8.769
## Mean :0.6932 Mean :5.633 Mean :5.718 Mean : 8.905
## 3rd Qu.:1.0000 3rd Qu.:5.788 3rd Qu.:5.864 3rd Qu.: 9.058
## Max. :1.0000 Max. :6.586 Max. :6.563 Max. :11.437
## lsqrft
## Min. :7.066
## 1st Qu.:7.415
## Median :7.520
## Mean :7.573
## 3rd Qu.:7.708
## Max. :8.264
If we only care about the summary statistics of one variable, we can select it with the ‘$’ operator:
summary(hprice1$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 111.0 230.0 265.5 293.5 326.2 725.0
mean(hprice1$price) # Sample mean
## [1] 293.546
sd(hprice1$price) # Sample standard deviation
## [1] 102.7134
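Other single-variable statistics follow the same pattern. The sketch below again uses the built-in `mtcars` dataset as a stand-in (an assumption made so the code runs without `wooldridge`), and shows how a missing value affects `mean()`.

```r
# mtcars is a built-in dataset, used here only for illustration.
data("mtcars")
mpg <- mtcars$mpg

print(median(mpg))                            # sample median
print(var(mpg))                               # sample variance
print(quantile(mpg, probs = c(0.25, 0.75)))   # first and third quartiles

# The sample standard deviation is the square root of the sample variance.
sd_matches_var <- isTRUE(all.equal(sd(mpg), sqrt(var(mpg))))

# With a missing value (NA), mean() returns NA unless na.rm = TRUE is set.
mpg_with_na <- c(mpg, NA)
mean_with_na <- mean(mpg_with_na)
mean_without_na <- mean(mpg_with_na, na.rm = TRUE)
```

The `na.rm = TRUE` argument is also accepted by `sd()`, `var()`, and `median()`.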
Load the dataset ‘crime1’ from the wooldridge library.
Next, report the sample mean and the standard deviation of the number of months in prison during 1986.
Note that you need to consult the data manual to find which variable represents the number of months in prison during 1986.
# Your answers:
data('crime1')
mean(crime1$ptime86) # ptime86 = months in prison during 1986; avgsen is the average sentence length instead
sd(crime1$ptime86)
Sometimes, we need to select a subsample that meets certain criteria for data analysis. For example, the dataset ‘alcohol’ provides a sample of individuals with their demographic characteristics and alcohol abuse behavior. In this dataset, abuse = 1 if the individual has abused alcohol.
data('alcohol')
head(alcohol)
mean(alcohol$abuse)
## [1] 0.09916514
Because abuse is a 0/1 variable, its mean is the share of individuals with abuse = 1. In this sample, 9.9% of individuals have abused alcohol.
We want to analyze alcohol abuse among those with more than 12 years of education. We first select the subsample from the dataset, and save this subsample into a new dataset, ‘college_sample’.
college_sample <- subset(alcohol, educ > 12)
head(college_sample)
mean(college_sample$abuse)
## [1] 0.08769793
The mean of abuse is 0.0877. Therefore, among individuals with college education, 8.8% of them have alcohol abuse activity.
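The condition passed to `subset()` can be any logical expression. The sketch below uses the built-in `mtcars` dataset as a stand-in (an assumption, so that the code runs without `wooldridge`) to show the equivalent bracket indexing, an equality condition, and a combined condition.

```r
# mtcars is a built-in dataset, used here only for illustration.
data("mtcars")

# subset() keeps the rows for which the condition is TRUE.
heavy_a <- subset(mtcars, wt > 3)

# The same selection with bracket indexing: rows before the comma, columns after.
heavy_b <- mtcars[mtcars$wt > 3, ]

# Equality is tested with ==, not =. A single = does not compare values.
manual <- subset(mtcars, am == 1)

# Conditions can be combined with & (and) or | (or).
heavy_manual <- subset(mtcars, wt > 3 & am == 1)

print(nrow(heavy_a))
print(nrow(manual))
```

A quick `nrow()` check after subsetting is a cheap way to confirm that the filter actually dropped rows.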
In the dataset ‘crime1’, among those who were born in 1960, find the average time in prison since age 18 and its standard deviation.
# Your answers:
data('crime1')
born_1960 <- subset(crime1, born60 == 1) # Use == to test equality; with a single =, subset() applies no filter and keeps every row.
mean(born_1960$tottime)
sd(born_1960$tottime)
A scatterplot visualizes the sample relationship between x and y. For example, in the following scatterplot, the x-axis represents years of education and the y-axis represents the average hourly wage. The scatterplot shows that hourly earnings tend to be higher for individuals with more years of education.
data('wage1')
plot(wage1$educ, wage1$wage, main="Returns to Education",
xlab="years of education", ylab="avg hourly earnings")
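To make the direction of a relationship easier to see, a fitted straight line can be drawn on top of a scatterplot. The sketch below is an optional extension rather than part of the exercise; it uses the built-in `mtcars` dataset as a stand-in (an assumption, so that the code runs without `wooldridge`) together with the standard functions `lm()` and `abline()`.

```r
# mtcars is a built-in dataset, used here only for illustration.
data("mtcars")

# Scatterplot of fuel economy (y-axis) against car weight (x-axis).
plot(mtcars$wt, mtcars$mpg, main = "Fuel Economy and Weight",
     xlab = "weight (1000 lbs)", ylab = "miles per gallon")

# Fit a simple linear regression of mpg on wt and add the fitted line to the plot.
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")

# The slope coefficient summarizes the direction of the sample relationship.
slope <- coef(fit)["wt"]
print(slope)
```

The same two extra lines could be added after the `wage1` plot, with `lm(wage ~ educ, data = wage1)` as the fitted model.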
Using the dataset ‘hprice1’, create a scatterplot to show the relationship between the size of the house (x-axis) and the assessed value of the house (y-axis).
Again, you need to consult the data manual to find which variables in the dataset represent the ones you need.
# Your answers:
data('hprice1')
plot(hprice1$sqrft, hprice1$assess, main = "Relation Between House Size and Assessed Value", xlab = "Size of House (sqft)", ylab = "Assessed Value ($1000s)")