Part I. Basic mathematical operations in R

R is a programming language and free software environment for statistical computing and graphics. R provides a series of operators for calculations on numbers, arrays, and matrices. For example, it can calculate 10 + 2 using the following syntax. (Click the green arrow to see the result.)

10 + 2

## [1] 12

The result can be saved to a new variable ‘outcome’ using either one of the following two lines:

outcome = 10 + 2
outcome <- 10 + 2

This newly created variable can be found under the Environment panel.
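As a small illustration of what a saved variable is good for, the sketch below prints the stored value and reuses it in a second calculation. The name `doubled` is made up for this example.

```r
outcome <- 10 + 2       # store the result; `<-` is the conventional assignment operator in R
outcome                 # typing the name prints the stored value

doubled <- outcome * 2  # reuse the stored value (hypothetical variable name)
doubled
```

Both `outcome` and `doubled` should then be listed in the Environment panel.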

The other basic math operations can be easily conducted in R as well.

10 - 2 # 10 minus 2

## [1] 8

10 * 2 # 10 times 2

## [1] 20

10 / 2 # 10 divided by 2

## [1] 5

sqrt(4) # Returns the square root of 4

## [1] 2

10^2 # 10 to power 2

## [1] 100

exp(0) # The exponential of 0

## [1] 1
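A few more operators may be useful later on. The sketch below is not part of the original notes; it shows integer division, the remainder, the natural log, and the same arithmetic applied to a vector and a matrix. The names `v` and `m` are made up for illustration.

```r
10 %/% 3     # integer division
10 %% 3      # remainder after division
log(exp(2))  # log() is the natural log, so it undoes exp()
-2^2         # ^ is applied before the minus sign, so this is -(2^2)

v <- c(1, 2, 3)             # a numeric vector
v * 2                       # arithmetic is applied element by element
v + c(10, 20, 30)

m <- matrix(1:4, nrow = 2)  # a 2 x 2 matrix, filled column by column
m * m                       # element-wise product
m %*% m                     # matrix product
t(m)                        # transpose
```

Note the difference between `*` (element-wise) and `%*%` (matrix multiplication).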

Exercise 1 [1 point]

Calculate 1+2 and save the value to the new variable, ‘q1’.

# Your answers:
q1 <- 1 + 2

Part II. Install and call a new library

R is more than a calculator. Much of its power comes from external libraries (packages). Here, we install one library used throughout this semester: “wooldridge”. This library provides all the datasets used in the textbook exercises.

# install.packages("wooldridge") # Quote the library name when installing the library.

Note that you only need to install a package once on your computer.

Next, every time you start RStudio, you need to load the library so that its functions and datasets are ready for use. You can use the following syntax:

library(wooldridge) # No need to quote the library name when calling the library.

## Warning: package 'wooldridge' was built under R version 4.2.1
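One possible way to avoid reinstalling a package that is already on your computer is to check first. The sketch below is only an illustration: `missing_packages` is a hypothetical helper written for these notes, not a function from R or from any library. It relies on the base function `requireNamespace()`, which returns FALSE when a package is not installed.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("wooldridge"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment this line to actually install them
}
```

The install line is left commented out, in the same way as above, so that running the chunk never downloads anything by accident.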

Exercise 2 [1 point]

Install the following libraries: car, lmtest, plm, and sandwich.

# Your answers:
# install.packages("car")
# install.packages("lmtest")
# install.packages("plm")
# install.packages("sandwich")

Load the following libraries: car, lmtest, plm, and sandwich.

# Your answers:
library(car)

## Warning: package 'car' was built under R version 4.2.1

## Warning: package 'carData' was built under R version 4.2.1

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.2.1

## Warning: package 'zoo' was built under R version 4.2.1

library(plm)

## Warning: package 'plm' was built under R version 4.2.1

library(sandwich)

## Warning: package 'sandwich' was built under R version 4.2.1

Part III. Load in a dataset and report the summary statistics

Use the following syntax to load the dataset ‘hprice1’ from the library ‘wooldridge’.

data('hprice1') # data() is a base R function; the dataset 'hprice1' comes from library 'wooldridge'.

To find the list of variable names in this dataset, use the following syntax:

names(hprice1)

##  [1] "price"    "assess"   "bdrms"    "lotsize"  "sqrft"    "colonial"
##  [7] "lprice"   "lassess"  "llotsize" "lsqrft"

Many variable names are abbreviated. For a detailed explanation of each variable, refer to ‘Wooldridge_Data Manual.pdf’, which is available on Quercus.

We can also report the first few entries in this dataset using the function ‘head()’:

head(hprice1)
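Besides `names()` and `head()`, a few other base R functions give a quick overview of a dataset. The sketch below uses the built-in dataset `mtcars` purely as a stand-in, so that it runs even if ‘wooldridge’ is not installed; the same calls should work on `hprice1`.

```r
data("mtcars")   # built-in example dataset, used here only as a stand-in

dim(mtcars)      # number of rows (observations) and columns (variables)
nrow(mtcars)     # number of observations only
str(mtcars)      # type and first few values of every variable
head(mtcars, 3)  # first 3 rows instead of the default 6
```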

To report summary statistics of all the variables in the dataset, we can do the following:

summary(hprice1)

##      price           assess          bdrms          lotsize          sqrft     
##  Min.   :111.0   Min.   :198.7   Min.   :2.000   Min.   : 1000   Min.   :1171  
##  1st Qu.:230.0   1st Qu.:253.9   1st Qu.:3.000   1st Qu.: 5733   1st Qu.:1660  
##  Median :265.5   Median :290.2   Median :3.000   Median : 6430   Median :1845  
##  Mean   :293.5   Mean   :315.7   Mean   :3.568   Mean   : 9020   Mean   :2014  
##  3rd Qu.:326.2   3rd Qu.:352.1   3rd Qu.:4.000   3rd Qu.: 8583   3rd Qu.:2227  
##  Max.   :725.0   Max.   :708.6   Max.   :7.000   Max.   :92681   Max.   :3880  
##     colonial          lprice         lassess         llotsize     
##  Min.   :0.0000   Min.   :4.710   Min.   :5.292   Min.   : 6.908  
##  1st Qu.:0.0000   1st Qu.:5.438   1st Qu.:5.537   1st Qu.: 8.654  
##  Median :1.0000   Median :5.582   Median :5.671   Median : 8.769  
##  Mean   :0.6932   Mean   :5.633   Mean   :5.718   Mean   : 8.905  
##  3rd Qu.:1.0000   3rd Qu.:5.788   3rd Qu.:5.864   3rd Qu.: 9.058  
##  Max.   :1.0000   Max.   :6.586   Max.   :6.563   Max.   :11.437  
##      lsqrft     
##  Min.   :7.066  
##  1st Qu.:7.415  
##  Median :7.520  
##  Mean   :7.573  
##  3rd Qu.:7.708  
##  Max.   :8.264

If we only care about the summary statistics of one variable, use the following syntax:

summary(hprice1$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   111.0   230.0   265.5   293.5   326.2   725.0

mean(hprice1$price) # Sample mean

## [1] 293.546

sd(hprice1$price) # Sample standard deviation

## [1] 102.7134
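To see what `mean()` and `sd()` compute, the sketch below reproduces them by hand on a short vector of made-up numbers (not taken from any dataset). Note that `sd()` divides by n - 1, not n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up numbers, for illustration only
n <- length(x)

manual_mean <- sum(x) / n                              # sample mean
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))  # sample standard deviation

all.equal(manual_mean, mean(x))  # TRUE if the manual mean matches mean()
all.equal(manual_sd, sd(x))      # TRUE if the manual sd matches sd()
```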

Exercise 3 [1 point]

Load the dataset ‘crime1’ from the wooldridge library.

Next, report the sample mean and the standard deviation of the number of months in prison during 1986.

Note that you need to go to the data manual to find which variable represents the number of months in prison during 1986.

# Your answers:
data('crime1')
mean(crime1$ptime86) # ptime86 is months in prison during 1986; avgsen is the average sentence length

sd(crime1$ptime86)

Part IV. Data manipulation

Sometimes, we need to select a subsample that meets certain criteria for data analysis. For example, data(‘alcohol’) provides a sample of individuals with their demographic characteristics and alcohol abuse behavior. In this dataset, abuse = 1 if the individual has abused alcohol.

data('alcohol')
head(alcohol)

mean(alcohol$abuse)

## [1] 0.09916514

In this sample, 9.9% of individuals have abused alcohol.

We want to analyze the alcohol abuse behavior among those with more than 12 years of education. We first select the subsample from the dataset, and save this subsample into a new dataset, ‘college_sample’.

college_sample <- subset(alcohol, educ > 12)
head(college_sample)

mean(college_sample$abuse)

## [1] 0.08769793

The mean of abuse is 0.0877. Therefore, among individuals with more than 12 years of education, 8.8% have abused alcohol.
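The sketch below illustrates the same idea on a tiny made-up data frame (`toy` is hypothetical and is not the ‘alcohol’ data). It shows that `subset()` and bracket indexing select the same rows, that the mean of a 0/1 variable is a proportion, and that conditions can be combined with `&`. Equality is tested with `==`, not `=`.

```r
# Made-up data, for illustration only
toy <- data.frame(
  educ  = c(10, 12, 14, 16, 16, 12),
  abuse = c(1, 0, 0, 1, 0, 0)
)

by_subset  <- subset(toy, educ > 12)  # keep rows with educ > 12
by_bracket <- toy[toy$educ > 12, ]    # same selection with bracket indexing
nrow(by_subset)
nrow(by_bracket)

mean(by_subset$abuse)                 # share of the subsample with abuse equal to 1

# Combine two conditions: use & for "and", == for "equal to"
high_educ_abusers <- subset(toy, educ > 12 & abuse == 1)
nrow(high_educ_abusers)
```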

Exercise 4 [1 point]

In the dataset ‘crime1’, among those who were born in 1960, find the average time in prison since 18 and the standard deviation.

# Your answers:
data('crime1')
born_1960 <- subset(crime1, born60 == 1) # == tests equality; with a single = no rows would be filtered out
mean(born_1960$tottime)

sd(born_1960$tottime)

Part V. Scatterplot of variables

A scatterplot visualizes the sample relationship between x and y. For example, in the following scatterplot, the x-axis represents years of education and the y-axis represents the average hourly wage. The scatterplot shows that hourly earnings tend to be higher with more years of education.

data('wage1')
plot(wage1$educ, wage1$wage, main="Returns to Education",
   xlab="years of education", ylab="avg hourly earnings")
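A scatterplot can be complemented by a single number that summarizes the direction of the relationship. The sketch below is only an illustration on made-up numbers (`toy_wage` is hypothetical and is not the ‘wage1’ data). It draws the plot with the formula interface, `y ~ x`, and then computes the sample correlation with `cor()`.

```r
# Made-up data, for illustration only
toy_wage <- data.frame(
  educ = c(8, 10, 12, 12, 14, 16, 16, 18),
  wage = c(4.0, 4.5, 5.5, 6.0, 7.5, 9.0, 10.5, 12.0)
)

# Formula interface: the variable before ~ goes on the y-axis
plot(wage ~ educ, data = toy_wage, main = "Toy example",
   xlab = "years of education", ylab = "avg hourly earnings")

r <- cor(toy_wage$educ, toy_wage$wage)  # sample correlation, between -1 and 1
r
```

A positive value of `r` corresponds to an upward-sloping cloud of points.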

Exercise 5 [1 point]

Using the dataset ‘hprice1’, create a scatterplot to show the relation between the size of the house (x-axis) and the assessed value of the house (y-axis).

Again, you need to go to the data manual to find which variables in the dataset represent the ones you need.

# Your answers:
data('hprice1')
plot(hprice1$sqrft, hprice1$assess,
   main = "Relation Between House Size and Assessed Value",
   xlab = "Size of House (sqft)", ylab = "Assessed Value ($1000s)")

R Tutorial Notes

Yue Yu