MSSP 897 Lab 1: Introduction to R

The slides for the Lab 1 session are now available to review here.

Before the next lab

1. (Optional) Email me (chiuwu@upenn.edu) your learning goals and how you envision the optimal learning environment for you.

2. Assignment 1 will be due at 12:59pm February 4th, 2019 (next Monday)

Step 1. Create a project and an R script, and set up the working directory

In RStudio, create a new project from the “File” menu and select a preferred file location. A project keeps all the files associated with a piece of work organized together; each project has its own working directory, workspace, history, and source documents.

Download the dataset titled WLTH1994.csv from the Canvas site to the same location for your project.

Then, open a new R script. An R script is just a plain text file with a series of R commands in it. You can save it and run all the commands at once, which saves a lot of time.

Check your working directory, which is the folder where your files are saved. Is the path correct? Yes, and I can see the data file in my Files pane on the right (in my RStudio) as well.

getwd()

## [1] "C:/Users/bijou/Box Sync/Teaching/MSSP897/MSSP897_lab"
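If you prefer to check from the console rather than the Files pane, a minimal sketch is shown below. It assumes only that the dataset keeps the file name WLTH1994.csv used in this lab; the object names csv_files and data_found are made up for illustration.

```r
# List every CSV file in the current working directory.
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)

# TRUE if the lab dataset can be found in the working directory, FALSE otherwise.
data_found <- file.exists("WLTH1994.csv")
print(data_found)
```

If data_found is FALSE, either move the file into the project folder or read it with its full path.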

‘wealth’ is the object where the data will be stored. If the argument “header” is set to TRUE, the first row of the file is treated as the column (variable) names rather than as data.

wealth <- read.csv("WLTH1994.csv",header=TRUE, sep=",")
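To see what the header argument does, here is a small sketch that reads the same text twice. The two-column CSV text is a toy example (it borrows two variable names and a few values from this dataset), and it is passed through the text argument so that no file is needed.

```r
# Toy CSV content supplied as text instead of a file (for illustration only).
csv_text <- "FAMID94,NW94\n14,165000\n15,11500"

# header = TRUE: the first row becomes the column names.
with_header <- read.csv(text = csv_text, header = TRUE, sep = ",")
# header = FALSE: the first row is treated as data and R invents names.
without_header <- read.csv(text = csv_text, header = FALSE, sep = ",")

colnames(with_header)     # "FAMID94" "NW94"
nrow(with_header)         # 2
colnames(without_header)  # "V1" "V2"
nrow(without_header)      # 3
```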

If your data is not stored in the project folder, you will need to provide the full path to the file. Remember to use the forward slash “/”.

wealth <- read.csv("C:/Users/bijou/Box Sync/Teaching/MSSP897/MSSP897_lab/WLTH1994.csv",header=TRUE, sep=",")

What if our data were saved as an Excel sheet (.xlsx)? We first install a package called “readxl” with install.packages("readxl") and then run the following code:

library(readxl)
wealth <- read_excel("WLTH1994.xlsx")

Alternatively, you can open it through the files pane by clicking the .xlsx file

Step 2. Explore our data (exploratory data analysis)

Examine the dimensions of your dataset. dim() returns two numbers: (1) the number of rows and (2) the number of columns.

dim(wealth)

## [1] 8628   19

Turn off scientific notation so that large numbers are printed in full.

options(scipen=999)
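A quick way to see the effect of this setting is to format the same number under both settings. This is only a sketch: the number is arbitrary, and sci_version and full_version are illustrative names.

```r
# Temporarily switch to R's default (scipen = 0) and remember the old settings.
old <- options(scipen = 0)
sci_version <- format(100000000)    # "1e+08"

# A large scipen value makes R prefer fixed notation.
options(scipen = 999)
full_version <- format(100000000)   # "100000000"

# Put back whatever setting was active before this example.
options(old)
```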

What are the variables included in the dataset?

colnames(wealth)

##  [1] "S300"     "FAMID94"  "S302"     "Bus94"    "S304"     "CHK94"   
##  [7] "S306"     "DBT94"    "S308"     "REAL94"   "S310"     "STOCK94" 
## [13] "S313"     "OTHASS94" "VALU94"   "NFA94"    "NW94"     "equity94"
## [19] "home94"

Examine the structure of the data frame, including the type of each variable (factor, numeric, integer, etc.).

str(wealth)

## Classes 'tbl_df', 'tbl' and 'data.frame':    8628 obs. of  19 variables:
##  $ S300    : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ FAMID94 : num  14 15 18 19 24 34 48 51 60 86 ...
##  $ S302    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bus94   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ S304    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ CHK94   : num  20000 2500 2000 400 40000 300 4000 1500 200000 50000 ...
##  $ S306    : num  0 1 0 0 0 0 0 1 1 0 ...
##  $ DBT94   : num  0 30000 0 0 0 0 0 25000 500 0 ...
##  $ S308    : num  0 0 1 0 1 0 0 1 0 0 ...
##  $ REAL94  : num  0 0 20000 0 8000 0 0 1000 0 0 ...
##  $ S310    : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ STOCK94 : num  55000 0 0 0 0 0 0 1200 60000 0 ...
##  $ S313    : num  20000 5000 1000 2000 40000 0 3000 5000 55000 13000 ...
##  $ OTHASS94: num  0 1 0 0 0 1 0 0 0 0 ...
##  $ VALU94  : num  0 10000 0 0 0 1000 0 0 0 0 ...
##  $ NFA94   : num  95000 -12500 23000 2400 88000 ...
##  $ NW94    : num  165000 11500 27000 4400 188000 ...
##  $ equity94: num  70000 24000 4000 2000 100000 0 47000 45000 122000 0 ...
##  $ home94  : num  1 1 1 1 1 0 1 1 1 0 ...

# want to find out the type of data structure for a particular variable? Try it yourself.
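One possible answer, sketched on a toy data frame so that it runs on its own (the data frame toy is hypothetical; with the lab data you would write wealth$equity94 instead):

```r
# Toy stand-in for the wealth data, using a few values shown by str() above.
toy <- data.frame(equity94 = c(70000, 24000, 4000), home94 = c(1, 1, 1))

# class() reports the type of a single variable.
equity_class <- class(toy$equity94)   # "numeric"
is.numeric(toy$equity94)              # TRUE

# sapply() applies class() to every column at once.
all_classes <- sapply(toy, class)
all_classes
```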

Examine the summary statistics of the variables in the dataset. What can you learn from them?

summary(wealth)

##       S300      FAMID94           S302            Bus94        
##  Min.   :2   Min.   :   14   Min.   :0.0000   Min.   :      0  
##  1st Qu.:2   1st Qu.: 5297   1st Qu.:0.0000   1st Qu.:      0  
##  Median :2   Median : 7908   Median :0.0000   Median :      0  
##  Mean   :2   Mean   : 8515   Mean   :0.1042   Mean   :  16430  
##  3rd Qu.:2   3rd Qu.:12351   3rd Qu.:0.0000   3rd Qu.:      0  
##  Max.   :2   Max.   :16970   Max.   :1.0000   Max.   :5000000  
##       S304            CHK94              S306            DBT94        
##  Min.   :0.0000   Min.   :      0   Min.   :0.0000   Min.   :      0  
##  1st Qu.:0.0000   1st Qu.:      0   1st Qu.:0.0000   1st Qu.:      0  
##  Median :1.0000   Median :   1000   Median :0.0000   Median :      0  
##  Mean   :0.6592   Mean   :  13574   Mean   :0.4711   Mean   :   4937  
##  3rd Qu.:1.0000   3rd Qu.:   8000   3rd Qu.:1.0000   3rd Qu.:   4000  
##  Max.   :1.0000   Max.   :1250000   Max.   :1.0000   Max.   :2000000  
##       S308            REAL94             S310           STOCK94       
##  Min.   :0.0000   Min.   :      0   Min.   :0.0000   Min.   :      0  
##  1st Qu.:0.0000   1st Qu.:      0   1st Qu.:0.0000   1st Qu.:      0  
##  Median :0.0000   Median :      0   Median :0.0000   Median :      0  
##  Mean   :0.1365   Mean   :  16373   Mean   :0.2528   Mean   :  17782  
##  3rd Qu.:0.0000   3rd Qu.:      0   3rd Qu.:1.0000   3rd Qu.:      0  
##  Max.   :1.0000   Max.   :7000000   Max.   :1.0000   Max.   :9999997  
##       S313           OTHASS94         VALU94            NFA94         
##  Min.   :     0   Min.   :0.000   Min.   :      0   Min.   : -694000  
##  1st Qu.:   700   1st Qu.:0.000   1st Qu.:      0   1st Qu.:      59  
##  Median :  5000   Median :0.000   Median :      0   Median :    9125  
##  Mean   :  9017   Mean   :0.207   Mean   :   7572   Mean   :   75811  
##  3rd Qu.: 12000   3rd Qu.:0.000   3rd Qu.:      0   3rd Qu.:   48000  
##  Max.   :270000   Max.   :1.000   Max.   :1100000   Max.   :10284997  
##       NW94             equity94           home94      
##  Min.   : -644000   Min.   :-682000   Min.   :0.0000  
##  1st Qu.:    1500   1st Qu.:      0   1st Qu.:0.0000  
##  Median :   26200   Median :   5550   Median :1.0000  
##  Mean   :  110187   Mean   :  34376   Mean   :0.5568  
##  3rd Qu.:  100000   3rd Qu.:  45000   3rd Qu.:1.0000  
##  Max.   :10584997   Max.   :9999996   Max.   :1.0000

summary(wealth$equity94)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -682000       0    5550   34376   45000 9999996
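In the output above, the mean of equity94 (34376) is much larger than the median (5550), which usually signals a right-skewed distribution. The sketch below makes the same comparison on toy data: only the first ten equity94 values printed by str(), so the numbers differ from the full dataset.

```r
# First ten equity94 values from the str() output, used here as toy data.
equity <- c(70000, 24000, 4000, 2000, 100000, 0, 47000, 45000, 122000, 0)

equity_mean <- mean(equity)
equity_median <- median(equity)

equity_mean                  # 41400
equity_median                # 34500
equity_mean > equity_median  # TRUE: a few large values pull the mean up
```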

Create a histogram for the variable equity94 and examine it. How would you describe the distribution?

hist(wealth$equity94)
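The default histogram can be made easier to read with a title, an axis label, and a suggested number of bins. The sketch below uses the first ten equity94 values from the str() output as toy data; the title and label text are just examples.

```r
# First ten equity94 values from the str() output, used here as toy data.
equity <- c(70000, 24000, 4000, 2000, 100000, 0, 47000, 45000, 122000, 0)

# plot = FALSE returns the bin information without drawing anything.
bins <- hist(equity, breaks = 5, plot = FALSE)
bins$breaks   # bin boundaries chosen by R (breaks = 5 is only a suggestion)
bins$counts   # number of observations in each bin

# Draw the histogram with a title and an axis label.
hist(equity, breaks = 5,
     main = "Home equity (toy data)",
     xlab = "Equity in dollars")
```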

Create a boxplot for equity94. What can you observe from the graph?

boxplot(wealth$equity94)
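The points drawn beyond the whiskers of a boxplot are the values R flags as outliers. If you want those values as numbers rather than dots, boxplot.stats() returns them. A minimal sketch on a toy vector (the values are hypothetical, apart from the maximum, which is the largest equity94 value in the summary above):

```r
# Toy equity values with one extreme observation.
equity <- c(0, 2000, 4000, 24000, 45000, 47000, 70000, 9999996)

# boxplot.stats() computes the numbers behind the boxplot.
stats <- boxplot.stats(equity)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers <- stats$out
outliers      # 9999996
```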

Step 3. Before proceeding to analyze our data, we have to do some data cleaning.

Let’s create a new variable named equity94R by subtracting net financial assets from net worth.

wealth$equity94R <- wealth$NW94 - wealth$NFA94

New variables can also be created using the ifelse() function. The new variable home94R is equal to 0 if equity94 equals 0 and 1 otherwise.

wealth$home94R<-ifelse(wealth$equity94==0,0,1)
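ifelse() checks the condition for every element and picks the second argument where it is TRUE and the third where it is FALSE. A small sketch on a toy vector (the values are taken from the equity94 summary above, but the vector itself is made up):

```r
# Toy equity values: positive, zero, negative, positive.
equity <- c(70000, 0, -682000, 5550)

# 0 where equity is exactly 0, 1 everywhere else.
recoded <- ifelse(equity == 0, 0, 1)
recoded   # 1 0 1 1
```

Note that a negative equity value is also coded as 1, because the condition only tests for equality with 0.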

Sometimes our minds play tricks on us, and we want to delete a variable we have created. For instance, you can remove the variable “equity94R” from your wealth dataset as follows:

wealth$equity94R<-NULL

We can assign value labels to the variable home94R (0 = no, 1=yes).

wealth$home94R <- factor(wealth$home94R,levels = c(0,1),labels = c("no", "yes"))
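To see what factor() does with the levels and labels arguments, here is a sketch on a short toy vector of 0/1 codes (codes and labelled are illustrative names):

```r
# Hypothetical 0/1 codes.
codes <- c(1, 1, 0, 1, 0)

# levels says which values to expect; labels gives each one a name, in the same order.
labelled <- factor(codes, levels = c(0, 1), labels = c("no", "yes"))

labelled                    # yes yes no yes no
factor_levels <- levels(labelled)       # "no" "yes"
internal_codes <- as.integer(labelled)  # 2 2 1 2 1
```

The internal codes are positions in the levels (1 = "no", 2 = "yes"), not the original 0/1 values, so as.integer() on a factor does not give back the original numbers.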

Subset the data in three ways: first by the family ID number 16922, then by the value of home94 equal to 1, and lastly by adding the condition that NFA94 is greater than 10000.

newdf <- subset(wealth, wealth$FAMID94==16922)
str(wealth$home94)

##  num [1:8628] 1 1 1 1 1 0 1 1 1 0 ...

newdf1 <- subset(wealth, wealth$home94==1) 
newdf2 <- subset(wealth, wealth$home94==1 & wealth$NFA94 > 10000)
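Inside subset() you can also refer to columns by name without the wealth$ prefix, and the same rows can be selected with square brackets. The sketch below shows both on a toy data frame; the four rows are hypothetical values chosen for illustration.

```r
# Toy stand-in for the wealth data (hypothetical rows).
toy <- data.frame(
  FAMID94 = c(14, 15, 34, 16922),
  home94  = c(1, 1, 0, 1),
  NFA94   = c(95000, -12500, 2400, 88000)
)

one_family <- subset(toy, FAMID94 == 16922)               # 1 row
owners     <- subset(toy, home94 == 1)                    # 3 rows
owners_10k <- subset(toy, home94 == 1 & NFA94 > 10000)    # 2 rows

# The same selection with square brackets: rows before the comma, all columns after.
owners_10k_b <- toy[toy$home94 == 1 & toy$NFA94 > 10000, ]
```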

Try it yourself!

Create a frequency table of a factor variable. The frequencies are ordered and labelled by the levels attribute of the factor.

table(wealth$home94R)

## 
##   no  yes 
## 3824 4804
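Counts are often easier to interpret as proportions. prop.table() converts a frequency table into shares that sum to 1. The sketch below rebuilds a factor with the same counts as the output above (3824 "no", 4804 "yes") so that it runs without the data file; home is an illustrative name.

```r
# Rebuild a factor with the counts shown above.
home <- factor(rep(c("no", "yes"), times = c(3824, 4804)))

counts <- table(home)
shares <- round(prop.table(counts), 4)
shares   # no: 0.4432, yes: 0.5568
```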

# How many 0s and 1s in home94R?

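R operators: R has several operators to perform tasks including arithmetic, logical and bitwise operations. The comparison and logical operators used most often when subsetting and recoding data are listed here.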

Operator	Meaning
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
!x	Not x
x | y	x OR y
x & y	x AND y
isTRUE(x)	test if x is TRUE
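The operators in the table can be tried directly in the console. A short sketch with two arbitrary numbers:

```r
x <- 5
y <- 10

less    <- x < y                 # TRUE
equal   <- x == y                # FALSE
differs <- x != y                # TRUE
negated <- !(x < y)              # FALSE
either  <- (x < y) | (x == y)    # TRUE: at least one condition holds
both    <- (x < y) & (x == y)    # FALSE: both conditions must hold
checked <- isTRUE(x < y)         # TRUE
```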

Quit RStudio

When you quit RStudio, remember to agree to “Save workspace image to ~/.RData” when the prompt window pops up.

Take a breath! Resources are everywhere! And, you won’t break R no matter how hard you try! :)

?functionName or the “Help” tab - look at the R documentation
Google - “the error message”, “how to rename a variable in r”
R communities - look for posts on Stackoverflow.com/ , www.r-bloggers.com/
Some good advice - https://www.r-bloggers.com/the-5-most-effective-ways-to-learn-r/
My contact email - chiuwu@upenn.edu