1. (Optional) Email me (chiuwu@upenn.edu) your learning goals and how you envision the optimal learning environment for you.
2. Assignment 1 is due at 12:59pm on February 4th, 2019 (next Monday).
In RStudio, create a new project from the “File” menu and select a preferred file location. A project keeps all the files associated with a piece of work organized together, with its own working directory, workspace, history, and source documents.
Download the dataset titled WLTH1994.csv from the Canvas site into your project folder.
Then, open a new R script. An R script is just a plain text file with R commands in it. Because you can run all of the commands at once and rerun them later, a script can save you a lot of time.
Check your working directory, which is the folder where your files are saved. Is the path correct? Yes, and I can see the data file in my Files Pane on the right (in my RStudio) as well.
getwd()
## [1] "C:/Users/bijou/Box Sync/Teaching/MSSP897/MSSP897_lab"
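You can also confirm from the console that the data file is in the working directory. This is a minimal sketch that assumes the file kept the name WLTH1994.csv when you downloaded it:

```r
# TRUE if the data file sits in the current working directory
csv_found <- file.exists("WLTH1994.csv")
csv_found

# List every .csv file R can see in the working directory
csv_files <- list.files(pattern = "\\.csv$")
csv_files
```

If the first result is FALSE, the file was probably saved somewhere other than the project folder.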
‘wealth’ is the object where the data will be stored. If the argument “header=” is “TRUE”, then the first row of the file will be treated as the column (variable) names.
wealth <- read.csv("WLTH1994.csv",header=TRUE, sep=",")
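To see what “header=” changes, here is a small sketch that reads two rows of CSV text directly instead of a file. The object names `csv_text`, `with_header`, and `no_header` are only for illustration:

```r
# Two rows of values in CSV form; the first line holds the column names
csv_text <- "CHK94,DBT94\n20000,0\n2500,30000"

with_header <- read.csv(text = csv_text, header = TRUE, sep = ",")
no_header   <- read.csv(text = csv_text, header = FALSE, sep = ",")

colnames(with_header)  # "CHK94" "DBT94"
nrow(with_header)      # 2

colnames(no_header)    # "V1" "V2"
nrow(no_header)        # 3, because the name row is read as data
```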
If your data is not stored in the project folder, you will need to give the full path to the file. Remember to use the forward slash “/”.
wealth <- read.csv("C:/Users/bijou/Box Sync/Teaching/MSSP897/MSSP897_lab/WLTH1994.csv",header=TRUE, sep=",")
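One way to avoid typing slashes by hand is file.path(), sketched below. The folder name is hypothetical, so replace it with your own location:

```r
# Hypothetical folder, for illustration only
data_dir <- "C:/Users/yourname/Documents/lab"

# file.path() joins the pieces with forward slashes
csv_path <- file.path(data_dir, "WLTH1994.csv")
csv_path  # "C:/Users/yourname/Documents/lab/WLTH1994.csv"

# A path copied from Windows File Explorer uses backslashes,
# and each one must be doubled inside an R string
same_path <- "C:\\Users\\yourname\\Documents\\lab\\WLTH1994.csv"
```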
What if our data was saved as an Excel sheet (.xlsx)? We first install a new package called “readxl” with install.packages("readxl") and then run the following code:
library(readxl)
wealth <- read_excel("WLTH1994.xlsx")
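A slightly more defensive version of the same step is sketched below. It assumes the Excel file is named WLTH1994.xlsx and that the data sit on the first sheet:

```r
# install.packages() is needed only once per computer;
# the package must be loaded (or called with readxl::) in every new session
has_readxl <- requireNamespace("readxl", quietly = TRUE)

if (!has_readxl) {
  message('readxl is not installed yet. Run install.packages("readxl") first.')
} else if (file.exists("WLTH1994.xlsx")) {
  # sheet = 1 is an assumption about where the data are stored
  wealth <- readxl::read_excel("WLTH1994.xlsx", sheet = 1)
} else {
  message("WLTH1994.xlsx was not found in the working directory.")
}
```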
Alternatively, you can open it through the Files pane by clicking the .xlsx file.
Examine the dimensions of your dataset. dim() returns two numbers: (1) the number of rows and (2) the number of columns.
dim(wealth)
## [1] 8628 19
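If you need only one of the two numbers, nrow() and ncol() return them separately. A small sketch on a three-row stand-in for the wealth data:

```r
# Small data frame standing in for wealth
toy <- data.frame(FAMID94 = c(14, 15, 18), home94 = c(1, 1, 1))

dim(toy)   # 3 2
nrow(toy)  # 3
ncol(toy)  # 2
```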
Turn off scientific notation when R prints numbers. A large “scipen” value makes R prefer ordinary fixed notation.
options(scipen=999)
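The effect is easiest to see on a single large number. This sketch sets R's default first (scipen = 0) so the comparison is clear, and restores your previous setting at the end:

```r
old <- options(scipen = 0)   # R's default setting
default_fmt <- format(1e8)
default_fmt                  # "1e+08"

options(scipen = 999)        # strongly prefer fixed notation
fixed_fmt <- format(1e8)
fixed_fmt                    # "100000000"

options(old)                 # put the earlier setting back
```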
What are the variables included in the dataset?
colnames(wealth)
## [1] "S300" "FAMID94" "S302" "Bus94" "S304" "CHK94"
## [7] "S306" "DBT94" "S308" "REAL94" "S310" "STOCK94"
## [13] "S313" "OTHASS94" "VALU94" "NFA94" "NW94" "equity94"
## [19] "home94"
Examine the data structure of the variables in the data frame (factor, numeric, integer, etc.).
str(wealth)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8628 obs. of 19 variables:
## $ S300 : num 2 2 2 2 2 2 2 2 2 2 ...
## $ FAMID94 : num 14 15 18 19 24 34 48 51 60 86 ...
## $ S302 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Bus94 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ S304 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ CHK94 : num 20000 2500 2000 400 40000 300 4000 1500 200000 50000 ...
## $ S306 : num 0 1 0 0 0 0 0 1 1 0 ...
## $ DBT94 : num 0 30000 0 0 0 0 0 25000 500 0 ...
## $ S308 : num 0 0 1 0 1 0 0 1 0 0 ...
## $ REAL94 : num 0 0 20000 0 8000 0 0 1000 0 0 ...
## $ S310 : num 1 0 0 0 0 0 0 1 1 0 ...
## $ STOCK94 : num 55000 0 0 0 0 0 0 1200 60000 0 ...
## $ S313 : num 20000 5000 1000 2000 40000 0 3000 5000 55000 13000 ...
## $ OTHASS94: num 0 1 0 0 0 1 0 0 0 0 ...
## $ VALU94 : num 0 10000 0 0 0 1000 0 0 0 0 ...
## $ NFA94 : num 95000 -12500 23000 2400 88000 ...
## $ NW94 : num 165000 11500 27000 4400 188000 ...
## $ equity94: num 70000 24000 4000 2000 100000 0 47000 45000 122000 0 ...
## $ home94 : num 1 1 1 1 1 0 1 1 1 0 ...
# want to find out the type of data structure for a particular variable? Try it yourself.
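One possible approach is class(), shown here on a stand-in data frame rather than the real data:

```r
# Stand-in for one column of wealth
toy <- data.frame(equity94 = c(70000, 24000, 4000))

var_class <- class(toy$equity94)
var_class                 # "numeric"
is.numeric(toy$equity94)  # TRUE
is.factor(toy$equity94)   # FALSE
```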
Examine the summary statistics of the variables in the dataset. What can you learn from them?
summary(wealth)
## S300 FAMID94 S302 Bus94
## Min. :2 Min. : 14 Min. :0.0000 Min. : 0
## 1st Qu.:2 1st Qu.: 5297 1st Qu.:0.0000 1st Qu.: 0
## Median :2 Median : 7908 Median :0.0000 Median : 0
## Mean :2 Mean : 8515 Mean :0.1042 Mean : 16430
## 3rd Qu.:2 3rd Qu.:12351 3rd Qu.:0.0000 3rd Qu.: 0
## Max. :2 Max. :16970 Max. :1.0000 Max. :5000000
## S304 CHK94 S306 DBT94
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.: 0
## Median :1.0000 Median : 1000 Median :0.0000 Median : 0
## Mean :0.6592 Mean : 13574 Mean :0.4711 Mean : 4937
## 3rd Qu.:1.0000 3rd Qu.: 8000 3rd Qu.:1.0000 3rd Qu.: 4000
## Max. :1.0000 Max. :1250000 Max. :1.0000 Max. :2000000
## S308 REAL94 S310 STOCK94
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.: 0
## Median :0.0000 Median : 0 Median :0.0000 Median : 0
## Mean :0.1365 Mean : 16373 Mean :0.2528 Mean : 17782
## 3rd Qu.:0.0000 3rd Qu.: 0 3rd Qu.:1.0000 3rd Qu.: 0
## Max. :1.0000 Max. :7000000 Max. :1.0000 Max. :9999997
## S313 OTHASS94 VALU94 NFA94
## Min. : 0 Min. :0.000 Min. : 0 Min. : -694000
## 1st Qu.: 700 1st Qu.:0.000 1st Qu.: 0 1st Qu.: 59
## Median : 5000 Median :0.000 Median : 0 Median : 9125
## Mean : 9017 Mean :0.207 Mean : 7572 Mean : 75811
## 3rd Qu.: 12000 3rd Qu.:0.000 3rd Qu.: 0 3rd Qu.: 48000
## Max. :270000 Max. :1.000 Max. :1100000 Max. :10284997
## NW94 equity94 home94
## Min. : -644000 Min. :-682000 Min. :0.0000
## 1st Qu.: 1500 1st Qu.: 0 1st Qu.:0.0000
## Median : 26200 Median : 5550 Median :1.0000
## Mean : 110187 Mean : 34376 Mean :0.5568
## 3rd Qu.: 100000 3rd Qu.: 45000 3rd Qu.:1.0000
## Max. :10584997 Max. :9999996 Max. :1.0000
summary(wealth$equity94)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -682000 0 5550 34376 45000 9999996
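Each number in the summary can also be computed on its own. The sketch below uses only the first ten equity94 values shown in the str() output, so the results differ from the full-sample summary:

```r
# First ten equity94 values from the str() output
equity <- c(70000, 24000, 4000, 2000, 100000, 0, 47000, 45000, 122000, 0)

eq_mean   <- mean(equity)
eq_median <- median(equity)
eq_mean    # 41400
eq_median  # 34500

# Quartiles and the smallest and largest values
quantile(equity, probs = c(0.25, 0.5, 0.75))
range(equity)
```

Comparing the mean with the median is a quick first check on how symmetric a variable is.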
Create a histogram for the variable equity94 and examine it. How would you describe the distribution?
hist(wealth$equity94)
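hist() accepts optional arguments for the number of bins and the labels. A sketch on ten illustrative values, where the title and axis label are my own wording:

```r
# First ten equity94 values from the str() output
equity <- c(70000, 24000, 4000, 2000, 100000, 0, 47000, 45000, 122000, 0)

# breaks is a suggestion for the number of bins; R may adjust it
h <- hist(equity,
          breaks = 5,
          main = "Distribution of equity94",
          xlab = "equity94")

# The returned object stores the bin edges and the count in each bin
h$breaks
h$counts
```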
Create a boxplot for equity94. What can you observe from the graph?
boxplot(wealth$equity94)
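The numbers behind a boxplot can be inspected with boxplot.stats(). The values below are illustrative; only the last one, 9999996, is the maximum reported by summary():

```r
# Illustrative values plus the reported maximum of equity94
equity <- c(0, 0, 2000, 4000, 24000, 45000, 47000, 70000, 100000, 9999996)

bp <- boxplot.stats(equity)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats

# Values beyond the whiskers, drawn as individual points in the plot
bp$out
```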
Let’s create a new variable named equity94R by subtracting net financial assets from net worth.
wealth$equity94R <- wealth$NW94 - wealth$NFA94
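It is worth checking a new variable against something you already know. In the first five rows shown by str(), net worth minus net financial assets reproduces the existing equity94 column, as this sketch confirms:

```r
# First five rows of three columns, copied from the str() output
toy <- data.frame(
  NW94     = c(165000, 11500, 27000, 4400, 188000),
  NFA94    = c(95000, -12500, 23000, 2400, 88000),
  equity94 = c(70000, 24000, 4000, 2000, 100000)
)

toy$equity94R <- toy$NW94 - toy$NFA94

# TRUE when every derived value matches the original column
same_values <- all(toy$equity94R == toy$equity94)
same_values  # TRUE
```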
New variables can also be created using if-else statements. The new variable home94R is equal to 0 if equity94 equals 0 and 1 otherwise.
wealth$home94R<-ifelse(wealth$equity94==0,0,1)
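ifelse() works element by element. The short sketch below uses made-up values to show two details: a negative value is coded 1 because it is not equal to 0, and a missing value stays missing:

```r
# Made-up equity values, including a negative and a missing one
equity <- c(70000, 0, -500, NA)

home <- ifelse(equity == 0, 0, 1)
home  # 1 0 1 NA
```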
Sometimes our minds play tricks on us and we want to delete a variable. For instance, you can remove the variable “equity94R” from your wealth dataset as follows:
wealth$equity94R<-NULL
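Checking the column names before and after confirms that the variable is gone. A sketch on a two-row stand-in:

```r
# Two rows copied from the str() output
toy <- data.frame(NW94 = c(165000, 11500), NFA94 = c(95000, -12500))
toy$equity94R <- toy$NW94 - toy$NFA94
colnames(toy)  # "NW94" "NFA94" "equity94R"

# Assigning NULL drops the column
toy$equity94R <- NULL
colnames(toy)  # "NW94" "NFA94"
```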
We can assign value labels to the variable home94R (0 = no, 1=yes).
wealth$home94R <- factor(wealth$home94R,levels = c(0,1),labels = c("no", "yes"))
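After this step the variable is a factor, not a number. The sketch below, on four made-up codes, shows the labels and also the internal integer codes, which run 1 and 2 rather than 0 and 1:

```r
# Made-up 0/1 codes
home_code <- c(1, 0, 1, 1)

home_lab <- factor(home_code, levels = c(0, 1), labels = c("no", "yes"))
home_lab              # yes no yes yes
levels(home_lab)      # "no" "yes"

# The stored codes follow the order of the levels
as.integer(home_lab)  # 2 1 2 2
```

This matters if you later try arithmetic on the factor, for example taking its mean.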
Subset the data three ways: first by a specific family ID number (16922), then by the value of home94 equal to 1, and lastly by adding the condition that NFA94 is greater than 10000.
newdf <- subset(wealth, wealth$FAMID94==16922)
str(wealth$home94)
## num [1:8628] 1 1 1 1 1 0 1 1 1 0 ...
newdf1 <- subset(wealth, wealth$home94==1)
newdf2 <- subset(wealth, wealth$home94==1 & wealth$NFA94 > 10000)
Try it yourself!
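Inside subset() the column names can be written without the `wealth$` prefix, and square brackets give the same rows. A sketch using the first five rows from the str() output:

```r
# First five rows of three columns, copied from the str() output
toy <- data.frame(
  FAMID94 = c(14, 15, 18, 19, 24),
  home94  = c(1, 1, 1, 1, 1),
  NFA94   = c(95000, -12500, 23000, 2400, 88000)
)

# subset() looks up column names inside the data frame
by_subset <- subset(toy, home94 == 1 & NFA94 > 10000)

# Equivalent bracket form: rows before the comma, columns after
by_bracket <- toy[toy$home94 == 1 & toy$NFA94 > 10000, ]

by_subset$FAMID94  # 14 18 24
nrow(by_bracket)   # 3
```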
Create a frequency table of a factor variable. The frequencies are ordered and labelled by the levels attribute of the factor.
table(wealth$home94R)
##
## no yes
## 3824 4804
# How many 0s and 1s in home94R?
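Counts are often easier to read as proportions. This sketch rebuilds a factor with the same counts as the table above (3824 “no” and 4804 “yes”) and passes the table to prop.table():

```r
# Rebuild a factor with the counts reported above
home94R <- factor(rep(c("no", "yes"), times = c(3824, 4804)))

counts <- table(home94R)
shares <- round(prop.table(counts), 4)
shares  # no 0.4432, yes 0.5568
```

The “yes” share matches the mean of home94 in the summary output (0.5568).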
R operators: R has several operators to perform tasks including arithmetic, logical and bitwise operations.
| Operator | Meaning |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | NOT x |
| x \| y | x OR y |
| x & y | x AND y |
| isTRUE(x) | test if x is TRUE |
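A few of these operators applied to a short made-up vector. Comparisons are evaluated element by element:

```r
x <- c(5, 10, 15)

x > 5             # FALSE TRUE TRUE
x != 10           # TRUE FALSE TRUE
!(x > 5)          # TRUE FALSE FALSE
x < 10 | x > 10   # TRUE FALSE TRUE
x >= 10 & x < 15  # FALSE TRUE FALSE

# isTRUE() is TRUE only for a single TRUE value
all_positive <- isTRUE(all(x > 0))
all_positive      # TRUE
vector_check <- isTRUE(x > 5)
vector_check      # FALSE, because x > 5 has three elements
```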
When you quit RStudio, remember to agree to “save workspace image to ~/.RData” when the prompt window pops up.
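Objects can also be saved by hand in the same .RData format. This is only a sketch: the object name `wealth_demo` and the file name are made up, and the file goes to a temporary folder so nothing in your project is overwritten:

```r
# Made-up object standing in for your data
wealth_demo <- data.frame(FAMID94 = c(14, 15), home94 = c(1, 1))

# Temporary location, chosen for illustration
rdata_file <- file.path(tempdir(), "lab_objects.RData")

save(wealth_demo, file = rdata_file)  # write the object to disk
rm(wealth_demo)                       # remove it from the session
load(rdata_file)                      # bring it back

restored <- exists("wealth_demo")
restored  # TRUE
```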
?functionName or the “Help” tab - look at the R documentation
Google - search for the error message itself or for a question such as “how to rename a variable in r”
R communities - look for posts on Stackoverflow.com/ and www.r-bloggers.com/
Some good advice - https://www.r-bloggers.com/the-5-most-effective-ways-to-learn-r/
My contact email - chiuwu@upenn.edu