First Level of Input
For a computer, anything you type should be either a number, a string (text), or part of the language's syntax. When you do data analysis, you should ask yourself:
- What kind of data types do my datasets include? Are they all numbers or strings?
- The default type is a number. Here is what you will get when you type hello in R or Python:
hello
## Error in eval(expr, envir, enclos): object 'hello' not found
But if you type a number, like 156 or 3, it will show:
156
## [1] 156
2:8
## [1] 2 3 4 5 6 7 8
So, for any string or character value, you need to use quotation marks:
'Hello'
## [1] "Hello"
or
c <- "Hello"
c
## [1] "Hello"
Some words are used as syntax, which means they represent certain functions in R. For instance:
print
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x7fccf3d34030>
## <environment: namespace:base>
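Because print is itself a function, you can also call it explicitly on a value instead of relying on auto-printing; a small illustration:
print("Hello") # same output as typing "Hello" at the console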
Welcome to the programming world; you have just finished 20% of an introductory computer science course!
The following code chunk is from a Python environment:
x = 'hello, python world!'
x
print(x)
## hello, python world!
hello
## NameError: name 'hello' is not defined
##
## Detailed traceback:
## File "<string>", line 1, in <module>
3.6
5*10
print(5*10)
## 50
Second Level of Input
We will show how to build lists, matrices, data frames, time series, and so on.
l <- list(1, 2, 3, 4) # list() creates a list
l
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
class(l)
## [1] "list"
l[2] # [] is the indexing operator; it extracts specific elements from the list. We use it quite often.
## [[1]]
## [1] 2
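Note the difference between single and double brackets on a list: single brackets return a sub-list, while double brackets extract the element itself. A minimal sketch:
l[[2]] # the number 2 itself, not wrapped in a list
class(l[2]) # "list"
class(l[[2]]) # "numeric"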
v <- c(1,2,3,4)
v
## [1] 1 2 3 4
v <- c("apple", 'orange', 'Mango')
v
## [1] "apple" "orange" "Mango"
m <- matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, byrow = T) # this creates a 3 x 3 matrix, filled by row
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
m[2,1] # indexing: row 2, column 1
## [1] 4
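Leaving one index empty returns a whole row or column; a minimal sketch:
m[2, ] # the entire second row: 4 5 6
m[, 3] # the entire third column: 3 6 9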
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)
df
## n s b
## 1 2 aa TRUE
## 2 3 bb FALSE
## 3 5 cc TRUE
df[2,2] # indexing: row 2, column 2
## [1] bb
## Levels: aa bb cc
df[,2]
## [1] aa bb cc
## Levels: aa bb cc
df[2,]
## n s b
## 2 3 bb FALSE
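Data frame columns can also be selected by name with $ or [[ ]], and rows can be filtered by a condition; a small sketch using the df built above:
df$s # the whole s column
df[["b"]] # TRUE FALSE TRUE
df[df$b == TRUE, ] # only the rows where b is TRUE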
mtcars # mtcars is a built-in data frame
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
dim(mtcars) # gives the dimension of mtcars
## [1] 32 11
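Since mtcars is an ordinary data frame, the same indexing tools apply to it; a minimal sketch:
head(mtcars) # first six rows
mtcars$mpg # the mpg column as a numeric vector
mean(mtcars$mpg) # average miles per gallon across the 32 cars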
library(xts)
austres
## Qtr1 Qtr2 Qtr3 Qtr4
## 1971 13067.3 13130.5 13198.4
## 1972 13254.2 13303.7 13353.9 13409.3
## 1973 13459.2 13504.5 13552.6 13614.3
## 1974 13669.5 13722.6 13772.1 13832.0
## 1975 13862.6 13893.0 13926.8 13968.9
## 1976 14004.7 14033.1 14066.0 14110.1
## 1977 14155.6 14192.2 14231.7 14281.5
## 1978 14330.3 14359.3 14396.6 14430.8
## 1979 14478.4 14515.7 14554.9 14602.5
## 1980 14646.4 14695.4 14746.6 14807.4
## 1981 14874.4 14923.3 14988.7 15054.1
## 1982 15121.7 15184.2 15239.3 15288.9
## 1983 15346.2 15393.5 15439.0 15483.5
## 1984 15531.5 15579.4 15628.5 15677.3
## 1985 15736.7 15788.3 15839.7 15900.6
## 1986 15961.5 16018.3 16076.9 16139.0
## 1987 16203.0 16263.3 16327.9 16398.9
## 1988 16478.3 16538.2 16621.6 16697.0
## 1989 16777.2 16833.1 16891.6 16956.8
## 1990 17026.3 17085.4 17106.9 17169.4
## 1991 17239.4 17292.0 17354.2 17414.2
## 1992 17447.3 17482.6 17526.0 17568.7
## 1993 17627.1 17661.5
class(austres) # a ts object: no dimensions, but it has an index and coredata
## [1] "ts"
index(austres)
## [1] 1971.25 1971.50 1971.75 1972.00 1972.25 1972.50 1972.75 1973.00
## [9] 1973.25 1973.50 1973.75 1974.00 1974.25 1974.50 1974.75 1975.00
## [17] 1975.25 1975.50 1975.75 1976.00 1976.25 1976.50 1976.75 1977.00
## [25] 1977.25 1977.50 1977.75 1978.00 1978.25 1978.50 1978.75 1979.00
## [33] 1979.25 1979.50 1979.75 1980.00 1980.25 1980.50 1980.75 1981.00
## [41] 1981.25 1981.50 1981.75 1982.00 1982.25 1982.50 1982.75 1983.00
## [49] 1983.25 1983.50 1983.75 1984.00 1984.25 1984.50 1984.75 1985.00
## [57] 1985.25 1985.50 1985.75 1986.00 1986.25 1986.50 1986.75 1987.00
## [65] 1987.25 1987.50 1987.75 1988.00 1988.25 1988.50 1988.75 1989.00
## [73] 1989.25 1989.50 1989.75 1990.00 1990.25 1990.50 1990.75 1991.00
## [81] 1991.25 1991.50 1991.75 1992.00 1992.25 1992.50 1992.75 1993.00
## [89] 1993.25
coredata(austres)
## [1] 13067.3 13130.5 13198.4 13254.2 13303.7 13353.9 13409.3 13459.2
## [9] 13504.5 13552.6 13614.3 13669.5 13722.6 13772.1 13832.0 13862.6
## [17] 13893.0 13926.8 13968.9 14004.7 14033.1 14066.0 14110.1 14155.6
## [25] 14192.2 14231.7 14281.5 14330.3 14359.3 14396.6 14430.8 14478.4
## [33] 14515.7 14554.9 14602.5 14646.4 14695.4 14746.6 14807.4 14874.4
## [41] 14923.3 14988.7 15054.1 15121.7 15184.2 15239.3 15288.9 15346.2
## [49] 15393.5 15439.0 15483.5 15531.5 15579.4 15628.5 15677.3 15736.7
## [57] 15788.3 15839.7 15900.6 15961.5 16018.3 16076.9 16139.0 16203.0
## [65] 16263.3 16327.9 16398.9 16478.3 16538.2 16621.6 16697.0 16777.2
## [73] 16833.1 16891.6 16956.8 17026.3 17085.4 17106.9 17169.4 17239.4
## [81] 17292.0 17354.2 17414.2 17447.3 17482.6 17526.0 17568.7 17627.1
## [89] 17661.5
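If you prefer an explicit date index, the ts object can be converted to an xts object (a sketch, assuming xts is loaded as above; aus_xts is just an illustrative name):
aus_xts <- as.xts(austres) # convert the ts object to xts
class(aus_xts) # "xts" "zoo"
head(aus_xts) # quarterly observations with a date index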
Case Study I
First, you will learn how to use R to download major financial datasets. At this stage, we need to use packages, which are collections of functions that programmers develop to solve certain problems. In this case study, we will use packages such as quantmod, xts, and zoo.
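If any of these packages are missing on your machine, they can be installed once from CRAN; a minimal setup sketch:
install.packages(c("quantmod", "xts", "zoo", "tidyverse")) # run once per machine, then load with library()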
library(quantmod)
getSymbols("^GSPC") # get S&P 500 datasets
## [1] "GSPC"
sp <- GSPC # assign it to a new variable `sp`; you can also use =, such as sp = GSPC
head(sp) # gives the first six rows of the dataset
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2007-01-03 1418.03 1429.42 1407.86 1416.60 3429160000
## 2007-01-04 1416.60 1421.84 1408.43 1418.34 3004460000
## 2007-01-05 1418.34 1418.34 1405.75 1409.71 2919400000
## 2007-01-08 1409.26 1414.98 1403.97 1412.84 2763340000
## 2007-01-09 1412.84 1415.61 1405.42 1412.11 3038380000
## 2007-01-10 1408.70 1415.99 1405.32 1414.85 2764660000
## GSPC.Adjusted
## 2007-01-03 1416.60
## 2007-01-04 1418.34
## 2007-01-05 1409.71
## 2007-01-08 1412.84
## 2007-01-09 1412.11
## 2007-01-10 1414.85
tail(sp) # gives the last six rows of the dataset
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2018-06-28 2698.69 2724.34 2691.99 2716.31 3428140000
## 2018-06-29 2727.13 2743.26 2718.03 2718.37 3565620000
## 2018-07-02 2704.95 2727.26 2698.95 2726.71 3073650000
## 2018-07-03 2733.27 2736.58 2711.16 2713.22 1911470000
## 2018-07-05 2724.19 2737.83 2716.02 2736.61 2953420000
## 2018-07-06 2737.68 2764.41 2733.52 2759.82 2554780000
## GSPC.Adjusted
## 2018-06-28 2716.31
## 2018-06-29 2718.37
## 2018-07-02 2726.71
## 2018-07-03 2713.22
## 2018-07-05 2736.61
## 2018-07-06 2759.82
dim(sp) # gives the dimensions
## [1] 2898 6
So, every time you deal with datasets, I strongly recommend using the tidyverse package, which includes all the tools you need for most data-tidying jobs, such as merging, cleaning, and visualizing datasets. Basically, tidyverse can cover 90% of your data-cleaning work.
library(tidyverse)
class(sp) # the class function tells you what kind of class the dataset belongs to
## [1] "xts" "zoo"
names(sp) # note there is no date column: in an xts object the dates live in the index, not in a variable (unlike Stata, where you would need an explicit date variable)
## [1] "GSPC.Open" "GSPC.High" "GSPC.Low" "GSPC.Close"
## [5] "GSPC.Volume" "GSPC.Adjusted"
# write.csv(sp, file = '/Users/Michael/Desktop/sp.csv')
# If you write the xts object to a csv file this way, you won't find the dates there,
# so we need to convert this time-series dataset into data-frame format.
sp_df <- as_data_frame(coredata(sp)) # as_data_frame() comes from the tidyverse (the tibble package)
sp_df
## # A tibble: 2,898 x 6
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1418. 1429. 1408. 1417. 3429160000 1417.
## 2 1417. 1422. 1408. 1418. 3004460000 1418.
## 3 1418. 1418. 1406. 1410. 2919400000 1410.
## 4 1409. 1415. 1404. 1413. 2763340000 1413.
## 5 1413. 1416. 1405. 1412. 3038380000 1412.
## 6 1409. 1416. 1405. 1415. 2764660000 1415.
## 7 1415. 1427. 1415. 1424. 2857870000 1424.
## 8 1424. 1431. 1423. 1431. 2686480000 1431.
## 9 1431. 1434. 1429. 1432. 2599530000 1432.
## 10 1432. 1435. 1429. 1431. 2690270000 1431.
## # ... with 2,888 more rows
sp_df$Dates <- index(sp)
sp_df # now we have dates
## # A tibble: 2,898 x 7
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1418. 1429. 1408. 1417. 3429160000 1417.
## 2 1417. 1422. 1408. 1418. 3004460000 1418.
## 3 1418. 1418. 1406. 1410. 2919400000 1410.
## 4 1409. 1415. 1404. 1413. 2763340000 1413.
## 5 1413. 1416. 1405. 1412. 3038380000 1412.
## 6 1409. 1416. 1405. 1415. 2764660000 1415.
## 7 1415. 1427. 1415. 1424. 2857870000 1424.
## 8 1424. 1431. 1423. 1431. 2686480000 1431.
## 9 1431. 1434. 1429. 1432. 2599530000 1432.
## 10 1432. 1435. 1429. 1431. 2690270000 1431.
## # ... with 2,888 more rows, and 1 more variable: Dates <date>
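If you would rather have the Dates column first, dplyr's select() together with the everything() helper can reorder the columns without dropping any (a sketch; sp_df_ord is just an illustrative name, and the rest of the case study keeps using sp_df):
sp_df_ord <- select(sp_df, Dates, everything()) # a copy with Dates moved to the front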
Now, let’s do some more advanced work, which includes subsetting the dataset and selecting the main variables we are interested in.
names(sp_df)
## [1] "GSPC.Open" "GSPC.High" "GSPC.Low" "GSPC.Close"
## [5] "GSPC.Volume" "GSPC.Adjusted" "Dates"
names(sp_df) <- c('Open', "High", "Low", "Close", "Volume", "Adjusted", "Dates")
sp_df
## # A tibble: 2,898 x 7
## Open High Low Close Volume Adjusted Dates
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1418. 1429. 1408. 1417. 3429160000 1417. 2007-01-03
## 2 1417. 1422. 1408. 1418. 3004460000 1418. 2007-01-04
## 3 1418. 1418. 1406. 1410. 2919400000 1410. 2007-01-05
## 4 1409. 1415. 1404. 1413. 2763340000 1413. 2007-01-08
## 5 1413. 1416. 1405. 1412. 3038380000 1412. 2007-01-09
## 6 1409. 1416. 1405. 1415. 2764660000 1415. 2007-01-10
## 7 1415. 1427. 1415. 1424. 2857870000 1424. 2007-01-11
## 8 1424. 1431. 1423. 1431. 2686480000 1431. 2007-01-12
## 9 1431. 1434. 1429. 1432. 2599530000 1432. 2007-01-16
## 10 1432. 1435. 1429. 1431. 2690270000 1431. 2007-01-17
## # ... with 2,888 more rows
sp_df_1 <- select(sp_df, Open:Close, Adjusted, Dates) # keep a range of columns plus two named columns
sp_df_2 <- select(sp_df, Open, Close, Volume:Dates) # mix single columns and a range
sp_df_3 <- select(sp_df, -Adjusted) # drop a column with a minus sign
mean(sp_df$Volume) # average daily trading volume
## [1] 3969124134
sp_df_vl <- filter(sp_df, Volume >= mean(sp_df$Volume)) # keep only days with above-average volume; you can filter on any condition
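The same steps can be chained with the tidyverse pipe %>%, which passes the result of one function into the next; a sketch combining the volume filter with a column selection (sp_df_vl2 is just an illustrative name):
sp_df_vl2 <- sp_df %>%
  filter(Volume >= mean(Volume)) %>% # keep above-average-volume days
  select(Open, Close, Volume, Dates) # keep only a few columns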
library(lubridate)
sp_df_dt <- filter(sp_df, Dates >= ymd(20100101)) # all data from 2010 onward
# write_csv()
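Finally, any of these data frames can be saved with readr's write_csv() (loaded as part of the tidyverse); a sketch with an illustrative file name:
write_csv(sp_df_dt, "sp500_since_2010.csv") # the file name here is just an example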