First Level of Input
For a computer, anything you type should be either a number, a string (text), or part of the language's syntax. When you do data analysis, you should ask yourself:
- What kind of data types do my datasets include? Are they all numbers or strings?
- The default type is a number. Here is what you will get when you type hello in R or Python:
hello
## Error in eval(expr, envir, enclos): object 'hello' not found
But if you type a number, like 156 or 3, it will show:
156
## [1] 156
2:8
## [1] 2 3 4 5 6 7 8
So, for any string or character value, you need to use quotation marks:
'Hello'
## [1] "Hello"
or
c <- "Hello"
c
## [1] "Hello"
Some words are used as syntax, which means they represent certain functions in R. For instance:
print
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x7fccf3d34030>
## <environment: namespace:base>
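Because print is itself a function, you can also call it explicitly on a value instead of relying on auto-printing; a small illustration:
print("Hello") # same output as typing "Hello" at the console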
Welcome to the programming world; you have just finished 20% of an introductory computer science course!
The following code chunk is from a Python environment:
x = 'hello, python world!'
x
print(x)
## hello, python world!
hello
## NameError: name 'hello' is not defined
##
## Detailed traceback:
## File "<string>", line 1, in <module>
3.6
5*10
print(5*10)
## 50
Second Level of Input
We will show how to build lists, matrices, data frames, time series, and so on.
l <- list(1, 2, 3, 4) # list() creates a list
l
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
class(l)
## [1] "list"
l[2] # [] is the indexing operator; it extracts specific elements from the list. We use it quite often.
## [[1]]
## [1] 2
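Note the difference between single and double brackets on a list: single brackets return a sub-list, while double brackets extract the element itself. A minimal sketch:
l[[2]] # the number 2 itself, not wrapped in a list
class(l[2]) # "list"
class(l[[2]]) # "numeric"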
v <- c(1,2,3,4)
v
## [1] 1 2 3 4
v <- c("apple", 'orange', 'Mango')
v
## [1] "apple" "orange" "Mango"
m <- matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, byrow = T) # this creates a 3 x 3 matrix, filled by row
m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
m[2,1] # indexing: row 2, column 1
## [1] 4
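Leaving one index empty returns a whole row or column; a minimal sketch:
m[2, ] # the entire second row: 4 5 6
m[, 3] # the entire third column: 3 6 9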
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)
df
## n s b
## 1 2 aa TRUE
## 2 3 bb FALSE
## 3 5 cc TRUE
df[2,2] # indexing: row 2, column 2
## [1] bb
## Levels: aa bb cc
df[,2]
## [1] aa bb cc
## Levels: aa bb cc
df[2,]
## n s b
## 2 3 bb FALSE
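Data frame columns can also be selected by name with $ or [[ ]], and rows can be filtered by a condition; a small sketch using the df built above:
df$s # the whole s column
df[["b"]] # TRUE FALSE TRUE
df[df$b == TRUE, ] # only the rows where b is TRUE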
mtcars # mtcars is a built-in data frame
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
dim(mtcars) # gives the dimension of mtcars
## [1] 32 11
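Since mtcars is an ordinary data frame, the same indexing tools apply to it; a minimal sketch:
head(mtcars) # first six rows
mtcars$mpg # the mpg column as a numeric vector
mean(mtcars$mpg) # average miles per gallon across the 32 cars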
library(xts)
austres
## Qtr1 Qtr2 Qtr3 Qtr4
## 1971 13067.3 13130.5 13198.4
## 1972 13254.2 13303.7 13353.9 13409.3
## 1973 13459.2 13504.5 13552.6 13614.3
## 1974 13669.5 13722.6 13772.1 13832.0
## 1975 13862.6 13893.0 13926.8 13968.9
## 1976 14004.7 14033.1 14066.0 14110.1
## 1977 14155.6 14192.2 14231.7 14281.5
## 1978 14330.3 14359.3 14396.6 14430.8
## 1979 14478.4 14515.7 14554.9 14602.5
## 1980 14646.4 14695.4 14746.6 14807.4
## 1981 14874.4 14923.3 14988.7 15054.1
## 1982 15121.7 15184.2 15239.3 15288.9
## 1983 15346.2 15393.5 15439.0 15483.5
## 1984 15531.5 15579.4 15628.5 15677.3
## 1985 15736.7 15788.3 15839.7 15900.6
## 1986 15961.5 16018.3 16076.9 16139.0
## 1987 16203.0 16263.3 16327.9 16398.9
## 1988 16478.3 16538.2 16621.6 16697.0
## 1989 16777.2 16833.1 16891.6 16956.8
## 1990 17026.3 17085.4 17106.9 17169.4
## 1991 17239.4 17292.0 17354.2 17414.2
## 1992 17447.3 17482.6 17526.0 17568.7
## 1993 17627.1 17661.5
class(austres) # a ts object: no dimensions, but it has an index and coredata
## [1] "ts"
index(austres)
## [1] 1971.25 1971.50 1971.75 1972.00 1972.25 1972.50 1972.75 1973.00
## [9] 1973.25 1973.50 1973.75 1974.00 1974.25 1974.50 1974.75 1975.00
## [17] 1975.25 1975.50 1975.75 1976.00 1976.25 1976.50 1976.75 1977.00
## [25] 1977.25 1977.50 1977.75 1978.00 1978.25 1978.50 1978.75 1979.00
## [33] 1979.25 1979.50 1979.75 1980.00 1980.25 1980.50 1980.75 1981.00
## [41] 1981.25 1981.50 1981.75 1982.00 1982.25 1982.50 1982.75 1983.00
## [49] 1983.25 1983.50 1983.75 1984.00 1984.25 1984.50 1984.75 1985.00
## [57] 1985.25 1985.50 1985.75 1986.00 1986.25 1986.50 1986.75 1987.00
## [65] 1987.25 1987.50 1987.75 1988.00 1988.25 1988.50 1988.75 1989.00
## [73] 1989.25 1989.50 1989.75 1990.00 1990.25 1990.50 1990.75 1991.00
## [81] 1991.25 1991.50 1991.75 1992.00 1992.25 1992.50 1992.75 1993.00
## [89] 1993.25
coredata(austres)
## [1] 13067.3 13130.5 13198.4 13254.2 13303.7 13353.9 13409.3 13459.2
## [9] 13504.5 13552.6 13614.3 13669.5 13722.6 13772.1 13832.0 13862.6
## [17] 13893.0 13926.8 13968.9 14004.7 14033.1 14066.0 14110.1 14155.6
## [25] 14192.2 14231.7 14281.5 14330.3 14359.3 14396.6 14430.8 14478.4
## [33] 14515.7 14554.9 14602.5 14646.4 14695.4 14746.6 14807.4 14874.4
## [41] 14923.3 14988.7 15054.1 15121.7 15184.2 15239.3 15288.9 15346.2
## [49] 15393.5 15439.0 15483.5 15531.5 15579.4 15628.5 15677.3 15736.7
## [57] 15788.3 15839.7 15900.6 15961.5 16018.3 16076.9 16139.0 16203.0
## [65] 16263.3 16327.9 16398.9 16478.3 16538.2 16621.6 16697.0 16777.2
## [73] 16833.1 16891.6 16956.8 17026.3 17085.4 17106.9 17169.4 17239.4
## [81] 17292.0 17354.2 17414.2 17447.3 17482.6 17526.0 17568.7 17627.1
## [89] 17661.5
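If you prefer an explicit date index, the ts object can be converted to an xts object (a sketch, assuming xts is loaded as above; aus_xts is just an illustrative name):
aus_xts <- as.xts(austres) # convert the ts object to xts
class(aus_xts) # "xts" "zoo"
head(aus_xts) # quarterly observations with a date index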
Case Study I
First, you will learn how to use R to download major financial datasets. At this stage, we need to use packages, which are collections of functions that programmers develop to solve certain problems. In this case study, we will use packages such as quantmod, xts, and zoo.
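If any of these packages are missing on your machine, they can be installed once from CRAN; a minimal setup sketch:
install.packages(c("quantmod", "xts", "zoo", "tidyverse")) # run once per machine, then load with library()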
library(quantmod)
getSymbols("^GSPC") # get S&P 500 datasets
## [1] "GSPC"
sp <- GSPC # assign it to a new variable `sp`; you can also use =, such as sp = GSPC
head(sp) # gives the first six rows of the dataset
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2007-01-03 1418.03 1429.42 1407.86 1416.60 3429160000
## 2007-01-04 1416.60 1421.84 1408.43 1418.34 3004460000
## 2007-01-05 1418.34 1418.34 1405.75 1409.71 2919400000
## 2007-01-08 1409.26 1414.98 1403.97 1412.84 2763340000
## 2007-01-09 1412.84 1415.61 1405.42 1412.11 3038380000
## 2007-01-10 1408.70 1415.99 1405.32 1414.85 2764660000
## GSPC.Adjusted
## 2007-01-03 1416.60
## 2007-01-04 1418.34
## 2007-01-05 1409.71
## 2007-01-08 1412.84
## 2007-01-09 1412.11
## 2007-01-10 1414.85
tail(sp) # gives the last six rows of the dataset
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2018-06-28 2698.69 2724.34 2691.99 2716.31 3428140000
## 2018-06-29 2727.13 2743.26 2718.03 2718.37 3565620000
## 2018-07-02 2704.95 2727.26 2698.95 2726.71 3073650000
## 2018-07-03 2733.27 2736.58 2711.16 2713.22 1911470000
## 2018-07-05 2724.19 2737.83 2716.02 2736.61 2953420000
## 2018-07-06 2737.68 2764.41 2733.52 2759.82 2554780000
## GSPC.Adjusted
## 2018-06-28 2716.31
## 2018-06-29 2718.37
## 2018-07-02 2726.71
## 2018-07-03 2713.22
## 2018-07-05 2736.61
## 2018-07-06 2759.82
dim(sp) # gives the dimensions
## [1] 2898 6
So, every time you deal with datasets, I strongly recommend using the tidyverse package, which includes all the tools you need for most data-tidying jobs, such as merging, cleaning, and visualizing datasets. Basically, tidyverse can cover 90% of your data-cleaning work.
library(tidyverse)
class(sp) # the class function tells you what kind of class the dataset belongs to
## [1] "xts" "zoo"
names(sp) # note there is no date column: in an xts object the dates live in the index, not in a variable (unlike Stata, where you would need an explicit date variable)
## [1] "GSPC.Open" "GSPC.High" "GSPC.Low" "GSPC.Close"
## [5] "GSPC.Volume" "GSPC.Adjusted"
# write.csv(sp, file = '/Users/Michael/Desktop/sp.csv')
# If you write the xts object to a csv file this way, you won't find the dates there,
# so we need to convert this time-series dataset into data-frame format.
sp_df <- as_data_frame(coredata(sp)) # as_data_frame() comes from the tidyverse (the tibble package)
sp_df
## # A tibble: 2,898 x 6
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1418. 1429. 1408. 1417. 3429160000 1417.
## 2 1417. 1422. 1408. 1418. 3004460000 1418.
## 3 1418. 1418. 1406. 1410. 2919400000 1410.
## 4 1409. 1415. 1404. 1413. 2763340000 1413.
## 5 1413. 1416. 1405. 1412. 3038380000 1412.
## 6 1409. 1416. 1405. 1415. 2764660000 1415.
## 7 1415. 1427. 1415. 1424. 2857870000 1424.
## 8 1424. 1431. 1423. 1431. 2686480000 1431.
## 9 1431. 1434. 1429. 1432. 2599530000 1432.
## 10 1432. 1435. 1429. 1431. 2690270000 1431.
## # ... with 2,888 more rows
sp_df$Dates <- index(sp)
sp_df # now we have dates
## # A tibble: 2,898 x 7
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1418. 1429. 1408. 1417. 3429160000 1417.
## 2 1417. 1422. 1408. 1418. 3004460000 1418.
## 3 1418. 1418. 1406. 1410. 2919400000 1410.
## 4 1409. 1415. 1404. 1413. 2763340000 1413.
## 5 1413. 1416. 1405. 1412. 3038380000 1412.
## 6 1409. 1416. 1405. 1415. 2764660000 1415.
## 7 1415. 1427. 1415. 1424. 2857870000 1424.
## 8 1424. 1431. 1423. 1431. 2686480000 1431.
## 9 1431. 1434. 1429. 1432. 2599530000 1432.
## 10 1432. 1435. 1429. 1431. 2690270000 1431.
## # ... with 2,888 more rows, and 1 more variable: Dates <date>
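If you would rather have the Dates column first, dplyr's select() together with the everything() helper can reorder the columns without dropping any (a sketch; sp_df_ord is just an illustrative name, and the rest of the case study keeps using sp_df):
sp_df_ord <- select(sp_df, Dates, everything()) # a copy with Dates moved to the front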
Now, let’s do some more advanced work, which includes subsetting the dataset and selecting the main variables we are interested in.
names(sp_df)
## [1] "GSPC.Open" "GSPC.High" "GSPC.Low" "GSPC.Close"
## [5] "GSPC.Volume" "GSPC.Adjusted" "Dates"
names(sp_df) <- c('Open', "High", "Low", "Close", "Volume", "Adjusted", "Dates")
sp_df
## # A tibble: 2,898 x 7
## Open High Low Close Volume Adjusted Dates
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1418. 1429. 1408. 1417. 3429160000 1417. 2007-01-03
## 2 1417. 1422. 1408. 1418. 3004460000 1418. 2007-01-04
## 3 1418. 1418. 1406. 1410. 2919400000 1410. 2007-01-05
## 4 1409. 1415. 1404. 1413. 2763340000 1413. 2007-01-08
## 5 1413. 1416. 1405. 1412. 3038380000 1412. 2007-01-09
## 6 1409. 1416. 1405. 1415. 2764660000 1415. 2007-01-10
## 7 1415. 1427. 1415. 1424. 2857870000 1424. 2007-01-11
## 8 1424. 1431. 1423. 1431. 2686480000 1431. 2007-01-12
## 9 1431. 1434. 1429. 1432. 2599530000 1432. 2007-01-16
## 10 1432. 1435. 1429. 1431. 2690270000 1431. 2007-01-17
## # ... with 2,888 more rows
sp_df_1 <- select(sp_df, Open:Close, Adjusted, Dates) # keep a range of columns plus two named columns
sp_df_2 <- select(sp_df, Open, Close, Volume:Dates) # mix single columns and a range
sp_df_3 <- select(sp_df, -Adjusted) # drop a column with a minus sign
mean(sp_df$Volume) # average daily trading volume
## [1] 3969124134
sp_df_vl <- filter(sp_df, Volume >= mean(sp_df$Volume)) # keep only days with above-average volume; you can filter on any condition
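The same steps can be chained with the tidyverse pipe %>%, which passes the result of one function into the next; a sketch combining the volume filter with a column selection (sp_df_vl2 is just an illustrative name):
sp_df_vl2 <- sp_df %>%
  filter(Volume >= mean(Volume)) %>% # keep above-average-volume days
  select(Open, Close, Volume, Dates) # keep only a few columns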
library(lubridate)
sp_df_dt <- filter(sp_df, Dates >= ymd(20100101)) # all data from 2010 onward
# write_csv()
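Finally, any of these data frames can be saved with readr's write_csv() (loaded as part of the tidyverse); a sketch with an illustrative file name:
write_csv(sp_df_dt, "sp500_since_2010.csv") # the file name here is just an example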