My plan to clean the datasets: 1. The datasets I want to use - 1) How ESG equities performed in general (building on the exercise we did in class); 2) Stock prices before and during the pandemic from Yahoo, which could be attained thourgh Tidyquant directly from R; 3) Categorization of the holders of assets, i.e. the companies, in an attempt to give the audience a clearer picture of how these companies are doing. To do that, I plan to categorize the holders into two groups - one is the ESG leader group, graded by MSCI; the other is other holders in S&P500. The leader’s dataset can be found on the website of iShares. 2. Set up the folders structure for cleaned data
It seems that the pre and in pandemic data header names are somewhat different than the ESG leader dataset’s. Here I’m trying to rename the headers so there won’t be errors when it comes to combining data sets.
Here I’m trying to get S&P500 stock prices before the pandemic:
SP_prices_pre_pandemic <- tq_index("SP500") %>%
tq_get(get = "stock.prices",
from = "2019-03-13",
to = "2020-03-13") %>%
mutate(shares_held = shares_held/10^6)
Getting holdings for SP500
Warning: x = 'CARR', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "CARR", env = <environment>, verbose = FALSE, : Unable to import “CARR”.
CARR download failed after two attempts. Error message:
HTTP error 400.
Removing CARR.Warning: x = 'OTIS', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "OTIS", env = <environment>, verbose = FALSE, : Unable to import “OTIS”.
OTIS download failed after two attempts. Error message:
HTTP error 400.
Removing OTIS.Warning: x = 'CEG', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "CEG", env = <environment>, verbose = FALSE, : Unable to import “CEG”.
CEG download failed after two attempts. Error message:
HTTP error 400.
Removing CEG.Warning: x = 'OGN', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "OGN", env = <environment>, verbose = FALSE, : Unable to import “OGN”.
OGN download failed after two attempts. Error message:
HTTP error 400.
Removing OGN.
SP_prices_pre_pandemic
Take a glimpse of the dataset:
SP_prices_pre_pandemic %>% glimpse()
Rows: 126,191
Columns: 15
$ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "…
$ company <chr> "Apple Inc.", "Apple Inc.", "Apple Inc.", "Apple …
$ identifier <chr> "03783310", "03783310", "03783310", "03783310", "…
$ sedol <chr> "2046251", "2046251", "2046251", "2046251", "2046…
$ weight <dbl> 0.06757584, 0.06757584, 0.06757584, 0.06757584, 0…
$ sector <chr> "Information Technology", "Information Technology…
$ shares_held <dbl> 169.2767, 169.2767, 169.2767, 169.2767, 169.2767,…
$ local_currency <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", …
$ date <date> 2019-03-13, 2019-03-14, 2019-03-15, 2019-03-18, …
$ open <dbl> 45.5625, 45.9750, 46.2125, 46.4500, 47.0875, 46.5…
$ high <dbl> 45.8250, 46.0250, 46.8325, 47.0975, 47.2475, 47.3…
$ low <dbl> 45.2300, 45.6400, 45.9350, 46.4475, 46.4800, 46.1…
$ close <dbl> 45.4275, 45.9325, 46.5300, 47.0050, 46.6325, 47.0…
$ volume <dbl> 124130000, 94318000, 156171600, 104879200, 126585…
$ adjusted <dbl> 44.03357, 44.52307, 45.10223, 45.56266, 45.20159,…
Change the headers:
colnames(SP_prices_pre_pandemic) <- c("Ticker", "Company", "Identifier",
"Sedol", "Weight", "Sector",
"Shares_held_mn", "Local Currency",
"Date", "Open", "High", "Low",
"Close", "Volume", "Adjusted")
SP_prices_pre_pandemic %>% glimpse()
Rows: 126,191
Columns: 15
$ Ticker <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",…
$ Company <chr> "Apple Inc.", "Apple Inc.", "Apple Inc.", "Appl…
$ Identifier <chr> "03783310", "03783310", "03783310", "03783310",…
$ Sedol <chr> "2046251", "2046251", "2046251", "2046251", "20…
$ Weight <dbl> 0.06757584, 0.06757584, 0.06757584, 0.06757584,…
$ Sector <chr> "Information Technology", "Information Technolo…
$ Shares_held_mn <dbl> 169.2767, 169.2767, 169.2767, 169.2767, 169.276…
$ `Local Currency` <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD"…
$ Date <date> 2019-03-13, 2019-03-14, 2019-03-15, 2019-03-18…
$ Open <dbl> 45.5625, 45.9750, 46.2125, 46.4500, 47.0875, 46…
$ High <dbl> 45.8250, 46.0250, 46.8325, 47.0975, 47.2475, 47…
$ Low <dbl> 45.2300, 45.6400, 45.9350, 46.4475, 46.4800, 46…
$ Close <dbl> 45.4275, 45.9325, 46.5300, 47.0050, 46.6325, 47…
$ Volume <dbl> 124130000, 94318000, 156171600, 104879200, 1265…
$ Adjusted <dbl> 44.03357, 44.52307, 45.10223, 45.56266, 45.2015…
Since there are some information not so relevant to our study, we can drop them here
SP_prices_pre_pandemic_clean = subset(SP_prices_pre_pandemic,
select = -c(Identifier, Sedol, Open,
High, Low, Close, Volume))
SP_prices_pre_pandemic_clean
Repeat the steps for stock prices in the pandemic, starting by importing the stock prices of S&P500 holders in the pandemic:
SP_prices_in_pandemic <- tq_index("SP500") %>%
tq_get(get = "stock.prices",
from = "2020-03-14",
to = "2021-03-14") %>%
mutate(shares_held = shares_held/10^6)
Getting holdings for SP500
Warning: x = 'CEG', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "CEG", env = <environment>, verbose = FALSE, : Unable to import “CEG”.
CEG download failed after two attempts. Error message:
HTTP error 400.
Removing CEG.Warning: x = 'OGN', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "OGN", env = <environment>, verbose = FALSE, : Unable to import “OGN”.
OGN download failed after two attempts. Error message:
HTTP error 400.
Removing OGN.
SP_prices_in_pandemic
Take a glimpse:
SP_prices_in_pandemic %>% glimpse()
Rows: 125,745
Columns: 15
$ symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "…
$ company <chr> "Apple Inc.", "Apple Inc.", "Apple Inc.", "Apple …
$ identifier <chr> "03783310", "03783310", "03783310", "03783310", "…
$ sedol <chr> "2046251", "2046251", "2046251", "2046251", "2046…
$ weight <dbl> 0.06757584, 0.06757584, 0.06757584, 0.06757584, 0…
$ sector <chr> "Information Technology", "Information Technology…
$ shares_held <dbl> 169.2767, 169.2767, 169.2767, 169.2767, 169.2767,…
$ local_currency <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", …
$ date <date> 2020-03-16, 2020-03-17, 2020-03-18, 2020-03-19, …
$ open <dbl> 60.4875, 61.8775, 59.9425, 61.8475, 61.7950, 57.0…
$ high <dbl> 64.7700, 64.4025, 62.5000, 63.2100, 62.9575, 57.1…
$ low <dbl> 60.0000, 59.6000, 59.2800, 60.6525, 57.0000, 53.1…
$ close <dbl> 60.5525, 63.2150, 61.6675, 61.1950, 57.3100, 56.0…
$ volume <dbl> 322423600, 324056000, 300233600, 271857200, 40169…
$ adjusted <dbl> 59.46272, 62.07730, 60.55764, 60.09365, 56.27856,…
Then change the headers:
colnames(SP_prices_in_pandemic) <- c("Ticker", "Company", "Identifier",
"Sedol", "Weight", "Sector",
"Shares_held_mn", "Local Currency",
"Date", "Open", "High", "Low",
"Close", "Volume", "Adjusted")
SP_prices_in_pandemic %>% glimpse()
Rows: 125,745
Columns: 15
$ Ticker <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",…
$ Company <chr> "Apple Inc.", "Apple Inc.", "Apple Inc.", "Appl…
$ Identifier <chr> "03783310", "03783310", "03783310", "03783310",…
$ Sedol <chr> "2046251", "2046251", "2046251", "2046251", "20…
$ Weight <dbl> 0.06757584, 0.06757584, 0.06757584, 0.06757584,…
$ Sector <chr> "Information Technology", "Information Technolo…
$ Shares_held_mn <dbl> 169.2767, 169.2767, 169.2767, 169.2767, 169.276…
$ `Local Currency` <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD"…
$ Date <date> 2020-03-16, 2020-03-17, 2020-03-18, 2020-03-19…
$ Open <dbl> 60.4875, 61.8775, 59.9425, 61.8475, 61.7950, 57…
$ High <dbl> 64.7700, 64.4025, 62.5000, 63.2100, 62.9575, 57…
$ Low <dbl> 60.0000, 59.6000, 59.2800, 60.6525, 57.0000, 53…
$ Close <dbl> 60.5525, 63.2150, 61.6675, 61.1950, 57.3100, 56…
$ Volume <dbl> 322423600, 324056000, 300233600, 271857200, 401…
$ Adjusted <dbl> 59.46272, 62.07730, 60.55764, 60.09365, 56.2785…
And drop some irrelevant columns:
SP_prices_in_pandemic_clean = subset(SP_prices_in_pandemic,
select = -c(Identifier, Sedol, Open,
High, Low, Close, Volume))
SP_prices_in_pandemic_clean
Only one thing before we combine the datasets together - change the column name of ESG leaders
colnames(iShares_ESG_MSCI_USA_Leaders_ETF_fund)[2] <- "Company"
iShares_ESG_MSCI_USA_Leaders_ETF_fund
Now combining datasets to get the ESG leaders of S&P500!
ESG_leader_pre_pandemic_prices <- semi_join(SP_prices_pre_pandemic_clean,
iShares_ESG_MSCI_USA_Leaders_ETF_fund, by = "Ticker")
ESG_leader_pre_pandemic_prices
ESG_leader_in_pandemic_prices <- semi_join(SP_prices_in_pandemic_clean,
iShares_ESG_MSCI_USA_Leaders_ETF_fund, by = "Ticker")
ESG_leader_in_pandemic_prices
The rest would be the other group that we want to compare with the former one
Non_ESG_Leader_pre_pandemic <- anti_join(SP_prices_pre_pandemic_clean,
iShares_ESG_MSCI_USA_Leaders_ETF_fund, by = "Ticker")
Non_ESG_Leader_pre_pandemic
Non_ESG_Leader_in_pandemic <- anti_join(SP_prices_in_pandemic_clean,
iShares_ESG_MSCI_USA_Leaders_ETF_fund, by = "Ticker")
Non_ESG_Leader_in_pandemic
Exporting the processed data sets
OutPath <- "/Users/chenyiran/Desktop/SAIS Fall 2022/Sustainable Finance/Final Project/03_data_processed/"
write_csv(ESG_leader_pre_pandemic_prices, "ESG leader before pandemic")
write_csv(ESG_leader_in_pandemic_prices, "ESG leader in pandemic")
write_csv(Non_ESG_Leader_pre_pandemic, "Non ESG leader before pandemic")
write_csv(Non_ESG_Leader_in_pandemic, "Non ESG leader in pandemic")