Let’s start a new project. We can call it “ba” for business analytics. Click on the “Project” icon in the upper right corner of RStudio. Select ‘New Project’, then ‘New Directory’, and choose a folder on your computer where you would like to keep your files for this class.
Let’s open a new R Markdown document. R Markdown combines text and R code. R code is enclosed between an opening line of three backticks followed by {r} and a closing line of three backticks. For example, if we want to read in some data we type the following:
To execute the R code (rather than knit the entire R Markdown document), we put the cursor on the line we want to execute and either click ‘Run’ or hold the ‘Control’ key and press ‘Enter’.
The read.csv() function reads comma-separated files. It takes the location of the file as an argument. The location can be a path on your hard drive or a URL. In this case the location is the URL of a file on Yahoo!’s website.
The <- operator assigns the result of the read.csv() function to an object we named ‘mydata’. The result of the read.csv() function is an object called a data frame. You can examine this data frame by clicking on its name in the Environment window.
A data frame is an object in R that holds variables and observations. Let’s examine the structure of our data frame ‘mydata’. We can do this by executing the str() function. (We can either enter the function into the console or, if we want it to be part of the markdown document, enter and execute it within the markdown document.)
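As a minimal sketch of the same read-and-assign pattern on a file you control, the code below writes a tiny CSV to a temporary file and reads it back. The file contents and the object name ‘toy’ are made up for illustration and are not the class data; with real data you would pass the file’s path or URL to read.csv() instead.

```r
# Write a small, made-up CSV file to a temporary location (illustration only)
path <- tempfile(fileext = ".csv")
writeLines(c("Date,Close",
             "2000-01-03,100.5",
             "2000-01-04,101.2"), path)

# read.csv() takes the location of the file and returns a data frame;
# <- assigns that data frame to an object we name 'toy'
toy <- read.csv(path)

class(toy)   # the object is a data frame
nrow(toy)    # number of observations (rows)
names(toy)   # variable names taken from the header line
```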
str(mydata)
## 'data.frame': 16862 obs. of 7 variables:
## $ Date : Factor w/ 16862 levels "1950-01-03","1950-01-04",..: 16862 16861 16860 16859 16858 16857 16856 16855 16854 16853 ...
## $ Open : num 2268 2262 2252 2252 2250 ...
## $ High : num 2272 2273 2264 2254 2255 ...
## $ Low : num 2260 2262 2245 2234 2245 ...
## $ Close : num 2269 2271 2258 2239 2249 ...
## $ Volume : num 3.76e+09 3.76e+09 3.77e+09 2.67e+09 2.34e+09 ...
## $ Adj.Close: num 2269 2271 2258 2239 2249 ...
The results tell us that we have over 16 thousand observations and 7 variables. They also tell us the variable names, their types, and the first few observations. We see that the variables Open, High, Low, etc. are numerical (‘num’). Date is a Factor. Factors are variables that usually take on a limited number of values. These values are called factor levels. Factor variables are stored as integer codes but displayed with character labels. (In R 4.0.0 and later, read.csv() reads text columns as character rather than Factor unless you set stringsAsFactors = TRUE, so your output may show ‘chr’ for Date.)
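To see how a factor is stored, here is a small sketch using a made-up vector (the name ‘sizes’ and its values are assumptions for illustration only). It shows the levels and the integer codes behind them.

```r
# A made-up character vector with only three distinct values
sizes <- c("small", "large", "medium", "small", "large")

# factor() converts it to a factor; by default the levels are sorted alphabetically
f <- factor(sizes)

levels(f)      # the factor levels: "large" "medium" "small"
as.integer(f)  # the integer codes stored underneath: 3 1 2 3 1
```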
If we want to do something with a particular variable, we write the name of the data frame, a ‘$’ sign, and the name of the variable. For example, below we calculate the summary of the variable Close in the data frame mydata:
summary(mydata$Close)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.66 84.50 156.70 508.70 1008.00 2272.00
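The same ‘$’ notation works for any function that takes a variable, and it can also be used to create a new variable. A short sketch on a made-up data frame (‘toy’ and ‘Double’ are hypothetical names used only here):

```r
# A made-up data frame with one numeric variable
toy <- data.frame(Close = c(10, 20, 30, 40))

toy$Close                    # extract the variable Close as a vector
mean(toy$Close)              # apply a function to it: 25

# Assigning to a name that does not exist yet adds a new variable
toy$Double <- toy$Close * 2
names(toy)                   # "Close" "Double"
```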
We can change the type of a variable. For example, in this case we may want to change the Date variable from a ‘Factor’ type to a ‘Date’ type. In other words, we would like R to understand that Date is not just any old set of characters but a calendar date. We can do this using the as.Date() function. The function takes as arguments the variable we would like to convert and the format of that variable. We assign the result to the variable Date in the data frame mydata, overwriting the existing variable.
mydata$Date <- as.Date(mydata$Date, "%Y-%m-%d")
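The format argument has to match how the dates are written in the data. As a sketch, suppose (hypothetically) the dates were stored in month/day/year order instead; the values below are made up for illustration.

```r
# Made-up dates stored as a factor in month/day/year order
x <- factor(c("01/05/2017", "12/30/2016"))

# The format string tells as.Date() how to read each piece:
# %m = month, %d = day of the month, %Y = four-digit year
d <- as.Date(x, format = "%m/%d/%Y")

class(d)      # "Date"
d[1] - d[2]   # Date variables support arithmetic: a difference of 6 days
```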
Let’s check the type of our variable:
str(mydata$Date)
## Date[1:16862], format: "2017-01-05" "2017-01-04" "2017-01-03" "2016-12-30" ...
We can also run a summary on this variable so we know when our data begin and end.
summary(mydata$Date)
## Min. 1st Qu. Median Mean 3rd Qu.
## "1950-01-03" "1966-10-07" "1983-07-30" "1983-07-18" "2000-04-03"
## Max.
## "2017-01-05"
R packages are add-ons to base R. They have to be installed before you can use them. You can install them by clicking on the ‘Packages’ tab in the bottom right panel of RStudio and then clicking on the ‘Install’ button. Type the name of the package and click ‘Install’.
The packages that are particularly useful for this class are dplyr, tidyr, lubridate, ggplot2, and stargazer. You should install these on your computer. You only need to install a package once. However, you need to ‘load’ the package into each new R session using the library() command. For example, if we plan to use commands from the ggplot2 package we should include the following in our code:
library(ggplot2)
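Packages can also be installed from code with install.packages(). The sketch below is one possible way, not a required step, to check which of the class packages are still missing; the install line is commented out because it needs an internet connection.

```r
# Packages listed above for this class
pkgs <- c("dplyr", "tidyr", "lubridate", "ggplot2", "stargazer")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install only what is missing (uncomment to run; needs an internet connection)
# install.packages(missing_pkgs)
missing_pkgs
```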
Let’s plot the closing value of the S&P 500 in our data set. We use a powerful function called ggplot(). The function creates a plot by combining a few components. First, it needs to know the data frame from which to get the data. Second, it needs to know which variables should be on the x and y axes. This is specified in aes() (short for aesthetics). Finally, it needs to know the geometric object (geom) we want to use to represent the data (in this case a line).
ggplot(data=mydata, aes(x=Date, y=Close)) + geom_line()
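The three components are easier to see when the plot is built up piece by piece. The sketch below uses a small made-up data frame (‘toy’, with invented values) in place of mydata, and adds optional axis labels with labs(); it assumes ggplot2 is installed.

```r
library(ggplot2)

# Made-up data frame standing in for mydata (illustration only)
toy <- data.frame(
  Date  = as.Date("2017-01-01") + 0:9,
  Close = c(100, 102, 101, 105, 107, 106, 110, 108, 111, 115)
)

# data frame + aesthetics (which variable goes on which axis) + a geom
p <- ggplot(data = toy, aes(x = Date, y = Close)) +
  geom_line() +
  labs(x = "Date", y = "Closing value")  # optional axis labels

p  # printing the object draws the plot
```

Swapping geom_line() for another geom, such as geom_point(), changes how the same data are drawn without touching the data or the aesthetics.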
Write a new R Markdown file that does the analysis and answers the questions below. Knit your R Markdown document, print it, and bring it to class.
Date type variable.ggplot terminology). E.g. 100,200,400 etc.