On the top left is the script window. This is where we are going to write all of our code and notes.
On the lower left there’s the console window. This is where R tells us what it thinks we told it and then gives the answer. This part is what R would look like on its own (without RStudio).
The top right has the environment and history tabs. The environment is a list of all objects that R knows about, and the history tab shows all the code that has been run. Datasets that we read in (for example, from Excel or csv files) will appear in the environment tab.
On the bottom right there’s a window with lots of tabs. Files shows the file structure of the working directory. Plots is where your visualizations will appear. Packages lists all of the installed packages; the ones that are checked are currently loaded. Help is where we will learn about functions when we need assistance, and the Viewer is for viewing other kinds of output, like web content.
So that all of our set-ups look the same, follow me to change a few settings. Go to Tools > Global Options.
Once these options are set, they will remain active for every R session you open on your local machine so you will not need to follow these steps again.
Command + Option + I
\(\overline{y}=mean(y_{i})\)
\(\overline{y}=\frac{1}{n}\sum_{i=1}^{n} y_{i}\)
For example, our mean was 17.83. For the value of 20, the deviation would be equal to 20 – 17.83 = 2.17. For the value of 12, the deviation would be 12 – 17.83 = -5.83.
If we do that for all of the values and then sum the deviations, we get 0, because the positive and negative deviations cancel out. That is not very useful. We need a way to make every deviation count as a positive amount and then, in a way, average them.
Use the Sum of Squares: \(SS=\sum_{i=1}^{n}(y_{i}-\overline{y})^2\)
Variance= \(s^2 = SS/(n-1)\)
Standard deviation= \(s = \sqrt{s^2}= \sqrt{SS/(n-1)}\)
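As a rough check on the formulas above, the sketch below computes the deviations, the sum of squares, the variance, and the standard deviation by hand, then compares the results with R's built-in var() and sd(). The vector y is a made-up example (an assumption for illustration); it is not the data behind the mean of 17.83.

```r
# Hypothetical data, chosen only for illustration
y <- c(55, 60, 35, 70)

y_bar <- mean(y)             # sample mean
deviations <- y - y_bar      # each value minus the mean
sum(deviations)              # positive and negative deviations cancel

SS <- sum(deviations^2)      # sum of squared deviations
s2 <- SS / (length(y) - 1)   # variance: SS / (n - 1)
s  <- sqrt(s2)               # standard deviation

# The hand calculations should agree with the built-in functions
all.equal(s2, var(y))
all.equal(s, sd(y))
```

Squaring is what keeps the deviations from cancelling, and taking the square root at the end puts the result back in the original units.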
A function is a verb; it tells R to do something. To call an R function, we type the name of the function followed directly by (). Recall our use of the read.csv() and View() functions. The items passed to the function inside the () are called arguments. Arguments change the way a function behaves.
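To make the idea of an argument concrete, here is a small sketch using round(), a base R function. The object names rounded_default and rounded_two are made up for this example.

```r
# Same function, different arguments
rounded_default <- round(3.14159)              # digits defaults to 0
rounded_two     <- round(3.14159, digits = 2)  # digits changes the result

rounded_default
rounded_two
```

The function is the same in both calls; only the digits argument differs, and that alone changes what comes back.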
Example: Numeric variables
object1 <- c(55, 60, 35, 70)
Taking a Sum:
sum(object1)
## [1] 220
Taking the Mean:
mean(object1)
## [1] 55
Taking the Square root:
sqrt(object1)
## [1] 7.416198 7.745967 5.916080 8.366600
Taking the standard deviation:
sd(object1)
## [1] 14.7196
Most functions in R are vectorized, meaning that they will work on a vector as well as on a single value. This means that in R, we usually do not need to write loops like we would in other languages.
summary() is a function that is useful for numeric variables. It will give you the minimum, Q1, median, mean, Q3, and maximum.
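The sketch below contrasts a vectorized call with the loop it replaces. The vector values and both result names are hypothetical; the loop is shown only for comparison and is not something we would normally write.

```r
values <- c(4, 9, 16, 25)   # hypothetical vector

# Vectorized: one call handles every element
roots_vectorized <- sqrt(values)

# Equivalent explicit loop, shown only for comparison
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# Both approaches produce the same vector
identical(roots_vectorized, roots_loop)
```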
summary(object1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.0 50.0 57.5 55.0 62.5 70.0
The first thing we are going to do is read in data from a csv file. We are initially going to use the read.csv function which comes with your normal Base R installation. Later on, we will use a different version of this function. For now, just run the code and see what happens.
chocolate <- read.csv("chocolate.csv")
To View() the dataset in spreadsheet form, we can click on the dataset’s name in the Environment tab. Notice that this action is accompanied by some code in the console, telling us that we could also get there using code. Let’s try it both ways.
#Use the View/view function to investigate your dataset.
#View(chocolate)
R has a few different types of objects. We already saw some vectors (one dimensional collection of items) before when we created object2 and object3.
R’s dataframes store two-dimensional, tabular, heterogeneous data. Two-dimensional and tabular means a table whose rows and columns form the 2 dimensions. Heterogeneous means that each column can contain a different type of data (e.g., one column, Age, is numeric while another, Gender, is character).
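As an illustration of heterogeneous columns, the sketch below builds a tiny data frame by hand. The object people and its values are made up for this example; stringsAsFactors = FALSE is set explicitly so the text column stays character in older versions of R as well.

```r
# Hypothetical data frame: one numeric column, one character column
people <- data.frame(
  Age = c(34, 27, 45),
  Gender = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

dim(people)                            # rows, then columns
column_types <- sapply(people, class)  # the type of each column
column_types
```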
A dataset is considered “tidy” when each variable forms a column, each observation forms a row, and each cell only contains one piece of data. This means that the entries within a column should all be the same type as each other.
We can also directly call the dim() function to see the dimensions of the chocolate dataset.
dim(chocolate) #rows then columns
## [1] 2530 11
We can also ask for the number of rows and the number of columns separately.
nrow(chocolate)
## [1] 2530
ncol(chocolate)
## [1] 11
To see all the column headings, we can call the function names().
names(chocolate)
## [1] "ref" "company_manufacturer"
## [3] "company_location" "country_of_bean_origin"
## [5] "review_date" "rating"
## [7] "specific_bean_origin_or_bar_name" "num_ingredients"
## [9] "ingredients" "cocoa_percent"
## [11] "most_memorable_characteristics"
The two functions you’ll probably use the most to inspect data frames, because they are the most descriptive, are summary() and str(), both of which we used above to inspect vectors.
summary(chocolate)
## ref company_manufacturer company_location country_of_bean_origin
## Min. : 5 Length:2530 Length:2530 Length:2530
## 1st Qu.: 802 Class :character Class :character Class :character
## Median :1454 Mode :character Mode :character Mode :character
## Mean :1430
## 3rd Qu.:2079
## Max. :2712
##
## review_date rating specific_bean_origin_or_bar_name
## Min. :2006 Min. :1.000 Length:2530
## 1st Qu.:2012 1st Qu.:3.000 Class :character
## Median :2015 Median :3.250 Mode :character
## Mean :2014 Mean :3.196
## 3rd Qu.:2018 3rd Qu.:3.500
## Max. :2021 Max. :4.000
##
## num_ingredients ingredients cocoa_percent
## Min. :1.000 Length:2530 Min. : 42.00
## 1st Qu.:2.000 Class :character 1st Qu.: 70.00
## Median :3.000 Mode :character Median : 70.00
## Mean :3.042 Mean : 71.64
## 3rd Qu.:4.000 3rd Qu.: 74.00
## Max. :6.000 Max. :100.00
## NA's :163
## most_memorable_characteristics
## Length:2530
## Class :character
## Mode :character
##
##
##
##
str(chocolate)
## 'data.frame': 2530 obs. of 11 variables:
## $ ref : int 5 15 15 15 15 15 15 24 24 24 ...
## $ company_manufacturer : chr "Jacque Torres" "Green & Black's (ICAM)" "Guittard" "Neuhaus (Callebaut)" ...
## $ company_location : chr "U.S.A." "U.K." "U.S.A." "Belgium" ...
## $ country_of_bean_origin : chr "Ghana" "Blend" "Colombia" "Blend" ...
## $ review_date : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
## $ rating : num 2 2.5 3 2 2.75 2 3.5 3.75 2 2 ...
## $ specific_bean_origin_or_bar_name: chr "Trinatario Treasure" "Dark" "Chucuri" "West Africa" ...
## $ num_ingredients : int 5 5 5 5 5 5 5 3 4 4 ...
## $ ingredients : chr "5- B,S,C,V,L" "5- B,S,C,V,L" "5- B,S,C,V,L" "5- B,S,C,V,L" ...
## $ cocoa_percent : num 71 70 65 73 75 82 70 75 60 85 ...
## $ most_memorable_characteristics : chr "gritty, unrefined, off notes" "mildly rich, basic, roasty" "creamy, sweet, floral, vanilla" "non descript, poor aftertaste" ...
You might have noticed a $ in front of the variable names in the str() output. That symbol is how we access individual variables, or columns, from a dataframe.
The syntax we want is dataframe$columnname. Let’s look at the country_of_bean_origin column.
#Using a $ to call up specific variables or columns
chocolate$country_of_bean_origin
That code prints the whole column, which is over 2500 observations long. Usually, printing out a long vector or column to the console is not useful.
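For a smaller-scale look at what $ returns, here is a sketch on a made-up data frame (bars and its values are assumptions for illustration). It also shows that $ gives the same result as double square brackets with the column name, which is standard base R behavior.

```r
# Hypothetical data frame with three rows
bars <- data.frame(
  origin = c("Ghana", "Peru", "Belize"),
  rating = c(3.0, 3.5, 2.75),
  stringsAsFactors = FALSE
)

origin_dollar   <- bars$origin       # column as a plain vector
origin_brackets <- bars[["origin"]]  # same column, bracket syntax

identical(origin_dollar, origin_brackets)
length(bars$rating)                  # one entry per row
```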
head() is a function allowing us to look at just the first 6 entries
head(chocolate$country_of_bean_origin)
## [1] "Ghana" "Blend" "Colombia" "Blend" "Sao Tome" "Blend"
What if we want to see the first 20 values?
Let’s see if we can find out by calling help on the head() function. To call the help menu, put a ? before the function name. The help page will show you how to format your call and which arguments are required. The help menu is located in the lower right.
#Calling the help menu
?head
The help menu tells us what head() does and it also specifies the other arguments that we could input to the head() function in the Arguments section. This is always a good section to check out. Remember that an argument is an option we specify to a function to change how the function operates.
Let’s try adding the n = argument to head().
head(chocolate$country_of_bean_origin, n = 20)
## [1] "Ghana" "Blend" "Colombia"
## [4] "Blend" "Sao Tome" "Blend"
## [7] "Blend" "Blend" "Blend"
## [10] "Blend" "Sao Tome" "Dominican Republic"
## [13] "Madagascar" "Papua New Guinea" "Venezuela"
## [16] "U.S.A." "Venezuela" "Venezuela"
## [19] "Jamaica" "Colombia"
Although we can specify the head function without naming the arguments, it is good practice to label the arguments to clarify what the code is doing. However, it is conventional to skip labeling the first argument, x, since its label is easily assumed.
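A quick sketch of the point about labeling arguments: the two calls below do the same thing, but the second states what the 3 means. The vector letters_vec and the result names are made up for this example.

```r
letters_vec <- c("a", "b", "c", "d", "e", "f", "g", "h")  # hypothetical vector

by_position <- head(letters_vec, 3)      # second argument matched by position
by_name     <- head(letters_vec, n = 3)  # second argument matched by name

identical(by_position, by_name)
```

Both forms return the first three values; the named version is simply easier to read later.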
You may have read in the help menu that head() has a companion function, tail(), that shows the last n values or rows.
tail(chocolate$country_of_bean_origin, n = 20)
## [1] "Dominican Republic" "Brazil" "Belize"
## [4] "Vietnam" "India" "Peru"
## [7] "Venezuela" "China" "Vietnam"
## [10] "Bolivia" "Madagascar" "U.S.A."
## [13] "Venezuela" "Venezuela" "Peru"
## [16] "Peru" "Philippines" "Papua New Guinea"
## [19] "Indonesia" "Malaysia"
Let’s calculate some descriptive statistics for the rating of each chocolate using the `rating` column/variable. Again, we will look at the equations for these functions later on.
mean(chocolate$rating)
## [1] 3.196344
sd(chocolate$rating)
## [1] 0.4453213
median(chocolate$rating)
## [1] 3.25
IQR(chocolate$rating)
## [1] 0.5
range(chocolate$rating)
## [1] 1 4
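To see how a couple of these statistics relate to each other, the sketch below recomputes the IQR from the quartiles and the spread from range(). The vector scores is a small made-up example, not the chocolate data.

```r
scores <- c(55, 60, 35, 70)   # hypothetical values

# IQR is the distance between the 3rd and 1st quartiles
q <- quantile(scores, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
all.equal(iqr_by_hand, IQR(scores))

# range() returns the minimum and maximum; diff() gives the spread
score_range  <- range(scores)
score_spread <- diff(score_range)
score_range
score_spread
```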
For variables where there are missing values, we will need to include an argument that removes the missing values. Try to calculate the sum of the number of ingredients.
sum(chocolate$num_ingredients)
## [1] NA
NA will pop up, because a sum that includes a missing value is itself unknown. To avoid this issue, you need to remove the missing values.
?sum
Looking at the Arguments section tells us that the argument we need to include is na.rm = TRUE.
sum(chocolate$num_ingredients, na.rm = TRUE)
## [1] 7201
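The same behavior can be seen on a tiny made-up vector, which makes it easier to check by eye. The object counts and the result names are hypothetical; is.na() is a base R function that flags missing values.

```r
counts <- c(3, NA, 5, 2)   # hypothetical vector with one missing value

total_with_na    <- sum(counts)                 # NA: the missing value spreads
total_without_na <- sum(counts, na.rm = TRUE)   # missing value dropped first
n_missing        <- sum(is.na(counts))          # how many values are missing

total_with_na
total_without_na
n_missing
```

Counting the missing values first is a useful habit, since na.rm = TRUE silently drops them.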
In order to use additional functionality in R, we bring in packages. The install.packages() function installs the package to your local machine, while the library() function loads the functions from that package into the R environment so that we can use them. You need to run library() every time you start a new R session (in practice, at the top of each script/markdown file), but you do not need to run install.packages() every time.
#install.packages("tidyverse")
library(tidyverse)
In this case, the code opened a library containing R functions we want to use. You can think of libraries like apps on your phone. We have now opened the app so we can use it. The output in the console specifies which libraries we have loaded. We can see from the output of the library(tidyverse) line that the tidyverse is actually a megapackage, containing 8 packages. All of these packages share a similar syntax in an attempt to simplify coding and readability for R users. Aside from the core tidyverse packages, there are around 10 other packages that are installed along with the tidyverse but must be loaded individually with library().
Below, we are using the read_csv function rather than the read.csv function. The read_csv function reads in your dataframe as a tibble, which is essentially a dataframe with some added functionality and aesthetic flourishes.
The read_csv function is only available once the readr package is loaded. readr is part of the tidyverse, so library(tidyverse) takes care of this.
new_chocolate <- read_csv("chocolate.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ref = col_double(),
## company_manufacturer = col_character(),
## company_location = col_character(),
## country_of_bean_origin = col_character(),
## review_date = col_double(),
## rating = col_double(),
## specific_bean_origin_or_bar_name = col_character(),
## num_ingredients = col_double(),
## ingredients = col_character(),
## cocoa_percent = col_double(),
## most_memorable_characteristics = col_character()
## )
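To compare the two readers side by side without needing chocolate.csv, the sketch below writes a tiny made-up csv to a temporary file and reads it both ways. The file contents and object names are assumptions for illustration, and the readr step is skipped if the package is not installed. Passing col_types = readr::cols() tells read_csv to guess the column types without printing the column specification.

```r
# Write a tiny, made-up csv to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,rating", "Bar A,3.5", "Bar B,2.75"), tmp)

# Base R reader: returns a plain data frame
base_df <- read.csv(tmp)
class(base_df)

# readr reader: returns a tibble, which is still a data frame
if (requireNamespace("readr", quietly = TRUE)) {
  tidy_df <- readr::read_csv(tmp, col_types = readr::cols())
  print(class(tidy_df))
  print(is.data.frame(tidy_df))
}
```

Because a tibble is still a data frame, the inspection functions used earlier, such as dim(), names(), and summary(), work on it as well.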