RStudio Panes

On the top left is the script window. This is where we are going to write all of our code and notes.
On the lower left is the console window. This is where R echoes the code we ran and then prints the answer. The console is what R looks like on its own, without RStudio.
The top right has the Environment and History tabs. The Environment tab lists all the objects that R currently knows about, and the History tab shows all the code that has been run. Imported data, such as Excel files and other datasets, will appear in the Environment tab.
On the bottom right is a window with several tabs. Files shows the file structure of the working directory. Plots is where your visualizations will appear. Packages lists all of the installed packages; the ones that are checked are currently loaded. Help is where we will learn about functions when we need assistance, and Viewer is for viewing other kinds of output, such as web content.

RStudio Global Options

So that all of our setups look the same, follow me to change a few settings. Go to Tools –> Global Options
- In the Code menu check the box for “Soft-wrap R source files”
- In the R Markdown menu UNcheck the box for “Show output inline for all RMarkdown documents”

Once these options are set, they will remain active for every R session you open on your local machine, so you will not need to follow these steps again.

Types of Data

Categorical
- data that can be divided into categories and has no meaningful order
Ordinal
- divided into categories with a logical order
Quantitative
- uses real numbers
- Discrete: whole-number data with specific, fixed values determined by counting
- Continuous: values that can take on any value within a finite or infinite interval
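
As a rough sketch of how these data types might be stored in R: factor() can hold categorical data, an ordered factor can hold ordinal data, and integer or numeric vectors hold quantitative data. All object names and values below are made up for illustration and are not part of the course data.

```r
# Hypothetical example values, invented for illustration only

# Categorical: categories with no meaningful order
bean_origin <- factor(c("Ghana", "Peru", "Ghana", "Belize"))

# Ordinal: categories with a logical order, declared through `levels`
bar_size <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)

# Quantitative, discrete: counts stored as integers (note the L suffix)
num_bars <- c(2L, 0L, 5L)

# Quantitative, continuous: any value within an interval
cocoa_grams <- c(70.5, 82.25, 64)

levels(bean_origin)        # the distinct categories, sorted alphabetically
bar_size[1] < bar_size[2]  # TRUE: comparisons work for ordered factors
class(num_bars)            # "integer"
class(cocoa_grams)         # "numeric"
```

Comparing two values with < only makes sense for the ordered factor; R returns NA with a warning if we try the same comparison on an unordered factor.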

Shortcut for adding chunks

Command + Option + I (Mac) or Ctrl + Alt + I (Windows/Linux)

Descriptive Statistics

Measures of the center: Mean, Median, Mode

Mean: The sample mean measures the center of a sample’s distribution

$\overline{y}=mean(y_{i})$

$\overline{y}=\sum_{i=1}^{n} y_{i}/n$

Median: Middle observation after sorting
- If n is odd, then it is the middle number after sorting
- If n is even, it is the average of the middle two numbers after sorting
Mode: Most frequently observed value(s)

When to use Mean or Median
- The answer is that it depends on the distribution shape for that variable
- The mean uses all of the data.
- The median uses only the middle or middle two numbers (though the other numbers do determine where the middle is).
- The mean is extensively used in statistics, particularly the kind we’re going to learn
- The median is better when the data are highly skewed, very spread out, or have lots of outliers
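
To see why the median holds up better with outliers, here is a small sketch using made-up numbers (y and y_outlier are hypothetical vectors, not course data). Changing one value to an extreme outlier moves the mean a lot but leaves the median where it was.

```r
# Hypothetical values, invented for illustration only
y <- c(2, 3, 4, 5, 6)
sum(y) / length(y)  # 4: the mean formula written out
mean(y)             # 4: same result from the built-in function
median(y)           # 4: n is odd, so the middle value after sorting

# Replace the largest value with an extreme outlier
y_outlier <- c(2, 3, 4, 5, 96)
mean(y_outlier)     # 22: pulled toward the outlier
median(y_outlier)   # 4: unchanged

# With an even n, the median is the average of the middle two values
median(c(2, 3, 5, 6))  # 4
```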

Measures of variability: Minimum, Maximum, Range, Interquartile Range, Variance, Standard Deviation

Minimum: Lowest value after sorting
Maximum: Highest value after sorting
Range: Maximum-Minimum
Interquartile Range (IQR): If the sorted data are split at the quartiles Q1, Q2, and Q3, then IQR = Q3 - Q1
- Q2 is the median of the data, Q3 is the median of the upper half of the dataset and Q1 is the median of the lower half of the dataset
Variance and Standard Deviation
- both are based on deviations from the mean.
- A deviation is the distance a single value is from the mean.
  - For example, our mean was 17.83. For the value of 20, the deviation would be equal to 20 – 17.83 = 2.17. For the value of 12, the deviation would be 12 – 17.83 = -5.83.
  - If we do those for all of the values and then sum it up, we would get 0, which is not very useful. We need a method to get the absolute value of the deviations and, in a way, average them.
  - Use the Sum of Squares
    - $SS=\sum_{i=1}^{n}(y_{i}-\overline{y})^2$
  - Variance= $s^2 = SS/(n-1)$
  - Standard deviation= $s=\sqrt{s^2}= \sqrt{SS/(n-1)}$
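
The steps above can be sketched in R by doing the calculation by hand and comparing it with the built-in var() and sd() functions. The sample uses the same four values as object1 in the Functions section; the names y, deviations, SS, s2, and s are chosen here only for illustration.

```r
# Small illustrative sample (same values as object1 in the Functions section)
y <- c(55, 60, 35, 70)
n <- length(y)

deviations <- y - mean(y)  # distance of each value from the mean
sum(deviations)            # 0: the raw deviations cancel out

SS <- sum(deviations^2)    # sum of squares: 650
s2 <- SS / (n - 1)         # variance
s  <- sqrt(s2)             # standard deviation

# The built-in functions should agree with the hand calculation
all.equal(s2, var(y))      # TRUE
all.equal(s, sd(y))        # TRUE
```

Squaring the deviations makes every term positive, which is why the sum of squares does not collapse to 0 the way the raw deviations do.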

Functions

A function is a verb; it tells R to do something. To call an R function, we write the name of the function followed directly by (). Recall our use of the read.csv() and View() functions. The items passed to the function inside the () are called arguments. Arguments change the way a function behaves.
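
As a quick illustration of how an argument changes a function's behavior, here is a minimal sketch using round(), a base R function that these notes do not otherwise cover. Its digits argument controls how many decimal places are kept.

```r
# round() is a base R function; `digits` is one of its arguments
round(3.14159)              # 3: by default, digits = 0
round(3.14159, digits = 2)  # 3.14: the argument changes the result
```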

Example: Numeric variables

object1 <- c(55, 60, 35, 70)

Taking a Sum:

sum(object1)

## [1] 220

Taking the Mean:

mean(object1)

## [1] 55

Taking the Square root:

sqrt(object1)

## [1] 7.416198 7.745967 5.916080 8.366600

Taking the standard deviation:

sd(object1)

## [1] 14.7196

Most functions in R are vectorized, meaning that they work on a whole vector as well as on a single value. This means that in R we usually do not need to write loops as we would in other languages.
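
To make the idea of vectorization concrete, the sketch below doubles every value of object1 in one step and then does the same thing with an explicit loop. The loop is shown only for comparison; the names doubled and doubled_loop are made up for this example.

```r
object1 <- c(55, 60, 35, 70)

# Vectorized: one expression works on every element at once
doubled <- object1 * 2

# Equivalent explicit loop, shown only for comparison
doubled_loop <- numeric(length(object1))
for (i in seq_along(object1)) {
  doubled_loop[i] <- object1[i] * 2
}

identical(doubled, doubled_loop)  # TRUE: both approaches give 110 120 70 140
```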

summary() is a function that is useful for numeric variables. It gives you the minimum, Q1, median, mean, Q3, and maximum.

summary(object1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    35.0    50.0    57.5    55.0    62.5    70.0
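
One detail worth knowing: the quartiles that summary() reports come from R's default quantile rule, which interpolates between data points, so they may not match the "median of each half" definition given in the Descriptive Statistics section. The sketch below compares the two for object1. fivenum() is a base R function that returns Tukey's five-number summary; for this four-value vector its hinges happen to equal the medians of the lower and upper halves. The names sorted, q1_halves, and q3_halves are made up for this example.

```r
object1 <- c(55, 60, 35, 70)

# Default quantile rule, the one behind summary() and IQR()
quantile(object1, probs = c(0.25, 0.75))  # 50 and 62.5
IQR(object1)                              # 12.5

# Median-of-halves quartiles computed by hand
sorted <- sort(object1)
q1_halves <- median(sorted[1:2])  # lower half is 35, 55, so Q1 = 45
q3_halves <- median(sorted[3:4])  # upper half is 60, 70, so Q3 = 65
q3_halves - q1_halves             # 20

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(object1)                  # 35 45 57.5 65 70
```

Both answers are legitimate; they simply use different conventions, and the difference shrinks as the sample gets larger.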

Functions to inspect dataframes

The first thing we are going to do is read in data from a csv file. We are initially going to use the read.csv function which comes with your normal Base R installation. Later on, we will use a different version of this function. For now, just run the code and see what happens.

chocolate <- read.csv("chocolate.csv")

To View() the dataset in spreadsheet form, we can click on the dataset’s name in the Environment tab. Notice that this action is accompanied by some code in the console, telling us that we could also get there using code. Let’s try it both ways.

#Use the View/view function to investigate your dataset.
#View(chocolate)

R has a few different types of objects. We already saw a vector (a one-dimensional collection of items) when we created object1.

R’s dataframes store two-dimensional, tabular, heterogeneous data. Two-dimensional and tabular mean that the data form a table whose rows and columns are the two dimensions. Heterogeneous means that each column can contain a different type of data (e.g., one column, Age, is numeric while another, Gender, is character).

A dataset is considered “tidy” when each variable forms a column, each observation forms a row, and each cell only contains one piece of data. This means that the entries within a column should all be the same type as each other.
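
As a minimal sketch of these ideas, the code below builds a tiny dataframe by hand with data.frame(). The people object and its values are invented for illustration; only the Age and Gender column names come from the example above.

```r
# Hypothetical data, invented for illustration only
people <- data.frame(
  Age    = c(34, 27, 45),                  # numeric column
  Gender = c("female", "male", "female"),  # character column
  stringsAsFactors = FALSE                 # keep text as character in older R versions
)

dim(people)            # 3 rows, 2 columns
sapply(people, class)  # each column has its own type
```

Each row is one observation (a person), each column is one variable, and every entry within a column has the same type, so this small table is tidy.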

We can also directly call the dim() function to see the dimensions of the chocolate dataset.

dim(chocolate) #rows then columns

## [1] 2530   11

We can ask for the number of rows and the number of columns

nrow(chocolate)

## [1] 2530

ncol(chocolate)

## [1] 11

To see all the column headings we can call the function names()

names(chocolate)

##  [1] "ref"                              "company_manufacturer"            
##  [3] "company_location"                 "country_of_bean_origin"          
##  [5] "review_date"                      "rating"                          
##  [7] "specific_bean_origin_or_bar_name" "num_ingredients"                 
##  [9] "ingredients"                      "cocoa_percent"                   
## [11] "most_memorable_characteristics"

The two functions you’ll probably use the most to inspect dataframes, because they are the most descriptive, are summary(), which we used above on a vector, and str().

summary(chocolate)

##       ref       company_manufacturer company_location   country_of_bean_origin
##  Min.   :   5   Length:2530          Length:2530        Length:2530           
##  1st Qu.: 802   Class :character     Class :character   Class :character      
##  Median :1454   Mode  :character     Mode  :character   Mode  :character      
##  Mean   :1430                                                                 
##  3rd Qu.:2079                                                                 
##  Max.   :2712                                                                 
##                                                                               
##   review_date       rating      specific_bean_origin_or_bar_name
##  Min.   :2006   Min.   :1.000   Length:2530                     
##  1st Qu.:2012   1st Qu.:3.000   Class :character                
##  Median :2015   Median :3.250   Mode  :character                
##  Mean   :2014   Mean   :3.196                                   
##  3rd Qu.:2018   3rd Qu.:3.500                                   
##  Max.   :2021   Max.   :4.000                                   
##                                                                 
##  num_ingredients ingredients        cocoa_percent   
##  Min.   :1.000   Length:2530        Min.   : 42.00  
##  1st Qu.:2.000   Class :character   1st Qu.: 70.00  
##  Median :3.000   Mode  :character   Median : 70.00  
##  Mean   :3.042                      Mean   : 71.64  
##  3rd Qu.:4.000                      3rd Qu.: 74.00  
##  Max.   :6.000                      Max.   :100.00  
##  NA's   :163                                        
##  most_memorable_characteristics
##  Length:2530                   
##  Class :character              
##  Mode  :character              
##                                
##                                
##                                
##

str(chocolate)

## 'data.frame':    2530 obs. of  11 variables:
##  $ ref                             : int  5 15 15 15 15 15 15 24 24 24 ...
##  $ company_manufacturer            : chr  "Jacque Torres" "Green & Black's (ICAM)" "Guittard" "Neuhaus (Callebaut)" ...
##  $ company_location                : chr  "U.S.A." "U.K." "U.S.A." "Belgium" ...
##  $ country_of_bean_origin          : chr  "Ghana" "Blend" "Colombia" "Blend" ...
##  $ review_date                     : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ rating                          : num  2 2.5 3 2 2.75 2 3.5 3.75 2 2 ...
##  $ specific_bean_origin_or_bar_name: chr  "Trinatario Treasure" "Dark" "Chucuri" "West Africa" ...
##  $ num_ingredients                 : int  5 5 5 5 5 5 5 3 4 4 ...
##  $ ingredients                     : chr  "5- B,S,C,V,L" "5- B,S,C,V,L" "5- B,S,C,V,L" "5- B,S,C,V,L" ...
##  $ cocoa_percent                   : num  71 70 65 73 75 82 70 75 60 85 ...
##  $ most_memorable_characteristics  : chr  "gritty, unrefined, off notes" "mildly rich, basic, roasty" "creamy, sweet, floral, vanilla" "non descript, poor aftertaste" ...

Accessing variables from a dataframe

You might have noticed a $ in front of the variable names in the str() output. That symbol is how we access individual variables, or columns, from a dataframe.

The syntax we want is dataframe$columnname. Let’s look at the country_of_bean_origin column.

#Using a $ to call up specific variables or columns
chocolate$country_of_bean_origin

That line of code prints the whole column, which is over 2,500 observations long. Printing a long vector or column to the console is usually not useful.
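
Before printing a column, it can help to check how long it is. Here is a minimal sketch on a small made-up dataframe (candy is a hypothetical stand-in for chocolate). It also shows that double brackets, [[ ]], are another base R way to pull out one column.

```r
# Hypothetical stand-in for the chocolate data
candy <- data.frame(
  rating = c(3.25, 2.75, 3.5),
  origin = c("Ghana", "Peru", "Belize"),
  stringsAsFactors = FALSE
)

candy$origin          # dollar-sign access returns the column as a vector
candy[["origin"]]     # double-bracket access returns the same vector
length(candy$origin)  # 3: number of entries in the column
```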

head() is a function allowing us to look at just the first 6 entries

head(chocolate$country_of_bean_origin)

## [1] "Ghana"    "Blend"    "Colombia" "Blend"    "Sao Tome" "Blend"

What if we want to see the first 20 values?

Let’s see if we can find out by calling help on the head() function. To open the help page, put a ? before the function name. The help page shows how to format a call and which arguments are required. It opens in the Help tab in the lower right.

#Calling the help menu
?head

The help menu tells us what head() does and it also specifies the other arguments that we could input to the head() function in the Arguments section. This is always a good section to check out. Remember that an argument is an option we specify to a function to change how the function operates.

Let’s try adding the n = argument to head()

head(chocolate$country_of_bean_origin, n = 20)

##  [1] "Ghana"              "Blend"              "Colombia"          
##  [4] "Blend"              "Sao Tome"           "Blend"             
##  [7] "Blend"              "Blend"              "Blend"             
## [10] "Blend"              "Sao Tome"           "Dominican Republic"
## [13] "Madagascar"         "Papua New Guinea"   "Venezuela"         
## [16] "U.S.A."             "Venezuela"          "Venezuela"         
## [19] "Jamaica"            "Colombia"

Although we can specify the head function without naming the arguments, it is good practice to label the arguments to clarify what the code is doing. However, it is conventional to skip labeling the first argument, x, since its label is easily assumed.

You may have read in the help menu that head() has a companion function tail() that shows the last n rows

tail(chocolate$country_of_bean_origin, n = 20)

##  [1] "Dominican Republic" "Brazil"             "Belize"            
##  [4] "Vietnam"            "India"              "Peru"              
##  [7] "Venezuela"          "China"              "Vietnam"           
## [10] "Bolivia"            "Madagascar"         "U.S.A."            
## [13] "Venezuela"          "Venezuela"          "Peru"              
## [16] "Peru"               "Philippines"        "Papua New Guinea"  
## [19] "Indonesia"          "Malaysia"

Let’s calculate some descriptive statistics for the rating of each chocolate using the `rating` column/variable. Again, we will look at the equations for these functions later on.

mean(chocolate$rating)

## [1] 3.196344

sd(chocolate$rating)

## [1] 0.4453213

median(chocolate$rating)

## [1] 3.25

IQR(chocolate$rating)

## [1] 0.5

range(chocolate$rating)

## [1] 1 4
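
Note that range() returns the minimum and the maximum rather than the single number Maximum-Minimum from the Descriptive Statistics section. A small sketch, using a made-up ratings vector, shows how to get that difference:

```r
# Hypothetical ratings, invented for illustration only
ratings <- c(2, 2.5, 3, 3.75, 3.5)

range(ratings)               # 2 and 3.75: the minimum and the maximum
diff(range(ratings))         # 1.75: maximum minus minimum
max(ratings) - min(ratings)  # 1.75: same result written out
```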

For variables where there are missing values, we will need to include an argument that removes the missing values. Try to calculate the sum of the number of ingredients.

sum(chocolate$num_ingredients)

## [1] NA

NA will pop up. To avoid this issue, you need to remove the missing values.

?sum

Looking at the arguments section tells me that the argument I need to include is na.rm = TRUE

sum(chocolate$num_ingredients, na.rm = TRUE)

## [1] 7201
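
The same pattern can be tried on a tiny made-up vector, which makes it easier to see what na.rm = TRUE does. The ingredients vector below is hypothetical; is.na() is a base R function that flags missing values, so summing its result counts them.

```r
# Hypothetical values with one missing entry
ingredients <- c(5, 3, NA, 4)

sum(ingredients)                 # NA: one missing value makes the total unknown
sum(ingredients, na.rm = TRUE)   # 12: the NA is dropped before summing
mean(ingredients, na.rm = TRUE)  # 4: mean of the three remaining values
sum(is.na(ingredients))          # 1: how many values are missing
```

Other summary functions used in these notes, such as mean(), median(), and sd(), accept the same na.rm argument.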

Additional Functionality -> Installing and Loading Packages.

In order to use additional functionality in R, we bring in packages. The install.packages() function installs a package on your local machine, while the library() function loads the functions from that package into the current R session so that we can use them. You need to run library() every time you start a new R session (and in every script or markdown file that uses the package), but you do not need to run install.packages() every time.

#install.packages("tidyverse")
library(tidyverse)

In this case, the code opened a library containing R functions we want to use. You can think of libraries like apps on your phone. We have now opened the app so we can use it. The output in the console specifies which libraries we have loaded. We can see from the output of the library(tidyverse) line that the tidyverse is actually a megapackage containing 8 packages. All of these packages share a similar syntax in an attempt to simplify coding and readability for R users. Aside from the core tidyverse packages, there are around 10 other packages that are installed alongside them but are not loaded by library(tidyverse).

Below, we are using the read_csv function rather than the read.csv function. The read_csv function reads in your dataframe as a tibble, which is essentially a dataframe with some added functionality and aesthetic flourishes.

The read_csv function is only available once the readr package, which is part of the tidyverse, has been loaded.

new_chocolate <- read_csv("chocolate.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ref = col_double(),
##   company_manufacturer = col_character(),
##   company_location = col_character(),
##   country_of_bean_origin = col_character(),
##   review_date = col_double(),
##   rating = col_double(),
##   specific_bean_origin_or_bar_name = col_character(),
##   num_ingredients = col_double(),
##   ingredients = col_character(),
##   cocoa_percent = col_double(),
##   most_memorable_characteristics = col_character()
## )

R and Statistics Notes

Abigail Whitford

Spring 2022