R for Data Science, 2nd Edition
R Programming for Statistics and Data Science
The working directory in R is the folder that your R session reads from and writes to by default. Store your project’s files there so that you can load them with relative paths; files you save from R are also written there unless you give a different path.
Session > Set Working Directory > Choose Directory
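The same menu action can be done from code. This is a minimal sketch using base R's getwd and setwd; the folder used here (tempdir()) is only a stand-in for a real project folder and is an assumption for illustration.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# tempdir() stands in for a real project folder (hypothetical choice)
project_dir <- tempdir()

setwd(project_dir)   # change the working directory
getwd()              # confirm where R is now reading and writing
list.files()         # list the files R can load by relative path

setwd(old_dir)       # go back to where we started
```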
R packages are like toolkits or collections of pre-built functions, data sets, and tools that extend the capabilities of the R programming language.
Tidyverse is a collection of packages focused on data analysis and data visualizations that share an underlying design philosophy, grammar, and data structures.
| Package | Purpose |
|---|---|
| tibble | lighter and more user-friendly version of data frames |
| tidyr | create tidy and meaningfully arranged data |
| readr | better importation of data into R |
| ggplot2 | data visualization functions |
| dplyr | data manipulation tools |
| lubridate | clean dates and times |
| purrr | better functional programming |
| forcats | handle, clean, and manipulate categorical variables |
| haven | read and write data formats from proprietary statistical packages |
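A short sketch of how these packages are typically installed and loaded. It assumes the tidyverse meta-package is used; the install line is commented out because it only needs to run once per machine.

```r
# install.packages("tidyverse")  # run once to download the packages

# Attach the core tidyverse packages for this session
library(tidyverse)

# List every package that belongs to the tidyverse
tidyverse_packages()
```

Packages that are installed with the tidyverse but not attached by library(tidyverse), such as haven, need their own library call.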
The read_csv function from the readr package loads a CSV file into R as a tibble
read_csv("data_set.csv")
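To make the call above runnable end to end, the sketch below first writes a tiny CSV file to a temporary folder and then reads it back. The file contents and column names are made up for illustration.

```r
library(readr)

# Create a small example file (hypothetical contents) in a temporary folder
csv_path <- file.path(tempdir(), "data_set.csv")
writeLines(c("id,name,score", "1,Ana,90.5", "2,Ben,82"), csv_path)

# Read the file into a tibble; show_col_types = FALSE silences the column summary
data_set <- read_csv(csv_path, show_col_types = FALSE)
data_set
```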
read_excel - this function from the readxl package allows you to read Excel files into a tibble
The haven package allows you to read and export proprietary file formats from SPSS, SAS, and Stata
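A minimal sketch of both readers, using example files that ship with the packages themselves so that no external data is needed. The choice of these particular example files is an assumption for illustration.

```r
library(readxl)
library(haven)

# Example workbook bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")
excel_sheets(xlsx_path)             # names of the sheets in the workbook
iris_xl <- read_excel(xlsx_path)    # reads the first sheet by default

# Example SPSS file bundled with haven
sav_path <- system.file("examples", "iris.sav", package = "haven")
iris_spss <- read_sav(sav_path)
```

haven also provides read_dta for Stata files and read_sas for SAS files.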
The pipe operator (|> or %>%) passes the object on its left as the first argument to the function on its right, so you can chain several operations on one object in the order in which they run
| | PC | Mac |
|---|---|---|
| Pipe Operator | CTRL + SHIFT + M | CMD + SHIFT + M |
Let’s say you want to see the name, height, mass, and species of characters who were born on Tatooine.
With the pipe operator, your code becomes more organized and reads like a step-by-step process:
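The Tatooine example can be sketched as follows. It assumes the data is the starwars data set that ships with dplyr (the notes do not name the data set) and that "born on" corresponds to its homeworld column.

```r
library(dplyr)

# Without the pipe: nested calls have to be read from the inside out
nested <- select(
  filter(starwars, homeworld == "Tatooine"),
  name, height, mass, species
)

# With the pipe: each step reads top to bottom
tatooine <- starwars |>
  filter(homeworld == "Tatooine") |>   # keep characters from Tatooine
  select(name, height, mass, species)  # keep only the four variables

tatooine
```

Both versions return the same tibble; the piped version only changes how the code reads.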
view function: interactively explore the contents of a data frame in a separate viewer window or in the RStudio viewer pane.
glimpse function: a concise overview of the data, including variable types.
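A quick sketch of both functions on the starwars data set from dplyr (an assumed example data set). The View call is commented out because it only works in an interactive session.

```r
library(dplyr)

# Compact overview: one line per variable, with its type and first values
glimpse(starwars)

# Opens the data in a spreadsheet-style viewer (interactive sessions only)
# View(starwars)
```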
| Function | Purpose |
|---|---|
| filter | keeps or drops observations (rows) based on criteria applied to variables |
| select | keeps or drops variables (columns) |
| arrange | sorts observations by the values of one or more variables |
| mutate | changes the values of an existing variable OR creates a new variable from the values of other variables |
| group_by | groups observations that share the same value of a variable |
| summarise | computes descriptive statistics for a variable |
The filter function allows you to select rows in your data frame that meet specific conditions or criteria in one or more variables
Logical and comparison operators allow you to build criteria in your code
| Operator | Meaning |
|---|---|
| & | AND |
| \| | OR |
| == | EQUAL |
| != | NOT EQUAL |
| < | LESS THAN |
| > | GREATER THAN |
| <= | LESS THAN OR EQUAL |
| >= | GREATER THAN OR EQUAL |
Let’s filter the data frame for characters who have blue eyes and were born after 50 BBY.
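A sketch of this filter, assuming the dplyr starwars data set. Its birth_year column counts years before the Battle of Yavin (BBY), so larger numbers are earlier dates; under that reading, "born after 50 BBY" means a birth_year below 50. If the opposite comparison was intended, flip the operator.

```r
library(dplyr)

blue_eyed <- starwars |>
  # & requires both conditions to be TRUE for a row to be kept;
  # birth_year is in years BBY, so "after 50 BBY" is birth_year < 50
  filter(eye_color == "blue" & birth_year < 50)

blue_eyed |> select(name, eye_color, birth_year)
```

Rows where a condition evaluates to NA (for example, a missing birth_year) are dropped by filter.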
The select function allows you to keep or discard variables
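Keeping and discarding variables can be sketched like this, again assuming the starwars data set from dplyr.

```r
library(dplyr)

# Keep only the named variables
kept <- starwars |>
  select(name, height, mass)

# Discard variables by putting a minus sign in front of them
dropped <- starwars |>
  select(-films, -vehicles, -starships)

names(kept)
names(dropped)
```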
The mutate function creates new variables in your data or changes existing variables by performing calculations or transformations.
NOTE: if you give the result the name of an existing variable, it overwrites that variable. If you give it a new name, it creates a new variable
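Both behaviours from the note can be seen in one call. The sketch assumes the starwars data set, where height is recorded in centimetres and mass in kilograms; the new variable name height_m is made up for illustration.

```r
library(dplyr)

converted <- starwars |>
  select(name, height, mass) |>
  mutate(
    height_m = height / 100,  # new name: creates a new variable (metres)
    mass = mass * 2.20462     # existing name: overwrites mass (now in pounds)
  )

converted
```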
The arrange function allows you to sort observations by the values of one or more variables
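A small sketch of ascending and descending sorts, assuming the starwars data set.

```r
library(dplyr)

# Ascending order is the default
shortest_first <- starwars |>
  arrange(height)

# Wrap the variable in desc() for descending order
tallest_first <- starwars |>
  arrange(desc(height))

tallest_first |> select(name, height)
```

In both cases rows with a missing height are placed at the end.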
The group_by function allows you to group observations that share the same value of a variable
The summarise function allows you to get descriptive statistics for each group
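The two functions are usually used together. The sketch below assumes the starwars data set and computes a count and a mean height per species; the output column names n and mean_height are chosen for illustration.

```r
library(dplyr)

species_summary <- starwars |>
  group_by(species) |>                        # one group per species
  summarise(
    n = n(),                                  # number of characters in the group
    mean_height = mean(height, na.rm = TRUE)  # mean height, ignoring NAs
  )

species_summary
```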
The as.* functions (for example as.character), used along with mutate, allow you to change the data type of a variable. For this example we are going to convert the character_id variable so that it is stored as a character instead of a double
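The notes do not show the data set that contains character_id, so the sketch below builds a small hypothetical tibble with that column and converts it.

```r
library(dplyr)

# Hypothetical data: character_id starts out as a double
characters <- tibble(
  character_id = c(1, 2, 3),
  name = c("Luke", "Leia", "Han")
)
typeof(characters$character_id)

# Overwrite the variable with its character version
characters <- characters |>
  mutate(character_id = as.character(character_id))
typeof(characters$character_id)
```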
A variable can be redefined as a factor by using a function such as as.factor inside mutate.
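A minimal sketch of a factor conversion, assuming base R's as.factor and the gender column of the starwars data set.

```r
library(dplyr)

by_gender <- starwars |>
  mutate(gender = as.factor(gender))  # character vector becomes a factor

class(by_gender$gender)
levels(by_gender$gender)              # the distinct categories, in sorted order
```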
We can rename the values of observations within a variable using the mutate function in combination with the recode or recode_factor functions
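A sketch of both functions on the sex column of the starwars data set (an assumed example); the new variable names and the replacement labels are made up. Note the argument order: old value on the left, new value on the right.

```r
library(dplyr)

recoded <- starwars |>
  mutate(
    # recode keeps the original type (character); unmatched values stay as they are
    sex_short = recode(sex, male = "M", female = "F"),
    # recode_factor does the same but returns a factor
    sex_factor = recode_factor(sex, male = "M", female = "F")
  )

recoded |> select(name, sex, sex_short, sex_factor)
```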
The rename function allows you to rename variables in your data frame
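A short sketch with dplyr's rename, assuming the starwars data set; the new name planet is made up. The new name goes on the left and the existing name on the right.

```r
library(dplyr)

renamed <- starwars |>
  rename(planet = homeworld)  # new_name = old_name

names(renamed)
```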
Missing data (NA) in numeric variables causes many descriptive statistics, such as the mean, to return NA unless the missing values are handled
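The problem is easy to reproduce with the height column of the starwars data set (an assumed example), which contains some missing values.

```r
library(dplyr)

# A single NA makes the result NA
mean(starwars$height)

# na.rm = TRUE tells the function to ignore missing values
mean(starwars$height, na.rm = TRUE)
```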
The simplest fix removes all missing data from a data frame or variable.
We can also drop the NAs from just one variable.
You can also recode the NA values of a variable with mutate and replace_na.
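The three approaches can be sketched as follows. The notes do not name the function used for the first two, so tidyr's drop_na is an assumption here (base R's na.omit is another option); the data is a subset of starwars and the replacement label "unknown" is made up.

```r
library(dplyr)
library(tidyr)

people <- starwars |>
  select(name, height, mass, hair_color)

# Remove every row that has an NA in any variable
complete_rows <- people |>
  drop_na()

# Remove only the rows where mass is NA
has_mass <- people |>
  drop_na(mass)

# Keep all rows, but replace NA in hair_color with a label
filled <- people |>
  mutate(hair_color = replace_na(hair_color, "unknown"))
```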
The write_csv function allows us to export a data frame to a CSV file once we have finished cleaning it or have produced analysis results that we want to export
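A minimal export sketch, assuming a cleaned subset of the starwars data set and a temporary output folder; the file name is made up. List columns such as films cannot be written to CSV, so only plain columns are kept.

```r
library(dplyr)
library(readr)

cleaned <- starwars |>
  select(name, height, mass, species)

# Hypothetical output location; replace with a path in your project
out_path <- file.path(tempdir(), "starwars_clean.csv")
write_csv(cleaned, out_path)

# Read the file back to confirm the export worked
check <- read_csv(out_path, show_col_types = FALSE)
```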
We can even export data frames that we have been working on as proprietary files to work on in SPSS, SAS, or Stata
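A sketch of exporting with the haven package, using a small hypothetical tibble and temporary files. write_sav produces an SPSS file and write_dta a Stata file.

```r
library(dplyr)
library(haven)

# Hypothetical data to export
scores <- tibble(
  id = c(1, 2, 3),
  group = c("a", "b", "a"),
  score = c(90.5, 82, 77.25)
)

sav_path <- tempfile(fileext = ".sav")
dta_path <- tempfile(fileext = ".dta")

write_sav(scores, sav_path)  # SPSS
write_dta(scores, dta_path)  # Stata

# Read the SPSS file back to confirm the round trip
round_trip <- read_sav(sav_path)
```

For SAS, haven offers write_xpt, which writes the SAS transport format.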
psych package - built-in functions for factor analysis, reliability analysis, descriptive statistics and data visualization.
summarytools package - simplifies data exploration and descriptive statistics generation for data frames and vectors.
DataExplorer package - automates and streamlines the process of exploring and visualizing datasets.
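One entry-point function from each package is sketched below on three numeric columns of the starwars data set (an assumed example). These packages are not part of the tidyverse, so each call is guarded and only runs if the package is installed.

```r
library(dplyr)

numeric_vars <- starwars |>
  select(height, mass, birth_year)

# psych: table of descriptive statistics per variable
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(numeric_vars))
}

# summarytools: descriptive statistics for numeric variables
if (requireNamespace("summarytools", quietly = TRUE)) {
  print(summarytools::descr(numeric_vars))
}

# DataExplorer: basic facts about the data set (rows, columns, missing values)
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  print(DataExplorer::introduce(numeric_vars))
}
```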