Intro to R Help File

Author

kasey mccormick

Intro

A go-to reference of things learned and helpful commands: an ever-evolving cheatsheet on how to get R to do what I want in RStudio.

Computer setup

  1. Download R
  2. Download RStudio

Libraries Used

Install the packages first if this is the first time you are using them.

#install.packages(c("readr","readxl","haven","rgl","dslabs","ggplot2","stringr","dplyr"))
# packages can also be installed individually, e.g. install.packages("readr")

Then load the packages with library() each time RStudio is opened.

library(readr)
library(readxl)
library(haven)
library(rgl)
library(dslabs)
library(dplyr)
library(ggplot2)
library(stringr)

You can confirm loading worked by looking for a checkmark next to the package in the Packages tab. One library() call per line is advised for readability.
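To avoid reinstalling packages that are already there, the install step can be limited to whatever is missing. This is a minimal sketch in base R; the names `wanted` and `missing_pkgs` are made up for illustration, and the install line is left commented out like the one above.

```r
# Packages this cheatsheet relies on
wanted <- c("readr", "readxl", "haven", "rgl", "dslabs",
            "ggplot2", "stringr", "dplyr")

# installed.packages() lists everything installed; setdiff() keeps what is not in that list
missing_pkgs <- setdiff(wanted, rownames(installed.packages()))

# Install only what is missing (uncomment to run)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

print(missing_pkgs)  # character(0) when everything is already installed
```

The library() calls are still needed afterwards; installing a package does not load it.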

readr

For reading CSV files with read_csv()

readxl

To read Excel files with read_xlsx()

haven

To read SPSS .sav files with read_sav()

rgl

To read 3D object (.obj) files with readOBJ()

dslabs

Set of practice datasets

  • gapminder: Health and income outcomes for 184 countries from 1960 to 2016.
  • movielens: Movies and their ratings
  • heights: Self-reported Heights in Inches
  • murders: US gun murders by state for 2010
  • olive: Italian olive
  • us_contagious_diseases: Contagious disease data for US states

stringr

# str_detect() checks whether each string contains a pattern (case-sensitive unless you say otherwise)
# for the mtcars dataframe, filter the rows by looking at the row names and keeping the ones with Mazda
mtcars |>
  filter(str_detect(row.names(mtcars), "Mazda"))
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
# for the murders dataframe, filter the rows to regions containing the whole word South;
# \\b marks a word boundary, so a longer word like Southampton would not match
murders |>
  filter(str_detect(region, regex("\\bSouth\\b"))) 
                  state abb region population total
1               Alabama  AL  South    4779736   135
2              Arkansas  AR  South    2915918    93
3              Delaware  DE  South     897934    38
4  District of Columbia  DC  South     601723    99
5               Florida  FL  South   19687653   669
6               Georgia  GA  South    9920000   376
7              Kentucky  KY  South    4339367   116
8             Louisiana  LA  South    4533372   351
9              Maryland  MD  South    5773552   293
10          Mississippi  MS  South    2967297   120
11       North Carolina  NC  South    9535483   286
12             Oklahoma  OK  South    3751351   111
13       South Carolina  SC  South    4625364   207
14            Tennessee  TN  South    6346105   219
15                Texas  TX  South   25145561   805
16             Virginia  VA  South    8001024   250
17        West Virginia  WV  South    1852994    27
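To see what the word boundary and the case setting each do, here is a small sketch on a made-up vector (`places` is not from any dataset). It assumes stringr is installed.

```r
library(stringr)

places <- c("South", "Southampton", "south")

# Whole word, case-sensitive: only the exact word "South" matches
exact <- str_detect(places, regex("\\bSouth\\b"))
print(exact)      # TRUE FALSE FALSE

# Same pattern, ignoring case: "south" now matches too
any_case <- str_detect(places, regex("\\bSouth\\b", ignore_case = TRUE))
print(any_case)   # TRUE FALSE TRUE
```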

dplyr

documentation link

  • Part of the tidyverse group of packages. Loading dplyr also makes the %>% pipe available, which comes from the magrittr package; the |> pipe is built into R itself (version 4.1 and later).
# piping passes the result of one step forward to the next, which makes multiple steps easier to read and write
# a pipe can be written as %>% or |>

#filter keeps rows that match a condition
mtcars |>
filter(cyl == 4)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# mutate() helps you create, modify, and delete columns
# mutate to create a column
mtcars |>
  mutate(price = "unknown")|>
  head(2) 
              mpg cyl disp  hp drat    wt  qsec vs am gear carb   price
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4 unknown
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 unknown
# mutate to fill a column based on a condition (inside mutate, refer to columns by name)
mtcars |>
  mutate(price = ifelse(cyl > 6, yes = "high", no = "low")) |>
  head(2)
              mpg cyl disp  hp drat    wt  qsec vs am gear carb price
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4   low
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4   low
# mutate to delete column
mtcars |>
  mutate(vs=NULL, drat=NULL)|>
  head(2)
              mpg cyl disp  hp    wt  qsec am gear carb
Mazda RX4      21   6  160 110 2.620 16.46  1    4    4
Mazda RX4 Wag  21   6  160 110 2.875 17.02  1    4    4
# group_by() creates a grouped table; later operations are performed per group
mtcars |> 
  group_by(cyl)|>
  head(2)
# A tibble: 2 × 11
# Groups:   cyl [1]
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
2    21     6   160   110   3.9  2.88  17.0     0     1     4     4
#arrange, orders the rows of a data frame  
mtcars |>
  arrange(desc(cyl))|>
  head(2)
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
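On its own group_by() only marks the groups; the grouping shows its effect once a summary follows it. A minimal sketch, assuming dplyr is installed (`by_cyl` is just a name picked for the example):

```r
library(dplyr)

# One summary row per cylinder group
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg),  # average mpg within the group
            n        = n())        # number of cars in the group

print(by_cyl)  # three rows: one each for 4, 6, and 8 cylinders
```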

ggplot2

ggplot2 is part of the tidyverse collection of packages. documentation link

#start with piping data into ggplot
# add aesthetics  
## what data to use
## how to color
# define geometry type of graph

iris |>
  ggplot(aes(Petal.Length, Petal.Width, color=Species)) +
  geom_point()

iris |>
  ggplot(aes(Petal.Length)) +
  geom_histogram(fill="steelblue",
                 binwidth = 0.5)

iris |>
  ggplot(aes(Petal.Length, fill = Species)) +
  # position = "identity" lets bars overlap; alpha controls transparency.
  geom_histogram(position = "identity", alpha = 0.6,binwidth = .5)+
  #can add themes to adjust several visuals at once
  ggplot2::theme_minimal()

aesthetics

theme documentation link

Aesthetics can be set in ggplot() for the whole plot or in a single geom, e.g. ggplot(aes()) or geom_point(aes()). Common aesthetics for points:

  • x, y: position (required)
  • alpha: transparency
  • color: point color
  • fill: fill color
  • shape: point shape, 0-25 (16 is a solid circle)
  • size: point size
  • stroke: border thickness

geom_histogram() understands these aesthetics and arguments:

  • x: position (required)
  • fill: fill color
  • color: border color
  • binwidth: width of each bin (if neither binwidth nor bins is set, ggplot2 uses 30 bins)
  • position: position adjustment, e.g. "identity", "dodge", "fill"
  • alpha: transparency
  • size: border thickness
  • weight: weight for each observation (optional)
  • linetype: line type for borders (optional)

plot examples/types

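  • geom_point() - scatter plot; draws one point per row, ideal for two numeric variables.
  • geom_histogram() - histogram; bars count how many values of one numeric variable fall in each bin.
  • geom_smooth() - adds a smoothed trend line (method = "lm" gives a straight regression line).
  • geom_boxplot() - box plot; summarizes a numeric variable by group (median, quartiles, outliers).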
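The last two geoms in the list have no example elsewhere on this page, so here is a short sketch with iris. It assumes ggplot2 is installed; the object names `p_smooth` and `p_box` are only for illustration.

```r
library(ggplot2)

# Scatter plot with a straight-line fit per species
# (method = "lm" asks for a linear model; se = FALSE hides the confidence band)
p_smooth <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Box plot: one box per species for a single numeric variable
p_box <- ggplot(iris, aes(Species, Petal.Length, fill = Species)) +
  geom_boxplot()

p_smooth  # printing the object draws the plot
p_box
```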

Importing data

# loads saved objects to the Environment
# load("filename_Workspace.RData") 

Data can also be imported through the RStudio GUI, but the matching package must be installed. Refer to the Libraries Used section to see which package reads which file type.

CSV

  1. environment pane
  2. import dataset
  3. from text (provided by readr)
  4. find and select csv file

The console will show the code it ran; you can reuse that syntax or keep using the GUI.

 # dataFrameNameYouWant <- read_csv("csv_fileName.csv")
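To try the import without hunting for a file, a round trip through a temporary file works. This sketch uses base R's write.csv() and read.csv() so it runs without any packages; the names `csv_path` and `cars_from_csv` are made up for the example.

```r
# Write a small dataframe to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), csv_path, row.names = FALSE)

# read.csv() is base R; with readr loaded, read_csv(csv_path) takes the same path
cars_from_csv <- read.csv(csv_path)

dim(cars_from_csv)  # 3 rows, 11 columns
```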

Excel (xlsx)

  1. environment pane
  2. import dataset
  3. from excel
  4. find and select the excel file

The console will show the code it ran; you can reuse that syntax or keep using the GUI.

  # dataFrameNameYouWant <- read_excel("excelFileName.xlsx")

SPSS .sav file

  1. environment pane
  2. import dataset
  3. from spss
  4. find and select the .sav file

The console will show the code it ran; you can reuse that syntax or keep using the GUI.

# dataFrameNameYouWant <- read_sav("file_path/andfilename.sav")

3D object file

To import a 3D mesh you need the full file path, which can be found two ways:

# file.choose()

or

  1. files pane
  2. navigate to file
  3. click the triangle and copy the absolute path

Use that path to import the 3D mesh

human_skull <- readOBJ("filepath_found_above.obj")

Once the stop sign in the bottom pane disappears and the data is imported, you can view the 3D object with

open3d(); shade3d(human_skull, col = "white"); bg3d("black")

Cleaning data

dealing with NA

set.seed(3) # setting a seed for reproducibility
Ninety_Nine_Red_NAs <- sample(c(1:5, rep(NA, 99))) # creating a vector with 1-5 and 99 NAs in random order
print(Ninety_Nine_Red_NAs)
  [1]  5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA  3 NA NA NA NA NA NA NA NA NA NA NA NA NA  1 NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA  4 NA NA NA NA NA NA  2 NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA
anyNA(Ninety_Nine_Red_NAs) # detect NAs: TRUE if one or more NA exists, FALSE if none exist
[1] TRUE
which(is.na(Ninety_Nine_Red_NAs))  #index of positions where NA exists
 [1]   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
[20]  21  22  23  24  25  26  27  28  29  30  31  32  34  35  36  37  38  39  40
[39]  41  42  43  44  45  46  48  49  50  51  52  53  54  55  56  57  58  59  60
[58]  61  63  64  65  66  67  68  70  71  72  73  74  75  76  77  78  79  80  81
[77]  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
[96] 101 102 103 104
which(!is.na(Ninety_Nine_Red_NAs)) # indexes of positions where NA doesn't exist
[1]  1 33 47 62 69
Total_Missing_99NA  <- sum(is.na(Ninety_Nine_Red_NAs)) # count how many na's exist
print(Total_Missing_99NA)
[1] 99
Prop_Missing_99NA   <- mean(is.na(Ninety_Nine_Red_NAs)) # proportion of the values that are NA (99 out of 104)
print(Prop_Missing_99NA)
[1] 0.9519231
vec_removed <- Ninety_Nine_Red_NAs[!is.na(Ninety_Nine_Red_NAs)] #remove na's just leave data
print(vec_removed)
[1] 5 3 1 4 2
vec_imputed <- Ninety_Nine_Red_NAs                       # ▶ Make a copy to preserve original
vec_imputed[is.na(vec_imputed)] <- mean(vec_imputed, na.rm = TRUE) #put the mean in place of na's
print(vec_imputed)
  [1] 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [38] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 2 3 3 3 3 3
 [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

You can choose to remove the NAs or replace them with the mean of the data that exists.
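The same checks carry over from a vector to a dataframe. A minimal sketch on a made-up dataframe (`scores` and its columns are invented for the example):

```r
# Small made-up dataframe with a few missing values
scores <- data.frame(id    = 1:5,
                     test1 = c(90, NA, 75, 60, NA),
                     test2 = c(88, 92, NA, 70, 65))

colSums(is.na(scores))            # number of NAs in each column: 0, 2, 1
complete_rows <- na.omit(scores)  # drops every row that has at least one NA
nrow(complete_rows)               # 2, only rows 1 and 4 are complete
```

Note that na.omit() on a dataframe removes the whole row, so the values that were present in the other columns of that row are lost as well.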

Saving your work

You’ll want to save both your .R file, which contains the code, and an RData file, which contains the objects. To save the .R file, use the keyboard shortcut for save, click the floppy disk icon, or choose File > Save; saving as you go is advised. To save the RData file, it’s easiest to run the following command in the console. So far it’s advised to end the file name with a Workspace suffix.

save.image("filename_Workspace.RData")

This will save the file in the set working directory. You can find out which directory is set with

# getwd()

If it’s not the location you want you can set the directory with

# setwd("file/path")

You can also set the working directory in RStudio from the Files pane (usually the bottom-right pane):

  1. files tab
  2. navigate to desired folder
  3. hit gear icon
  4. select set working directory (the console will output setwd("filepath/you/chose"))
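To check that saving and loading really bring objects back, here is a small round trip. It is a sketch that uses save() on one hypothetical object and a temporary file, rather than save.image() on the whole workspace, so it doesn't touch any real files.

```r
# Hypothetical object to save
favorite_numbers <- c(3, 7, 42)

# Save just this object to a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(favorite_numbers, file = rdata_path)

# Remove it from the environment, then bring it back with load()
rm(favorite_numbers)
load(rdata_path)

print(favorite_numbers)  # 3 7 42
```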

Viewing Data

food <- c("apple", "banana", "dark chocolate") # this is a vector of character strings
class(food) # tells you what type of data the vector is
[1] "character"
rm(list=ls()) # removes all objects from the environment
# can also be done by clicking the broom icon (🧹) in environment tab and console job

# mtcars is a dataframe; head() shows its first six rows.
# I use this to get an idea of what data is there, mainly the column names
head(mtcars) 
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# can also ask for a specific number of rows rather than six
mtcars |> head(2)
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
# dim() is base R and shows the row and column counts
# I use this to get an idea of how large the dataframe is
dim(mtcars) 
[1] 32 11
#number of columns, if I need to know col number only
ncol(mtcars)
[1] 11
#number of rows, if I need to know row number only
nrow(mtcars)
[1] 32
# View() opens the dataframe in another tab so you can look through the table;
# I use it to understand the data and plan next steps when I've never seen it before
#View(mtcars)

# summary of each column's data in the dataframe;
# I use it when I need to make a decision based on the range of the values, or to get
# a high-level understanding
summary(mtcars) 
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
# print() shows the dataframe in the console, or on the page in a qmd, for a quick view of the
# information; I typically print vectors rather than whole dataframes
print(mtcars) 
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# summarize() collapses the dataframe into one row holding the summaries you define
mtcars |> summarize(mean_mpg = mean(mpg))
  mean_mpg
1 20.09062
# summarize() can compute multiple summaries in one call, separated by commas
mtcars |> summarize(mean_mpg = mean(mpg),
            sd_mpg   = sd(mpg), # standard deviation
            n        = n())     # number of rows
  mean_mpg   sd_mpg  n
1 20.09062 6.026948 32

Rows

# how many rows in dataframe
nrow(mtcars) 
[1] 32
#filter keeps ROWS that match a condition
filter(mtcars, mpg >= 25) 
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# same filter written with a pipe, with the result saved as a new dataframe
more_25 <- mtcars |> filter(mpg >= 25)
print(more_25)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
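For comparison, the same row filter can be written without dplyr. A minimal sketch in base R (`more_25_base` and `more_25_subset` are names made up here):

```r
# Base R version: a logical test before the comma in [ , ] keeps the rows where it is TRUE
more_25_base <- mtcars[mtcars$mpg >= 25, ]
nrow(more_25_base)   # 6, the same six cars as filter(mtcars, mpg >= 25)

# subset() is a base R shortcut that lets you use column names directly
more_25_subset <- subset(mtcars, mpg >= 25)
rownames(more_25_subset)
```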

Columns

# how many columns in dataframe
ncol(mtcars) 
[1] 11
#select/show cyl column data from mtcars dataframe
mtcars$cyl 
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
# keep all rows but drop columns 8 and 9 (vs and am) with a negative index vector
mtcars[,-c(8,9)] 
                     mpg cyl  disp  hp drat    wt  qsec gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60    4    2
# show only specific columns of the dataframe
# select() is from dplyr; mtcars |> select(mpg, cyl, hp) is the same call written with a pipe
select(mtcars, mpg, cyl,hp) 
                     mpg cyl  hp
Mazda RX4           21.0   6 110
Mazda RX4 Wag       21.0   6 110
Datsun 710          22.8   4  93
Hornet 4 Drive      21.4   6 110
Hornet Sportabout   18.7   8 175
Valiant             18.1   6 105
Duster 360          14.3   8 245
Merc 240D           24.4   4  62
Merc 230            22.8   4  95
Merc 280            19.2   6 123
Merc 280C           17.8   6 123
Merc 450SE          16.4   8 180
Merc 450SL          17.3   8 180
Merc 450SLC         15.2   8 180
Cadillac Fleetwood  10.4   8 205
Lincoln Continental 10.4   8 215
Chrysler Imperial   14.7   8 230
Fiat 128            32.4   4  66
Honda Civic         30.4   4  52
Toyota Corolla      33.9   4  65
Toyota Corona       21.5   4  97
Dodge Challenger    15.5   8 150
AMC Javelin         15.2   8 150
Camaro Z28          13.3   8 245
Pontiac Firebird    19.2   8 175
Fiat X1-9           27.3   4  66
Porsche 914-2       26.0   4  91
Lotus Europa        30.4   4 113
Ford Pantera L      15.8   8 264
Ferrari Dino        19.7   6 175
Maserati Bora       15.0   8 335
Volvo 142E          21.4   4 109
# mutating data in a column, for example recoding number values as low, medium, or high
quake_recode <- quakes |> 
  mutate(mag = ifelse(mag <= 4.5, "low",
  ifelse(mag <=5.5,"medium", "high")))
head(quake_recode)
     lat   long depth    mag stations
1 -20.42 181.62   562 medium       41
2 -20.62 181.03   650    low       15
3 -26.00 184.10    42 medium       43
4 -17.97 181.66   626    low       19
5 -20.42 181.96   649    low       11
6 -19.68 184.31   195    low       12
# it's usually better to put the recoded data in a new column, so the original values are kept
quake_recode_new <- quakes |> 
  mutate(mag_level = ifelse(mag <= 4.5, "low",
  ifelse(mag <=5.5,"medium", "high")))
head(quake_recode_new)
     lat   long depth mag stations mag_level
1 -20.42 181.62   562 4.8       41    medium
2 -20.62 181.03   650 4.2       15       low
3 -26.00 184.10    42 5.4       43    medium
4 -17.97 181.66   626 4.1       19       low
5 -20.42 181.96   649 4.0       11       low
6 -19.68 184.31   195 4.0       12       low
#get names of columns
colnames(quakes) 
[1] "lat"      "long"     "depth"    "mag"      "stations"
# change the name of a column
colnames(quake_recode_new)[colnames(quake_recode_new) == "mag_level"] <- "scared_level"

head(quake_recode_new,2)
     lat   long depth mag stations scared_level
1 -20.42 181.62   562 4.8       41       medium
2 -20.62 181.03   650 4.2       15          low
# adding a column is covered in the dplyr section
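When the recode is just cutting a number into ranges, base R's cut() is an alternative to nesting ifelse(). This is a sketch of that swap, not the method used above; the variable names are made up.

```r
# cut() bins a numeric vector; intervals are right-closed by default,
# so these breaks mean: mag <= 4.5, 4.5 < mag <= 5.5, mag > 5.5
mag_level_cut <- cut(quakes$mag,
                     breaks = c(-Inf, 4.5, 5.5, Inf),
                     labels = c("low", "medium", "high"))

# Same recode with nested ifelse(), for comparison
mag_level_ifelse <- ifelse(quakes$mag <= 4.5, "low",
                    ifelse(quakes$mag <= 5.5, "medium", "high"))

# cut() returns a factor, so convert to character before comparing
all(as.character(mag_level_cut) == mag_level_ifelse)
```

One difference to keep in mind: cut() gives back a factor with ordered levels, while ifelse() gives back plain character strings.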

Combinations

Combination in this case means affecting both columns and rows.

# Scenario: Pick cars that weigh < 3,000 lb, have hp between 110 and 180, and
# cylinders IN {4,6}.

light_midpower <- mtcars[mtcars$wt < 3 &           # weight is in 1000 lb
                         mtcars$hp >= 110 & 
                         mtcars$hp <= 180 &
                         mtcars$cyl %in% c(4, 6), ]
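The code above leaves the spot after the comma empty, so every column is kept. To also pick columns, put a vector of column names there. A minimal sketch (`light_midpower_small` is a made-up name):

```r
# Same row conditions as above, plus a vector of column names after the comma
light_midpower_small <- mtcars[mtcars$wt < 3 &
                               mtcars$hp >= 110 &
                               mtcars$hp <= 180 &
                               mtcars$cyl %in% c(4, 6),
                               c("mpg", "cyl", "hp", "wt")]

dim(light_midpower_small)  # 4 rows, 4 columns
```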

Built in datasets

  • mtcars: Motor Trend Car Road Tests
  • CO2: Carbon Dioxide Uptake in Grass Plants
  • iris: Edgar Anderson’s Iris Data
  • quakes: Locations of Earthquakes off Fiji

Data wrangling

# setting a seed for reproducibility
set.seed(12)

# sample() randomly chooses from the values passed in
# 1:10 is 1,2,3,4,5,6,7,8,9,10
# I didn't say how many I needed, so it returns every number in random order
example_vec <- sample(1:10)
example_vec # so it prints
 [1]  2  7  3  6  5  9  4 10  8  1
#sum add it all up
sum(example_vec) 
[1] 55
#find the mean, add every number then divide by how many numbers
mean(example_vec)
[1] 5.5
# str() shows the structure: the type of data, the length, and the values
str(example_vec)
 int [1:10] 2 7 3 6 5 9 4 10 8 1
# standard deviation: how much the values vary around their mean
sd(example_vec) 
[1] 3.02765
# middle value
median(example_vec)
[1] 5.5
# seq() on a vector returns its positions 1 to length (same as seq_along()); it does not sort the values
seq(example_vec)
 [1]  1  2  3  4  5  6  7  8  9 10
# remove the first item in the vector
example_vec[-1]
[1]  7  3  6  5  9  4 10  8  1
#create vector of negative numbers, to be used as indexes
list_remove <- c(-1,-3)

#remove by specific index
example_vec[list_remove]
[1]  7  6  5  9  4 10  8  1
# sapply() applies a function to each element of a list or vector
# in this case counting rows in each dataframe
# output as vector
sapply(list(mtcars, murders, heights), nrow)
[1]   32   51 1050
# lapply() applies a function to each element of a list or vector
# output as list
lapply(list(mtcars, murders, heights), nrow)
[[1]]
[1] 32

[[2]]
[1] 51

[[3]]
[1] 1050
# find the unique values in the data, in this case which
# diseases are in the disease column of the us_contagious_diseases dataframe
unique(us_contagious_diseases$disease) 
[1] Hepatitis A Measles     Mumps       Pertussis   Polio       Rubella    
[7] Smallpox   
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
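Because example_vec happens to hold the numbers 1 to 10, seq() looks like it sorted them. A small sketch on a made-up vector shows the difference between positions and sorted values:

```r
values <- c(30, 10, 20)

seq_along(values)  # 1 2 3: the positions, regardless of the values
sort(values)       # 10 20 30: the values in increasing order
order(values)      # 2 3 1: which positions to take to get the sorted order
values[order(values, decreasing = TRUE)]  # 30 20 10
```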

If else

ifelse(test = c(1, 2, 3, 4, 5) > 3, yes = "Big", no = "Small") # like ternary in javascript
[1] "Small" "Small" "Small" "Big"   "Big"  
object <- c("Yellow", "banana", "apple","red")
ifelse(object == "Yellow",  yes="color", no="not a color") # depending on the test, recode each value to the yes value (TRUE) or the no value (FALSE)
[1] "color"       "not a color" "not a color" "not a color"
# nested ifelse() calls for more than two outcomes
object2 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
ifelse(object2 == 1, "One",
       ifelse(object2 == 2, "Two",
              ifelse(object2 == 3, "Three",
                     ifelse(object2 == 4, "Four","Five" ))))
 [1] "One"   "One"   "Two"   "Two"   "Three" "Three" "Four"  "Four"  "Five" 
[10] "Five" 
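When the values being recoded are simply 1, 2, 3 and so on, a lookup vector can stand in for the nested ifelse(). This is a sketch of that alternative; `number_words` is a made-up name.

```r
object2 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)

# Position 1 holds "One", position 2 holds "Two", and so on
number_words <- c("One", "Two", "Three", "Four", "Five")

# Indexing with object2 picks the matching word for every element
number_words[object2]
```

Unlike the nested version, which labels anything that is not 1 to 4 as "Five", the lookup returns NA for a value with no matching position.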

Getting time information

Sys.Date()      # current date (yyyy‑mm‑dd)
[1] "2025-06-02"
Sys.time()      # current date + time
[1] "2025-06-02 01:17:42 MDT"
Sys.timezone()  # your local timezone string
[1] "America/Boise"
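Dates can be reformatted and subtracted once they are Date objects. A minimal sketch using a fixed date so the results repeat (`report_date` is a made-up name):

```r
# A fixed date so the results are the same on every run
report_date <- as.Date("2025-06-02")

format(report_date, "%d/%m/%Y")   # "02/06/2025"
format(report_date, "%Y")         # "2025", the year only

# Subtracting two dates gives the number of days between them
as.numeric(as.Date("2025-12-25") - report_date)   # 206
```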

Base R plotting

# roll a die 100 times; the die has six sides, and replace = TRUE puts each number back so it can be rolled again
dice <- sample(1:6, 100, replace = TRUE) 

# show a histogram (bars counting how often each value occurs)
hist(dice,
     breaks = seq(0.5, 6.5, by = 1), # one bin per face, centered on 1 to 6
     col    = "lightblue",           # color of the bars
     main   = "100 Fair Die Rolls",  # main is the title of the plot
     xlab   = "Face Value")          # x-axis label

# breaks can also name an algorithm, e.g. "Sturges" (the default) or "Scott"
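Since die faces are whole numbers, counting them with table() and drawing a bar chart is another option. A sketch under the assumption that a count per face is what's wanted; the seed value and the name `face_counts` are arbitrary.

```r
set.seed(1)  # arbitrary seed so the rolls repeat
dice <- sample(1:6, 100, replace = TRUE)

# factor(levels = 1:6) keeps a face in the table even if it was never rolled
face_counts <- table(factor(dice, levels = 1:6))
print(face_counts)

# Bar chart of the counts, one bar per face
barplot(face_counts,
        col  = "lightblue",
        main = "100 Fair Die Rolls",
        xlab = "Face Value",
        ylab = "Count")
```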

Creating a function

# a vector of key words to look for
vector_to_loop_through <- c("trust","ethics")
# the sentences to search through
vector_to_look_through <- c("Sentences from a paper, where you're trying to see if they talk about any key words you're looking for", "in this example looking for a sentence with the word trust, pulling the whole sentence to grab context" )


# for every word in vector_to_loop_through, go through vector_to_look_through and use str_detect() to find
# which sentences contain it as a whole word, ignoring case; the result is a list of sentence indexes
Index <- lapply(vector_to_loop_through, function(w) {
  which(str_detect(vector_to_look_through, regex(paste0("\\b", w, "\\b"),
                                         ignore_case = TRUE)))
})
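The code above uses an unnamed function inside lapply(). The same idea can be wrapped in a named function so it can be reused. This is a sketch only: `find_keyword_sentences` is a hypothetical helper, and it uses base R's grepl() in place of str_detect() so it runs without stringr.

```r
# Hypothetical helper: for each keyword, return the sentences that contain it
# as a whole word, ignoring case
find_keyword_sentences <- function(sentences, keywords) {
  hits <- lapply(keywords, function(w) {
    pattern <- paste0("\\b", w, "\\b")
    sentences[grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)]
  })
  names(hits) <- keywords  # label each list element with its keyword
  hits
}

paper <- c("Trust matters here.",
           "Nothing relevant in this one.",
           "Ethics and trust go together.")

found <- find_keyword_sentences(paper, c("trust", "ethics"))
found$trust   # sentences 1 and 3
found$ethics  # sentence 3
```

Returning the sentences themselves, instead of their indexes, makes it easier to read the context right away.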

Create dataframe

# set a seed for reproducibility
set.seed(8)
# rep("Cereal", 30) repeats "Cereal" 30 times
# rep("Marshmallow", 12) repeats "Marshmallow" 12 times
# sample() shuffles all 42 values because no size is given
Type  <- sample(c(rep("Cereal", 30), rep("Marshmallow", 12)))
# ifelse(test, yes, no): where Type is "Cereal" the shape is "Plain";
# otherwise the shape comes from a random draw of marshmallow shapes
# (sample() draws length(Type) shapes, with replacement)
Shape <- ifelse(Type == "Cereal", "Plain", 
                sample(c("Heart", "Star", "Horseshoe", "Clover",
                        "Blue_Moon", "Pot_of_Gold", "Rainbow", 
                        "Red_Balloon"), length(Type), replace = TRUE))
#create a dataframe called snack_box with the type and shape vectors 
snack_box <- data.frame(Type, Shape)

head(snack_box,2)
         Type     Shape
1 Marshmallow   Rainbow
2 Marshmallow Horseshoe
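A quick way to check that a built vector or column holds what was intended is table(). A small sketch that rebuilds the Type vector from above (`type_counts` is a made-up name):

```r
set.seed(8)
Type <- sample(c(rep("Cereal", 30), rep("Marshmallow", 12)))

type_counts <- table(Type)
print(type_counts)        # 30 Cereal and 12 Marshmallow, whatever the seed
prop.table(type_counts)   # the same counts as proportions
```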

Troubleshooting

Getting help

help(sample) 

?sample

# both open the Help pane, in this case with information about the sample() function

Naming

Use this_style (snake 🐍) or thisStyle (camel 🐪). Names with spaces, like this style, are not advised for CSV and other data files, or for vector and dataset names.

Object rules

  1. names have to start with a letter, or with a . that is not followed by a number
  2. can contain letters, numbers, _ and . but no spaces
  3. are case‑sensitive (Data vs data)
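Base R's make.names() shows how R would repair a name that breaks these rules, which is a quick way to test a name. The example names below are made up.

```r
candidate_names <- c("ok_name", "1name", "my data")

# make.names() turns each string into a syntactically valid name
fixed <- make.names(candidate_names)
print(fixed)   # "ok_name" "X1name" "my.data"

# TRUE where the original was already valid
candidate_names == fixed
```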

Dealing with NA

Imputation: replacing NAs with the mean value

Removal: removing the NAs completely

Reflection: when comparing the mean of the imputed vector with the mean of the vector with NAs removed, I didn’t expect them to match at first, since imputing adds data. But every added value equals the mean, so the mean doesn’t change.

Useful tip: na.omit() removes rows with any NA values
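The reflection above can be checked on a tiny made-up vector. The sketch also shows something that does change with mean imputation: the spread.

```r
x <- c(5, NA, 3, NA, 1, 4, 2)

removed <- x[!is.na(x)]                           # drop the NAs
imputed <- x
imputed[is.na(imputed)] <- mean(x, na.rm = TRUE)  # fill NAs with the mean (3)

mean(removed)  # 3
mean(imputed)  # 3, unchanged because every filled-in value equals the mean

# The spread does change: the extra values sit exactly on the mean
sd(removed)
sd(imputed)    # smaller than sd(removed)
```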

Importing files

Excel files can have multiple sheets; to read a specific one, pass the sheet argument: read_excel("~album.xls", sheet = "sheetname")

Setting Seed

# you set a seed so you can create reproducible results; the number itself is not important at this point
set.seed(123) 
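A quick way to see what the seed does is to draw twice with the same seed and compare. The names `first_draw` and `second_draw` are made up for the example.

```r
set.seed(123)
first_draw <- sample(1:10)

set.seed(123)              # same seed again
second_draw <- sample(1:10)

identical(first_draw, second_draw)  # TRUE: same seed, same "random" order
```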

qmd

  • code blocks will run, and the output appears right below them
  • code block ticks ``` cannot have spaces in front of them, or it renders incorrectly
  • press Enter after bolded text if you want the following text to appear below it
  • to suppress messages for a specific chunk, use {r message=FALSE}
  • a space is needed after the - for a bullet point to show up

helpful tips

  • you can double click the data in environment pane to have the table open in a tab
  • you can assign at the beginning or the end of a line of code
name_of_dataframe_or_vector <- sample(1:10, 5, replace=TRUE)
sample(1:10, 5, replace=TRUE) -> other_name_of_dataframe_or_vector
  • adding a pipe %>% or |> makes the code a pipeline
  • can add unique() at the end of a pipeline; no argument is needed since the pipe passes the data through
  • can see a list of the datasets in R and in the loaded packages with the function data()
  • to see a specific number of rows of a dataframe outside a pipeline, use head(dataframe, numberofrows)

Troubleshooting table

Error | Solution 1 | Solution 2
Error: object not found | didn’t run the code that creates the object |
Error: unexpected symbol in “1name” | object 1name can’t start with a number |
Error: ! could not find function view | View needs to be capitalized |
Error: unexpected symbol (in a nested if else) | likely forgot a comma after the yes assignment |
Error: argument “no” is missing, with no default | the if else statement doesn’t have a no condition set |
Error: Caused by error in ifelse():! unused argument (0) | there is a comma inside the yes or no value (for example 100,000) and R is reading the comma | remove it; numbers can’t have commas
Error: could not find function “str_detect” | stringr isn’t loaded |
Error in parse():! :2:14: unexpected numeric constant | I have a period where a comma should be |