#install.packages(c("readr","readxl","haven","rgl","dslabs","ggplot2","stringr","dplyr"))
# can also be done individually: install.packages("readr")
Intro to R Help File
Intro
A go-to list of things learned and helpful commands. An ever-evolving cheatsheet on how to get R to do what I want it to do in RStudio.
Table of contents
Computer setup
- Download R
- Download RStudio
Libraries Used
You need to install the packages first if it’s the first time you use them.
Then load the libraries from those packages each time RStudio is opened.
library(readr)
library(readxl)
library(haven)
library(rgl)
library(dslabs)
library(dplyr)
library(ggplot2)
library(stringr)
You can confirm loading worked by seeing a checkmark next to the package in the Packages tab. One line per library is advised for readability.
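The install-then-load routine above can also be checked in code. This is a minimal sketch, assuming the same eight packages as the install line at the top; the names `needed`, `is_installed` and `missing_pkgs` are made up for illustration. It only reports what is missing, and the install call is left commented out.

```r
# Packages this cheatsheet relies on (same list as the install line at the top)
needed <- c("readr", "readxl", "haven", "rgl", "dslabs", "ggplot2", "stringr", "dplyr")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of packages that still need install.packages()
missing_pkgs <- needed[!is_installed]
print(missing_pkgs)

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

An empty result, `character(0)`, means everything is already installed.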
readr
For reading CSV files with read_csv()
readxl
To read Excel files with read_xlsx() or read_excel()
haven
To read SPSS sav files with read_sav()
rgl
To read 3D object (.obj) files with readOBJ()
dslabs
Set of practice datasets
- gapminder: Health and income outcomes for 184 countries from 1960 to 2016.
- movielens: Movies and their ratings
- heights: Self-reported Heights in Inches
- murders: US gun murders by state for 2010
- olive: Italian olive
- us_contagious_diseases: Contagious disease data for US states
stringr
# str_detect() checks whether a string contains the pattern you set (case matters unless you say otherwise)
# for the mtcars dataframe, filter the rows by looking at the row names and keeping the ones with Mazda
mtcars |>
  filter(str_detect(row.names(mtcars), "Mazda"))
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
# for the murders dataframe, filter the rows to regions that contain the whole word South,
# not a longer word that merely starts with it, like Southampton
murders |>
  filter(str_detect(region, regex("\\bSouth\\b")))
state abb region population total
1 Alabama AL South 4779736 135
2 Arkansas AR South 2915918 93
3 Delaware DE South 897934 38
4 District of Columbia DC South 601723 99
5 Florida FL South 19687653 669
6 Georgia GA South 9920000 376
7 Kentucky KY South 4339367 116
8 Louisiana LA South 4533372 351
9 Maryland MD South 5773552 293
10 Mississippi MS South 2967297 120
11 North Carolina NC South 9535483 286
12 Oklahoma OK South 3751351 111
13 South Carolina SC South 4625364 207
14 Tennessee TN South 6346105 219
15 Texas TX South 25145561 805
16 Virginia VA South 8001024 250
17 West Virginia WV South 1852994 27
dplyr
- Part of the tidyverse group of packages. Loading dplyr also makes the %>% pipe available, because dplyr re-exports it from the magrittr package. The |> pipe is built into R itself (version 4.1 and later).
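To see what piping changes, here is a small base R sketch (no packages needed, R 4.1 or later for `|>`). The vector `values` is made up for illustration. Both versions do the same three steps; the piped one reads left to right.

```r
values <- c(16, 4, 9, 25)

# Nested calls read inside-out: sort first, then sqrt, then round
nested <- round(sqrt(sort(values)), 1)

# The native pipe passes the left-hand result as the first argument of the next call
piped <- values |> sort() |> sqrt() |> round(1)

identical(nested, piped)
```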
#piping passes information forward, makes multiple steps easier to read and do
#piping can be written as %>% or |>
#filter keeps rows that match a condition
mtcars |>
  filter(cyl == 4)
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
#mutate helps you create, modify, and delete columns
#mutate to create a column
mtcars |>
  mutate(price = "unknown") |>
  head(2)
mpg cyl disp hp drat wt qsec vs am gear carb price
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 unknown
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 unknown
# mutate to modify a column (columns are referred to by name inside mutate, no mtcars$ needed)
mtcars |>
  mutate(price = ifelse(cyl > 6, yes = "high", no = "low")) |>
  head(2)
mpg cyl disp hp drat wt qsec vs am gear carb price
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 low
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 low
# mutate to delete a column
mtcars |>
  mutate(vs = NULL, drat = NULL) |>
  head(2)
mpg cyl disp hp wt qsec am gear carb
Mazda RX4 21 6 160 110 2.620 16.46 1 4 4
Mazda RX4 Wag 21 6 160 110 2.875 17.02 1 4 4
# group_by() creates a grouped table, so later operations are performed per group
mtcars |>
  group_by(cyl) |>
  head(2)
# A tibble: 2 × 11
# Groups: cyl [1]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
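Grouping only shows its effect once a later step works per group. A minimal sketch, assuming dplyr is installed; the name `mpg_by_cyl` is made up for illustration. It pairs group_by() with summarize() to get one row per cylinder count.

```r
library(dplyr)

# One summary row per cylinder group
mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n())

mpg_by_cyl
```

The result has three rows because cyl takes the values 4, 6 and 8.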
#arrange orders the rows of a data frame
mtcars |>
  arrange(desc(cyl)) |>
  head(2)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
ggplot2
ggplot2 is part of the tidyverse collection of packages. documentation link
#start with piping data into ggplot
# add aesthetics
## what data to use
## how to color
# define geometry type of graph
iris |>
  ggplot(aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()
iris |>
  ggplot(aes(Petal.Length)) +
  geom_histogram(fill = "steelblue",
                 binwidth = 0.5)
iris |>
  ggplot(aes(Petal.Length, fill = Species)) +
  # position = "identity" lets bars overlap; alpha controls transparency.
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 0.5) +
  # can add themes to adjust several visuals at once
  ggplot2::theme_minimal()
aesthetics
Aesthetics can be set in ggplot() or in the plotting layer, i.e. ggplot(aes()) or geom_point(aes())
- x, y - position (required)
- alpha - transparency
- color - point color
- fill - fill color
- shape - point shape, 0-25 (16 is a filled circle)
- size - point size
- stroke - border thickness
geom_histogram() understands these aesthetics and settings
- x - position (required)
- fill - fill color
- color - border color
- binwidth - width of each bin (when it is not set, ggplot2 uses 30 bins)
- position - position adjustment, e.g. identity, dodge, fill
- alpha - transparency
- size - border thickness
- weight - weight for each observation (optional)
- linetype - line type for borders (optional)
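Where an aesthetic is placed changes what it affects. The sketch below assumes ggplot2 is installed and uses the built-in iris data; the plot names `p_mapped` and `p_layer` are made up for illustration. Inside aes() a value is mapped to a column, outside aes() it is a fixed setting.

```r
library(ggplot2)

# Mapped in ggplot(): color varies with Species and is inherited by every layer
p_mapped <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point(alpha = 0.7, size = 2)   # alpha and size are fixed settings

# Mapped inside the geom: the shape mapping applies to this layer only
p_layer <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_point(aes(shape = Species), color = "steelblue")

p_mapped
p_layer
```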
plot examples/types
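- geom_point() - scatter plot; draws one point per row, ideal for two numeric variables.
- geom_histogram() - histogram; bars show how many values of one numeric variable fall in each bin.
- geom_smooth() - adds a smoothed trend line (set method = "lm" for a straight regression line).
- geom_boxplot() - box plot; shows the median, quartiles, and outliers for each group.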
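The two geoms in the list that have no example elsewhere in this file can be sketched as follows, assuming ggplot2 is installed. The plot names `p_box` and `p_smooth` are made up for illustration, and method = "lm" is one choice among several smoothing methods.

```r
library(ggplot2)

# Box plot: one box per species, showing median, quartiles, and outliers
p_box <- ggplot(iris, aes(Species, Petal.Length)) +
  geom_boxplot()

# Scatter plot with a straight regression line on top
p_smooth <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)   # se = FALSE hides the confidence band

p_box
p_smooth
```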
Importing data
# loads saved objects to the Environment
# load("filename_Workspace.RData")
You can also import data through the RStudio GUI, but packages are needed to make it possible. Refer to the packages section to determine which.
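Importing can also be done entirely in code. This is a minimal sketch, assuming readr 2.0 or later is installed (for `show_col_types`); it writes a built-in dataset to a temporary file first so there is something to read. The names `csv_path` and `cars_from_csv` are made up for illustration.

```r
library(readr)

# Write a built-in dataset to a temporary CSV so the example has a file to read
csv_path <- tempfile(fileext = ".csv")
write_csv(mtcars, csv_path)

# read_csv() returns a tibble; show_col_types = FALSE hides the column summary message
cars_from_csv <- read_csv(csv_path, show_col_types = FALSE)

dim(cars_from_csv)
```

With a real file, replace `csv_path` with the path to your own CSV.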
CSV
- environment pane
- import dataset
- from text (provided by readr)
- find and select csv file
The console will output the code needed; you can use this syntax or continue using the GUI.
# dataFrameNameYouWant <- read_csv("csv_fileName.csv")
Excel (xlsx)
- environment pane
- import dataset
- from excel
- find and select the excel file
The console will output the code needed; you can use this syntax or continue using the GUI.
# dataFrameNameYouWant <- read_excel("excellFileName.xls")
SPSS sav file
- environment pane
- import dataset
- from spss
- find and select the sav file
The console will output the code needed; you can use this syntax or continue using the GUI.
# datFrameNameYouWant <- read_sav("file_path/andfilename.sav")
3-d object file
To import a 3D mesh you need the full file path, which can be found two ways:
# file.choose()
or
- files pane
- navigate to file
- click the triangle and copy the absolute path
Use that path to import 3D meshes
human_skull <- readOBJ("filepath_found_above.obj")
Once the stop sign in the bottom pane goes away and the data is imported, you can view the 3D object with
open3d(); shade3d(human_skull, col = "white"); bg3d("black")
Cleaning data
dealing with NA
set.seed(3) # setting seed for reproducibility
Ninety_Nine_Red_NAs <- sample(c(1:5, rep(NA, 99))) # creating a vector with 1-5 and 99 NAs in random order
print(Ninety_Nine_Red_NAs)
[1] 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA 4 NA NA NA NA NA NA 2 NA NA NA NA NA NA
[76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA
anyNA(Ninety_Nine_Red_NAs) # detect NAs: TRUE means one or more NA exists, FALSE means none exist
[1] TRUE
which(is.na(Ninety_Nine_Red_NAs)) # index positions where NA exists
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[20] 21 22 23 24 25 26 27 28 29 30 31 32 34 35 36 37 38 39 40
[39] 41 42 43 44 45 46 48 49 50 51 52 53 54 55 56 57 58 59 60
[58] 61 63 64 65 66 67 68 70 71 72 73 74 75 76 77 78 79 80 81
[77] 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
[96] 101 102 103 104
which(!is.na(Ninety_Nine_Red_NAs)) # index positions where NA doesn't exist
[1] 1 33 47 62 69
Total_Missing_99NA <- sum(is.na(Ninety_Nine_Red_NAs)) # count how many NAs exist
print(Total_Missing_99NA)
[1] 99
Prop_Missing_99NA <- mean(is.na(Ninety_Nine_Red_NAs)) # proportion of the values that are NA (99 out of 104)
print(Prop_Missing_99NA)
[1] 0.9519231
vec_removed <- Ninety_Nine_Red_NAs[!is.na(Ninety_Nine_Red_NAs)] # remove NAs, just leave data
print(vec_removed)
[1] 5 3 1 4 2
vec_imputed <- Ninety_Nine_Red_NAs # make a copy to preserve the original
vec_imputed[is.na(vec_imputed)] <- mean(vec_imputed, na.rm = TRUE) # put the mean in place of NAs
print(vec_imputed)
[1] 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[38] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 2 3 3 3 3 3
[75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
You can choose to remove the NAs or replace them with the mean of the data that does exist.
Saving your work
You’ll want to save your .R file, which contains the code, and an RData file, which contains the objects. To save the .R file, use the keyboard shortcut, the floppy disk icon, or File > Save; saving as you go is advised. To save the RData file, it’s easiest to run the following command in the console. So far it’s advised to put a Workspace suffix on the file name.
save.image("filename_Workspace.RData")
This will save the file in the current working directory. You can find out which directory is set with
# getwd()
If it’s not the location you want, you can set the directory with
# setwd("file/path")
You can also set the working directory in RStudio in the Files pane, often at the bottom right:
- files tab
- navigate to desired folder
- hit gear icon
- select set working directory (the console will output setwd(“filepath/you/chose”))
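The save-and-restore cycle can be tried without touching your working directory. This is a minimal sketch: the objects `scores` and `labels` are made up, and it uses save() on named objects rather than save.image(), which would store everything in the environment.

```r
# Hypothetical objects standing in for a real workspace
scores <- c(90, 85, 77)
labels <- c("a", "b", "c")

# tempfile() keeps the example from writing into your working directory
rdata_path <- tempfile(fileext = ".RData")

# save() stores only the named objects; save.image() would store everything
save(scores, labels, file = rdata_path)

# Remove the objects, then restore them from the file
rm(scores, labels)
restored <- load(rdata_path)   # load() returns the names it restored
print(restored)
```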
Viewing Data
food <- c("apple", "banana", "dark chocolate") # this is a vector of characters
class(food) # tells you what type of data the vector is
[1] "character"
rm(list = ls()) # removes all objects from the environment
# can also be done by clicking the broom icon (🧹) in environment tab and console job
# mtcars is a dataframe, head shows the first six rows of the dataframe.
# I use this to get an idea of what data is there, mainly column names
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# can also set how many rows to see rather than 6
mtcars |> head(2)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
# dim() is base R and shows the row and column count of a dataframe
# I use this to get an idea of how large the dataframe is
dim(mtcars)
[1] 32 11
# number of columns, if I need to know the column count only
ncol(mtcars)
[1] 11
# number of rows, if I need to know the row count only
nrow(mtcars)
[1] 32
# opens the dataframe in another tab so you can look at the table;
# used to understand the data and plan next steps when I've never seen it before
#View(mtcars)
# summary of each column's data in the dataframe,
# used if I need to make a decision based on the range of information or to have
# a high level understanding
summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
# prints the dataframe in the console, or on the page in a qmd, for a quick view of the
# information; I typically don't use it on dataframes, only on vectors
print(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# summarize() collapses the data into one row holding the summaries you define
mtcars |> summarize(mean_mpg = mean(mpg))
mean_mpg
1 20.09062
# summarize can compute multiple summaries in one call, separated by commas
mtcars |> summarize(mean_mpg = mean(mpg),
                    sd_mpg = sd(mpg), # standard deviation
                    n = n()) # number of rows
mean_mpg sd_mpg n
1 20.09062 6.026948 32
Rows
# how many rows in dataframe
nrow(mtcars)
[1] 32
#filter keeps ROWS that match a condition
filter(mtcars, mpg >= 25)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# keep rows that match a condition and store them in a new dataframe
more_25 <- mtcars |> filter(mpg >= 25)
print(more_25)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Columns
# how many columns in dataframe
ncol(mtcars)
[1] 11
#select/show cyl column data from mtcars dataframe
mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
# keep all rows but remove columns 8 and 9 (the negative vector lists the columns to drop)
mtcars[,-c(8,9)]
mpg cyl disp hp drat wt qsec gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 4 2
# show only specific columns of a dataframe
# mtcars |> select(mpg, cyl, hp) # same result, written with a pipe
select(mtcars, mpg, cyl, hp) # select() is from dplyr
mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
Duster 360 14.3 8 245
Merc 240D 24.4 4 62
Merc 230 22.8 4 95
Merc 280 19.2 6 123
Merc 280C 17.8 6 123
Merc 450SE 16.4 8 180
Merc 450SL 17.3 8 180
Merc 450SLC 15.2 8 180
Cadillac Fleetwood 10.4 8 205
Lincoln Continental 10.4 8 215
Chrysler Imperial 14.7 8 230
Fiat 128 32.4 4 66
Honda Civic 30.4 4 52
Toyota Corolla 33.9 4 65
Toyota Corona 21.5 4 97
Dodge Challenger 15.5 8 150
AMC Javelin 15.2 8 150
Camaro Z28 13.3 8 245
Pontiac Firebird 19.2 8 175
Fiat X1-9 27.3 4 66
Porsche 914-2 26.0 4 91
Lotus Europa 30.4 4 113
Ford Pantera L 15.8 8 264
Ferrari Dino 19.7 6 175
Maserati Bora 15.0 8 335
Volvo 142E 21.4 4 109
# mutating data in a column, for example changing number values to low, medium, or high
quake_recode <- quakes |>
  mutate(mag = ifelse(mag <= 4.5, "low",
                      ifelse(mag <= 5.5, "medium", "high")))
head(quake_recode)
lat long depth mag stations
1 -20.42 181.62 562 medium 41
2 -20.62 181.03 650 low 15
3 -26.00 184.10 42 medium 43
4 -17.97 181.66 626 low 19
5 -20.42 181.96 649 low 11
6 -19.68 184.31 195 low 12
# it's best to add a new column with the mutated data, however, so the original values are kept
quake_recode_new <- quakes |>
  mutate(mag_level = ifelse(mag <= 4.5, "low",
                            ifelse(mag <= 5.5, "medium", "high")))
head(quake_recode_new)
lat long depth mag stations mag_level
1 -20.42 181.62 562 4.8 41 medium
2 -20.62 181.03 650 4.2 15 low
3 -26.00 184.10 42 5.4 43 medium
4 -17.97 181.66 626 4.1 19 low
5 -20.42 181.96 649 4.0 11 low
6 -19.68 184.31 195 4.0 12 low
#get names of columns
colnames(quakes)
[1] "lat" "long" "depth" "mag" "stations"
#change name of column
colnames(quake_recode_new)[colnames(quake_recode_new) == "mag_level"] <- "scared_level"
head(quake_recode_new, 2)
lat long depth mag stations scared_level
1 -20.42 181.62 562 4.8 41 medium
2 -20.62 181.03 650 4.2 15 low
# adding a column is covered in the dplyr section
Combinations
Combination in this case means affecting both columns and rows.
# Scenario: Pick cars that weigh < 3,000 lb, have hp between 110 and 180, and
# cylinders IN {4,6}.
light_midpower <- mtcars[mtcars$wt < 3 & # weight is in 1000 lb
                         mtcars$hp >= 110 &
                         mtcars$hp <= 180 &
                         mtcars$cyl %in% c(4, 6), ]
Built in datasets
- mtcars: Motor Trend Car Road Tests
- CO2: Carbon Dioxide Uptake in Grass Plants
- iris: Edgar Anderson’s Iris Data
- quakes: Locations of Earthquakes off Fiji
Data wrangling
# setting seed for reproducibility
set.seed(12)
# sample() chooses randomly based on the arguments passed in
# 1:10 is 1,2,3,4,5,6,7,8,9,10
# I didn't specify how many I needed, so it returned every number in random order
example_vec <- sample(1:10)
example_vec # so it prints
[1] 2 7 3 6 5 9 4 10 8 1
# sum adds it all up
sum(example_vec)
[1] 55
# find the mean: add every number, then divide by how many numbers there are
mean(example_vec)
[1] 5.5
# str() tells me what type of data is in the vector
str(example_vec)
int [1:10] 2 7 3 6 5 9 4 10 8 1
# standard deviation: how much the values vary around their mean
sd(example_vec)
[1] 3.02765
# middle value
median(example_vec)
[1] 5.5
# seq() on a vector returns its index positions, 1 to length (like seq_along()); it does not sort
# use sort(example_vec) to put the values in order
seq(example_vec)
[1] 1 2 3 4 5 6 7 8 9 10
# remove the first item in the vector
example_vec[-1]
[1] 7 3 6 5 9 4 10 8 1
#create vector of negative numbers, to be used as indexes
list_remove <- c(-1,-3)
#remove by specific index
example_vec[list_remove]
[1] 7 6 5 9 4 10 8 1
# sapply() applies a function to each element of a list or vector
# in this case counting rows in each dataframe
# output as vector
sapply(list(mtcars, murders, heights), nrow)
[1] 32 51 1050
# lapply() applies a function to each element of a list or vector
# output as list
lapply(list(mtcars, murders, heights), nrow)
[[1]]
[1] 32
[[2]]
[1] 51
[[3]]
[1] 1050
# find unique instances of the data, in this case which
# diseases are in the disease column of the us_contagious_diseases dataframe
unique(us_contagious_diseases$disease)
[1] Hepatitis A Measles Mumps Pertussis Polio Rubella
[7] Smallpox
Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
If else
ifelse(test = c(1, 2, 3, 4, 5) > 3, yes = "Big", no = "Small") # like a ternary in JavaScript
[1] "Small" "Small" "Small" "Big" "Big"
object <- c("Yellow", "banana", "apple", "red")
ifelse(object == "Yellow", yes = "color", no = "not a color") # recodes each value to yes where the test is TRUE and to no where it is FALSE
[1] "color" "not a color" "not a color" "not a color"
#multi nested
object2 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
ifelse(object2 == 1, "One",
       ifelse(object2 == 2, "Two",
              ifelse(object2 == 3, "Three",
                     ifelse(object2 == 4, "Four", "Five"))))
[1] "One" "One" "Two" "Two" "Three" "Three" "Four" "Four" "Five"
[10] "Five"
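Deeply nested ifelse() calls get hard to read. One alternative, not used elsewhere in this file, is dplyr's case_when(). The sketch below assumes dplyr is installed and recodes the same numbers 1 to 5; the name `number_words` is made up for illustration.

```r
library(dplyr)

object2 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)

# Each line is "condition ~ value"; the first condition that is TRUE wins
number_words <- case_when(
  object2 == 1 ~ "One",
  object2 == 2 ~ "Two",
  object2 == 3 ~ "Three",
  object2 == 4 ~ "Four",
  TRUE         ~ "Five"   # fallback, like the final "no" in nested ifelse()
)
number_words
```

The result matches the nested ifelse() version, with one line per condition instead of one level of nesting per condition.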
Getting time information
Sys.Date() # current date (yyyy-mm-dd)
[1] "2025-06-02"
Sys.time() # current date + time
[1] "2025-06-02 01:17:42 MDT"
Sys.timezone() # your local timezone string
[1] "America/Boise"
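Once you have a date, base R can format it and do arithmetic on it. This is a small sketch using a fixed, made-up date `d` instead of Sys.Date(), so the results do not change from day to day.

```r
# A fixed date so the results are the same every time (Sys.Date() changes daily)
d <- as.Date("2025-06-02")

format(d, "%d/%m/%Y")   # day/month/year as text
format(d, "%Y")         # just the year
d + 30                  # dates support arithmetic in days
as.numeric(as.Date("2025-12-25") - d)   # days between two dates
```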
Base R plotting
# roll a die 100 times; the die has six sides, and replace = TRUE means a rolled number can be rolled again
dice <- sample(1:6, 100, replace = TRUE)
# show a histogram (bars of counts)
hist(dice,
     breaks = seq(0.5, 6.5, by = 1), # one bin per face, in order
     col = "lightblue", # color of the bars
     main = "100 Fair Die Rolls", # title of the plot
     xlab = "Face Value") # x axis label
# breaks can also be set to an algorithm name, e.g. "Scott" or "Sturges"
Creating a function
# creating a vector of words to look for
vector_to_loop_through <- c("trust","ethics")
vector_to_look_through <- c("Sentences from a paper, where you're trying to see if they talk about any key words you're looking for", "in this example looking for a sentence with the word trust, pulling the whole sentence to grab context" )
# for every word in vector_to_loop_through, go through vector_to_look_through and use str_detect()
# to find which sentences contain it as a whole word; case doesn't matter. The result is a list.
Index <- lapply(vector_to_loop_through, function(w) {
  which(str_detect(vector_to_look_through, regex(paste0("\\b", w, "\\b"),
                                                 ignore_case = TRUE)))
})
Create dataframe
# set seed for reproducibility
set.seed(8)
# 30 is the number of times "Cereal" is repeated.
# 12 is the number of times "Marshmallow" is repeated.
# sample() shuffles them all because size is not defined.
Type <- sample(c(rep("Cereal", 30), rep("Marshmallow", 12)))
# ifelse(test, yes, no): where Type is "Cereal" the shape is "Plain";
# otherwise the shape comes from a random sample of the marshmallow shapes
# (sample's x is c("Heart", ...) and its size is length(Type))
Shape <- ifelse(Type == "Cereal", "Plain",
                sample(c("Heart", "Star", "Horseshoe", "Clover",
                         "Blue_Moon", "Pot_of_Gold", "Rainbow",
                         "Red_Balloon"), length(Type), replace = TRUE))
# create a dataframe called snack_box with the Type and Shape vectors
snack_box <- data.frame(Type, Shape)
head(snack_box, 2)
Type Shape
1 Marshmallow Rainbow
2 Marshmallow Horseshoe
Troubleshooting
Getting help
help(sample)
?sample
# both open the Help pane, in this case with information about the sample function
Naming
Use this_style (snake 🐍) or thisStyle (camel 🐪). Names with a space, like "this style", are not advised for CSV and other data files, or for vector and dataset names.
Object rules
- names have to start with a letter, or with a . that is not followed by a number
- can contain letters, numbers, _ and . but no spaces
- are case‑sensitive (Data vs data)
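These rules can be checked in code with base R's make.names(), which rewrites an invalid name into a valid one. The helper `is_valid_name()` and the `candidates` vector are made up for illustration: a name counts as valid when make.names() leaves it unchanged.

```r
# A name is valid when make.names() leaves it unchanged
is_valid_name <- function(x) x == make.names(x)

candidates <- c("this_style", "thisStyle", "this style", "1name", ".hidden", ".2x")
data.frame(name = candidates, valid = is_valid_name(candidates))
```

Here "this style", "1name" and ".2x" come back as not valid, matching the rules above.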
Dealing with NA
imputed - replacing NAs with the mean value
removal - removing the NAs completely
Reflection: when I took the mean of the imputed vector and of the vector with NAs removed, I didn’t expect the two means to be the same at first, since imputing adds data in. But every added value equals the existing mean, so the mean doesn’t change.
Useful tip: na.omit() removes rows with any NA values
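A small worked example of the reflection and the tip above, using made-up data (`x` and `df` are hypothetical). Filling NAs with the mean leaves the mean unchanged but shrinks the spread, and na.omit() drops whole rows from a dataframe.

```r
x <- c(5, NA, 3, NA, 1, 4, 2)

removed <- x[!is.na(x)]                            # drop the NAs
imputed <- x
imputed[is.na(imputed)] <- mean(x, na.rm = TRUE)   # fill NAs with the mean

mean(removed) == mean(imputed)   # the mean is the same either way
sd(removed) > sd(imputed)        # imputing with the mean shrinks the spread

# na.omit() on a dataframe drops every row that has an NA in any column
df <- data.frame(id = 1:4, score = c(10, NA, 30, 40), grade = c("a", "b", NA, "d"))
complete_rows <- na.omit(df)
complete_rows
```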
Importing files
Excel files can have multiple sheets, so you have to name the sheet in the call: read_excel(“~album.xls”, sheet = “sheetname”)
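If you don't know the sheet names, readxl can list them. This sketch assumes readxl is installed and uses the example workbook that ships with the package, so no file of your own is needed; `xlsx_path` and `cars_sheet` are made-up names.

```r
library(readxl)

# readxl_example() returns the path to a workbook that ships with readxl
xlsx_path <- readxl_example("datasets.xlsx")

excel_sheets(xlsx_path)   # list the sheet names in the workbook

# Read one sheet by name (a sheet number also works)
cars_sheet <- read_excel(xlsx_path, sheet = "mtcars")
dim(cars_sheet)
```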
Setting Seed
# you set a seed so you can create reproducible results; which number you pick is not important
set.seed(123)
qmd
- code blocks will run, and answers will appear right below them
- code block ticks ``` cannot have spaces in front of them, or it renders incorrectly
- press enter after bolded text if you want the text to start below the bolded item
- suppress messages for a specific chunk with {r message=FALSE}
- a space is needed after the - for a bullet point to show up
helpful tips
- you can double click the data in environment pane to have the table open in a tab
- you can assign at the beginning or end of the code
name_of_dataframe_or_vector <- sample(1:10, 5, replace=TRUE)
sample(1:10, 5, replace=TRUE) -> other_name_of_dataframe_or_vector
- adding a pipe %>% or |> makes the code a pipeline
- can add unique() at the end of a pipeline, no argument needed since the pipe passes the data through
- can see the list of pre-loaded datasets in R and in loaded packages with the function data()
- to see a specific number of rows of a dataframe outside a pipeline, use head(dataframe, numberofrows)
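Two of the tips above as a short base R sketch (R 4.1 or later for `|>`); the names `cyl_values` and `first_three` are made up for illustration.

```r
# unique() needs no argument here because the pipe supplies the values
cyl_values <- mtcars$cyl |> unique() |> sort()
cyl_values

# head() outside a pipeline: the dataframe first, then the number of rows
first_three <- head(mtcars, 3)
nrow(first_three)
```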
Troubleshooting table
| Error | Solution 1 | Solution 2 |
|---|---|---|
| Error: object not found | didn’t run the code that creates the object | |
| Error: unexpected symbol in “1name” | object name 1name can’t start with a number | |
| Error: ! could not find function view | View needs to be capitalized | |
| Error: unexpected symbol (in a nested if else) | likely forgot a comma after the yes assignment | |
| Error: argument “no” is missing, with no default | the ifelse statement doesn’t have a no value set | |
| Error: Caused by error in ifelse():! unused argument (0) | there is a comma inside the yes or no value, for example 100,000; R reads the comma as an argument separator, so remove it, since numbers can’t contain commas | |
| Error: could not find function “str_detect” | stringr isn’t loaded | |
| Error in parse():! | I have a period where a comma should be | |
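To look at an error message without stopping a script, you can catch it with tryCatch(). This is a sketch: `get_error()` is a made-up helper, and the object and function names are deliberately ones that do not exist. The exact wording of the messages depends on your R version and language settings.

```r
# Capture an error message as text instead of stopping the script
get_error <- function(expr) {
  tryCatch({ expr; "no error" }, error = function(e) conditionMessage(e))
}

# Using an object that was never created
msg_object <- get_error(no_such_object_xyz + 1)

# Calling a function that does not exist (or whose package is not loaded)
msg_function <- get_error(no_such_function_xyz(1))

print(msg_object)
print(msg_function)
```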