This is an R Markdown document. Please fill in all example text with your own words and definitions. For each function in your glossary, you should show your code. If you need extra code to do some data wrangling or bring in a new library, please hide it from the rendered html. Each term should have:
summary, str, and glimpse
Package: base R (summary, str); dplyr (glimpse)
Definition: summary shows the variables neatly arranged in a table, with length, class, and mode for character columns and quartiles, mean, and NA count for numeric columns, but not the data themselves. str shows the data structure: each variable's type, its number of entries (for msleep it is 83), and the first few values. glimpse reports the number of rows and columns once at the top, then lists each variable's type followed by as many values as fit on a line.
Code example:
summary(msleep)
## name genus vore order
## Length:83 Length:83 Length:83 Length:83
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## conservation sleep_total sleep_rem sleep_cycle
## Length:83 Min. : 1.90 Min. :0.100 Min. :0.1167
## Class :character 1st Qu.: 7.85 1st Qu.:0.900 1st Qu.:0.1833
## Mode :character Median :10.10 Median :1.500 Median :0.3333
## Mean :10.43 Mean :1.875 Mean :0.4396
## 3rd Qu.:13.75 3rd Qu.:2.400 3rd Qu.:0.5792
## Max. :19.90 Max. :6.600 Max. :1.5000
## NA's :22 NA's :51
## awake brainwt bodywt
## Min. : 4.10 Min. :0.00014 Min. : 0.005
## 1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
## Median :13.90 Median :0.01240 Median : 1.670
## Mean :13.57 Mean :0.28158 Mean : 166.136
## 3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
## Max. :22.10 Max. :5.71200 Max. :6654.000
## NA's :27
str(msleep)
## tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
## $ genus : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
## $ vore : chr [1:83] "carni" "omni" "herbi" "omni" ...
## $ order : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
## $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
## $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
## $ sleep_rem : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
## $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
## $ awake : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
## $ brainwt : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
## $ bodywt : num [1:83] 50 0.48 1.35 0.019 600 ...
glimpse(msleep)
## Rows: 83
## Columns: 11
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…
Explanation: I used these three commands to compare how each inspection function looks and behaves so that I could write the definitions.
dplyr verbs
select
Package: dplyr
Definition: Picks out certain selected columns and removes the others
Code example:
msleepFiltered <- msleep %>% select(name, genus, vore)
Explanation: This gets rid of all the other columns except for the name, genus and diet of the animals, if we would like to do so for cleaner visuals.
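select also accepts helper functions and negative selections. The sketch below is only an illustration; the object names sleep_cols and no_genus are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# Keep every column whose name begins with "sleep"
sleep_cols <- msleep %>% select(starts_with("sleep"))
names(sleep_cols)

# Drop a single column by prefixing its name with a minus sign
no_genus <- msleep %>% select(-genus)
ncol(no_genus)
```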
arrange
Package: dplyr
Definition: Lets you sort rows of a dataset
Code example:
msleepAZ <- arrange(msleep, order)
Explanation: This sorts all the rows alphabetically by the taxonomic order of the animal.
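To sort from largest to smallest, the sorting column can be wrapped in desc(). This is a small sketch; the name sleepiest is just an illustrative choice.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# Sort the animals from most to least total sleep
sleepiest <- msleep %>% arrange(desc(sleep_total))

# The first rows are now the animals that sleep the most
head(sleepiest$name, 3)
```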
filter
Package: dplyr
Definition: Keeps only the rows of a data set that satisfy a condition
Code example:
carni <- filter(msleep, vore == "carni")
Explanation: This returns only the animals in our data set that are carnivores.
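Several conditions can be given to filter at once, separated by commas, and a row is kept only if all of them are TRUE. The threshold of 12 hours below is an arbitrary value chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# Carnivores that sleep more than 12 hours a day.
# Rows where vore is NA are dropped, because the condition is NA rather than TRUE.
sleepy_carni <- msleep %>% filter(vore == "carni", sleep_total > 12)
sleepy_carni$name
```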
mutate
Package: dplyr
Definition: lets you add new columns based on existing ones
Code example:
msleep %>% mutate(sleep_avg = mean(sleep_total))
## # A tibble: 83 × 12
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## 6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
## 7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
## 8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
## 9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
## 10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
## # … with 73 more rows, and 3 more variables: brainwt <dbl>, bodywt <dbl>,
## # sleep_avg <dbl>
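Explanation: This tells the computer to create a new variable, sleep_avg, that is the mean of all sleep totals, so every row gets the same value. The result is printed with the new column added at the end; msleep itself is not changed unless the result is assigned back to it.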
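mutate is more often used to compute a value that differs from row to row. As a sketch, the hypothetical column rem_share below divides REM sleep by total sleep for each animal.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# Add the fraction of total sleep spent in REM sleep, one value per animal.
# Animals with a missing sleep_rem get NA in the new column.
with_share <- msleep %>% mutate(rem_share = sleep_rem / sleep_total)

with_share %>% select(name, sleep_total, sleep_rem, rem_share) %>% head()
```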
group_by
Package: dplyr
Definition: Groups the rows of a data set by the values of a variable so that later verbs, such as summarize, work within each group
Code example:
orders <- msleep %>% group_by(order) %>% summarize(n())
Explanation: This tells R to group the animals by taxonomic order. group_by does not visibly change the data on its own and does not sort anything; it only marks the groups. The summarize function is needed to turn each group into a summary value, here the number of animals in it.
summarize
Package: dplyr
Definition: When used with group_by, can summarize a desired category with a single value.
Code example:
orders <- msleep %>% group_by(order) %>% summarize(n())
Explanation: msleep tells the computer what data set to use. group_by tells it to group the rows by order, and summarize(n()) collapses each group to a single value, here the number of animals in each order.
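summarize can return several named summaries at once. The sketch below groups by diet instead of order; the column names n, mean_sleep, and mean_rem are illustrative choices.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

sleep_by_diet <- msleep %>%
  group_by(vore) %>%
  summarize(
    n = n(),                                 # number of animals in the group
    mean_sleep = mean(sleep_total),          # sleep_total has no missing values
    mean_rem = mean(sleep_rem, na.rm = TRUE) # skip the NAs in sleep_rem
  )

sleep_by_diet
```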
drop_na
Package: tidyr
Definition: Removes rows containing NA either from the entire data frame or a specific column.
Code example:
dropna <- drop_na(msleep, conservation)
Explanation: This drops all the animals whose conservation status is listed as “NA” from the dataset.
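Comparing row counts shows how many rows drop_na removes. This is a sketch for checking the effect, not part of the original example.

```r
library(tidyr)
library(ggplot2)  # provides the msleep data set

nrow(msleep)                         # all 83 animals
nrow(drop_na(msleep, conservation))  # only animals with a known conservation status
nrow(drop_na(msleep))                # only animals with no NA in any column
```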
ifelse
Package: base R
Definition: returns a value based on whether the conditions of the test expression are met
Code example:
is_carni <- ifelse(msleep$vore == "carni", "true", "false")
Explanation: For each animal, the resulting vector holds “true” if the animal is a carnivore and “false” otherwise. Animals whose vore is NA get NA.
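Because a comparison with NA is itself NA, ifelse passes missing values through. One possible way to label them explicitly is to nest a second ifelse, as sketched here; the labels "carnivore", "other", and "unknown" are made up for the example.

```r
library(ggplot2)  # provides the msleep data set

# NA in vore produces NA in the result
diet_label <- ifelse(msleep$vore == "carni", "carnivore", "other")
sum(is.na(diet_label))

# Test for NA first, then apply the original condition
diet_label2 <- ifelse(is.na(msleep$vore), "unknown",
                      ifelse(msleep$vore == "carni", "carnivore", "other"))
table(diet_label2)
```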
rbind
Package: base R
Definition: can be used to combine vectors, matrices and data frames by rows
Code example:
newmatrix <- rbind(msleep$sleep_total, msleep$sleep_rem)
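Explanation: The sleep_total and sleep_rem vectors are stacked as the two rows of a new matrix with 2 rows and 83 columns.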
merge and/or join
Package: base R (merge); dplyr (the join functions)
Definition: combines two data frames by matching rows on the columns they have in common
Code example:
economics_merged <- merge(economics, economics_long, by = "date")
Explanation: The “economics” and “economics_long” data frames are merged into one larger data frame based on their shared “date” column
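The difference between merge and a dplyr join is easier to see on small tables. The two data frames below are toy examples built by hand from a few msleep values; sleep_tbl and diet_tbl are hypothetical names.

```r
library(dplyr)

sleep_tbl <- data.frame(name = c("Cow", "Dog", "Cheetah"),
                        sleep_total = c(4.0, 10.1, 12.1))
diet_tbl  <- data.frame(name = c("Cow", "Dog", "Owl monkey"),
                        vore = c("herbi", "carni", "omni"))

# merge keeps only names found in both tables by default (an inner join)
merged <- merge(sleep_tbl, diet_tbl, by = "name")

# left_join keeps every row of the first table and fills gaps with NA
joined <- left_join(sleep_tbl, diet_tbl, by = "name")

merged
joined
```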
function
Package: base R
Definition: allows the user to create their own operation
Code example:
fun <- function(x)(x+1)
fun(1)
## [1] 2
Explanation: I use the function command to define fun, which adds 1 to whatever it is given. Calling fun(1) performs the operation 1 + 1 and returns 2.
as.date
Package: date
Definition: converts a character string into a date object, which the date package stores as a Julian date (the number of days since January 1, 1960)
Code example:
library(date)
as.date("April 8, 2022")
## [1] 8Apr2022
Explanation: as.date converts our string into Day, Month, Year format by default.
function
Package: base R
Definition: allows the user to create their own operation
Code example:
func <- function(x) (x*x)
func(1:50)
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225
## [16] 256 289 324 361 400 441 484 529 576 625 676 729 784 841 900
## [31] 961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936 2025
## [46] 2116 2209 2304 2401 2500
Explanation: The operation (x * x) is vectorized, so a single call applies it to each of the 50 values from 1 to 50. 1 * 1 equals 1, 2 * 2 equals 4, 3 * 3 equals 9 and so on.
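A function can also take more than one argument and give some of them default values. The function power below is a made-up example, not something from a package.

```r
# exponent defaults to 2, so power(x) squares x unless told otherwise
power <- function(x, exponent = 2) {
  x^exponent
}

power(1:5)               # squares each value
power(2, exponent = 3)   # overrides the default
```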
geom_histogram()
Package: ggplot2
Definition: Form of data visualization used to plot the distribution of a continuous variable.
Code example:
hist <- ggplot(data=msleep,aes(x=sleep_total)) + geom_histogram(bins=20)
Explanation: ggplot() is the ggplot2 function that starts a plot. We are using the “msleep” data set here, and the continuous variable “sleep_total.” bins=20 sets the number of bars (bins) the range of the variable is divided into.
geom_bar()
Package: ggplot2
Definition: Form of data visualization used to graph a categorical variable against a numeric variable
Code example:
bargraph <- ggplot(data=msleep, aes(x=vore, y=sleep_total, fill=vore)) + geom_bar(stat="identity")
Explanation: ggplot() is the ggplot2 function that starts a plot. data=msleep as that is the data set we are using. aes maps variables to the axes. The x-axis needs to be the categorical variable, so here we are going to use the animal’s eating habits as an example, and the y-axis needs to be a numeric variable, so we are going to use sleep time. fill=vore tells R to give each bar a different color. geom_bar specifies that the graph is going to be a bar graph and stat=“identity” tells R to use the y values as the bar heights (the default is stat=“count”, which counts rows). Because each diet contains several animals, their sleep_total values are stacked, so each bar shows the summed sleep time for that diet.
geom_boxplot()
Package: ggplot2
Definition: Plots the distribution of a numeric variable using quartiles
Code example:
boxplot <- ggplot(data=msleep, mapping=aes(x=vore, y=awake, fill=vore))+geom_boxplot()
Explanation: I am using the msleep dataset again. There are five boxplots here (one for each category) showing the distributions of how long each category of mammal is awake. “fill=vore” tells R to give each box a unique color.
geom_point()
Package: ggplot2
Definition: Graphs one (either discrete or continuous) numeric variable against another
Code example:
scatter <- ggplot(data=msleep,aes(x=sleep_cycle,y=sleep_total)) + geom_point()
Explanation: ggplot() is the ggplot2 function that starts a plot. data=msleep indicates we are using the “msleep” dataset. Sleep cycle will be our independent variable and total sleep time the dependent variable. geom_point() indicates the graph will be a scatterplot.
geom_line()
Package: ggplot2
Definition: Graphs a continuous numeric variable over a discrete numeric variable (almost always time)
Code example:
line <- ggplot(data=economics,aes(x=date,y=unemploy)) + geom_line()
Explanation: I used the economics dataset instead of the “msleep” dataset because “economics” has a time variable I can use. The time variable (date) should always be on the x-axis. For the y-axis I used the unemployment statistics. geom_line() specifies that the graph is a line graph.
geom_violin()
Package: ggplot2
Definition: Plots the distribution of a numeric variable using density curves (wider widths for greater frequency and narrower widths for less frequency)
Code example:
violin <- ggplot(data=msleep, mapping=aes(x=vore, y=awake, fill=vore))+geom_violin()
Explanation: A violin plot is essentially the same as a box plot in terms of structure but with different shapes, so nothing, other than the final geom function, had to be changed from my boxplot example.
geom_jitter
Package: ggplot2
Definition: Can add individual data points to your boxplots.
Code example:
jitter <- ggplot(data=msleep, mapping=aes(x=vore, y=awake, fill=vore))+ geom_boxplot() + geom_jitter()
Explanation: The geom_boxplot layer draws the boxes, and geom_jitter then superimposes the individual data points, shifted slightly at random so they do not overlap. Without geom_boxplot the points from geom_jitter would be drawn with no boxes behind them.
rnorm to generate normally distributed data with a particular sample size, mean, and standard deviation.
Package: stats
Definition: Generates random numbers from a normal distribution with a given mean and standard deviation
Code example:
mean(msleep$sleep_total)
## [1] 10.43373
sd(msleep$sleep_total)
## [1] 4.450357
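sim_sleep <- rnorm(83, 10.43373, 4.450357)
Explanation: This draws 83 random values from a normal distribution with the same mean and standard deviation as sleep_total, so the simulated sample is the same size as the msleep data.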
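Since rnorm is random, each run gives different numbers unless a seed is set first. The seed value 42 below is arbitrary, and sim and sim_again are illustrative names.

```r
set.seed(42)  # fix the random number generator so the draw is reproducible
sim <- rnorm(n = 83, mean = 10.43373, sd = 4.450357)

length(sim)  # 83 values, as requested
mean(sim)    # close to, but not exactly, 10.43373
sd(sim)      # close to, but not exactly, 4.450357

# Resetting the same seed reproduces exactly the same draw
set.seed(42)
sim_again <- rnorm(n = 83, mean = 10.43373, sd = 4.450357)
identical(sim, sim_again)
```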
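sample and replicate functions
Package: base R
Definition: sample draws random elements from a vector; replicate repeats an expression a given number of times and collects the results.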
Code example:
permutation <- function(x, nA) {
  idx_a <- sample(1:length(x), nA)
  idx_b <- setdiff(1:length(x), idx_a)
  meandifference <- mean(x[idx_a]) - mean(x[idx_b])
  return(meandifference)
}
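Explanation: The permutation function uses sample to pick nA positions of x at random as group A, puts the remaining positions in group B, and returns the difference between the two group means.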
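One way the permutation function could be combined with replicate is sketched below. The choice of primates versus rodents, the 1000 repetitions, and the seed are all assumptions made for illustration.

```r
library(ggplot2)  # provides the msleep data set

permutation <- function(x, nA) {
  idx_a <- sample(1:length(x), nA)
  idx_b <- setdiff(1:length(x), idx_a)
  mean(x[idx_a]) - mean(x[idx_b])
}

# Total sleep for two orders of mammals
two_orders <- subset(msleep, order %in% c("Primates", "Rodentia"))
x  <- two_orders$sleep_total
nA <- sum(two_orders$order == "Primates")

# Observed difference in mean sleep (primates minus rodents)
observed <- mean(x[two_orders$order == "Primates"]) -
  mean(x[two_orders$order == "Rodentia"])

# Repeat the random relabeling 1000 times to build a null distribution
set.seed(1)
null_diffs <- replicate(1000, permutation(x, nA))

# Share of random differences at least as extreme as the observed one
p_value <- mean(abs(null_diffs) >= abs(observed))
p_value
```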
mean, median, max, min, interquartile range, standard deviation
Package: base R
Definition:
mean: sum of all values divided by n
median: middle value of the sorted data
max: maximum value
min: minimum value
interquartile range: difference between the 25th and 75th percentile of a data set
standard deviation: measurement of how far data points are dispersed from the average value
Code example:
sleep_mean <- mean(msleep$sleep_total)
sleep_median <- median(msleep$sleep_total)
sleep_max <- max(msleep$sleep_total)
sleep_min <- min(msleep$sleep_total)
sleep_iqr <- IQR(msleep$sleep_total)
sleep_sd <- sd(msleep$sleep_total)
Explanation: All the statistical results have been saved as objects. The objects are given names that differ from the function names so the two are not confused.
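These functions return NA when the column contains missing values, which sleep_total does not but brainwt does. A sketch of the usual workaround, na.rm = TRUE:

```r
library(ggplot2)  # provides the msleep data set

mean(msleep$brainwt)                # NA, because brainwt has missing values
mean(msleep$brainwt, na.rm = TRUE)  # mean of the non-missing values only
sd(msleep$brainwt, na.rm = TRUE)
```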
cor.test and cor_test
Package: stats
Definition: measures the correlation between two variables on a scale of -1 to 1 and tests whether it differs from 0
Code example:
cor.test(msleep$bodywt,msleep$brainwt)
##
## Pearson's product-moment correlation
##
## data: msleep$bodywt and msleep$brainwt
## t = 19.176, df = 54, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8891642 0.9608114
## sample estimates:
## cor
## 0.9337822
Explanation: The correlation between the variables of body weight and brain weight is 0.933, pointing to a strong correlation. We would have an extremely low probability of getting this result if the null hypothesis were true, i.e. that the correlation between the variables is 0. Thus we can reject the null hypothesis here.
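The printed output can also be saved and its parts pulled out by name. This is a small sketch; result is an illustrative object name.

```r
library(ggplot2)  # provides the msleep data set

result <- cor.test(msleep$bodywt, msleep$brainwt)

result$estimate  # the correlation coefficient
result$p.value   # the p-value of the test
result$conf.int  # the 95 percent confidence interval

# cor gives the coefficient alone; rows with an NA must be excluded explicitly
cor(msleep$bodywt, msleep$brainwt, use = "complete.obs")
```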
shapiro.test
Package: stats
Definition: tests the null hypothesis that a data set comes from a normal distribution; the W statistic is close to 1 when the data look normal, and a small p-value is evidence against normality
Code example:
shapiro.test(msleep$sleep_total)
##
## Shapiro-Wilk normality test
##
## data: msleep$sleep_total
## W = 0.97973, p-value = 0.2143
shapiro.test(msleep$brainwt)
##
## Shapiro-Wilk normality test
##
## data: msleep$brainwt
## W = 0.30082, p-value = 7.277e-15
Explanation: The “sleep_total” variable is given a W score of 0.98 with a p-value of 0.21, so we fail to reject the null hypothesis that it follows a normal distribution, although it may be slightly skewed. The “brainwt” variable is given a W score of 0.3008 with a p-value of about 7e-15, so we reject normality; the distribution is extremely non-normal. We do not know the exact shape of the distribution from this score.
t.test (one sample)
Package: stats
Definition: Checks whether the mean of a distribution is equal to a hypothesized value (0 by default)
Code example:
t.test(msleep$brainwt)
##
## One Sample t-test
##
## data: msleep$brainwt
## t = 2.1581, df = 55, p-value = 0.03531
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.02009613 0.54306673
## sample estimates:
## mean of x
## 0.2815814
Explanation: The 95 percent confidence interval for the mean of brainwt runs from 0.02 to 0.54, and the sample mean is 0.28. Assuming the null hypothesis (a true mean of 0) is correct, there is a 3.5 percent chance we would get a sample mean at least this far from 0. Whether this is sufficient to reject the null hypothesis depends on the significance level chosen beforehand; at the common 0.05 level we would reject it.
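A hypothesized mean other than 0 can be supplied through the mu argument. The value of 8 hours below is an arbitrary choice for illustration, not a claim about mammals.

```r
library(ggplot2)  # provides the msleep data set

# Test whether the mean of sleep_total differs from 8 hours
one_sample <- t.test(msleep$sleep_total, mu = 8)

one_sample$estimate  # the sample mean
one_sample$p.value   # p-value against the hypothesized mean of 8
one_sample$conf.int  # 95 percent confidence interval for the mean
```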
t.test (two sample)
Package: stats
Definition: Compares the means of two distributions
Code example:
primates <- msleep %>% filter(order=="Primates")
rodents <- msleep %>% filter(order=="Rodentia")
t.test(primates$brainwt,rodents$brainwt)
##
## Welch Two Sample t-test
##
## data: primates$brainwt and rodents$brainwt
## t = 1.7757, df = 8.0005, p-value = 0.1137
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07482099 0.57590721
## sample estimates:
## mean of x mean of y
## 0.2541111 0.0035680
Explanation: The 95 percent confidence interval for the difference in means between the two data sets runs from -0.07 to 0.58, which includes 0. The p-value is 0.11, meaning that if the two means were actually equal, we would have an 11 percent chance of seeing a difference at least this large. We fail to reject the null hypothesis.
linear_reg
Package: parsnip
Definition: creates a straight line based on the principle of “least squares” to model the relationship between two variables
Code example:
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✓ broom 0.7.12 ✓ rsample 0.1.1
## ✓ dials 0.1.0 ✓ tune 0.2.0
## ✓ infer 1.0.0 ✓ workflows 0.2.6
## ✓ modeldata 0.1.1 ✓ workflowsets 0.2.1
## ✓ parsnip 0.2.1 ✓ yardstick 0.0.9
## ✓ recipes 0.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
regression <- linear_reg() %>% set_engine("lm") %>% fit(bodywt ~ brainwt, data = msleep)
Explanation: “regression” is the saved object for the linear model we have created to observe the change in body weight based on changing brain weight in the data set. The formula uses bare column names because data = msleep already tells fit where to find them.
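The same straight line can be fitted directly with base R's lm, which is the engine named in set_engine("lm"). The sketch below uses lm on its own so it runs without tidymodels; fit_lm is an illustrative name.

```r
library(ggplot2)  # provides the msleep data set

# Rows with a missing brainwt are dropped automatically
fit_lm <- lm(bodywt ~ brainwt, data = msleep)

coef(fit_lm)               # intercept and slope of the least-squares line
nobs(fit_lm)               # number of animals actually used in the fit
summary(fit_lm)$r.squared  # share of variance in bodywt explained by brainwt
```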
linear_reg
Package: parsnip
Definition: creates a straight line based on the principle of “least squares” to model the relationship between one dependent variable and several independent variables
Code example:
multireg <- linear_reg() %>% set_engine("lm") %>% fit(brainwt~bodywt+sleep_total+sleep_cycle, data=msleep)
Explanation: “multireg” is the saved object for the linear model we have created to observe the change in brain weight based on body weight, total sleep time, and the length of a sleep cycle in the data set.
ggpairs
Package: GGally
Definition: ggpairs generates a pairwise plot matrix comparing the relationship between the selected variables.
Code example:
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
pairwise <- ggpairs(msleep, columns = c("sleep_total", "sleep_rem"))
Explanation: A matrix is generated showing the relationships between the “sleep_total” and “sleep_rem” variables.
autoplot
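Package: ggplot2 (generic); precrec supplies the method used here
Definition: draws a default plot for an object, chosen according to the object’s class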
Code example:
library(precrec)
sscurves <- evalmod(scores = P10N10$scores, labels = P10N10$labels)
autoplot(sscurves)
Explanation: We graph the object “sscurves”, which holds the ROC and precision-recall curves that evalmod computed from the P10N10 scores and labels.
geom_histogram()
Package: ggplot2
Definition: graphs distribution of the difference of the regression line and the data points
Code example:
residuals <- ggplot(data = data.frame(resid = regression$fit$residuals), aes(x = resid)) + geom_histogram()
Explanation: Residuals are the difference between the expected value from a regression line and the actual value. The frequency of different residual values is graphed in this histogram.
predict
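Package: stats (generic); parsnip supplies the method for fitted models
Definition: predicts the output of a fitted model for a given input; predictions are most trustworthy within the range of the dataset
Code example:
prediction <- predict(regression, new_data = data.frame(brainwt = 8.4))
Explanation: We predict what the value of “bodywt” will be given a brain weight of 8.4. This is above the largest brainwt in the data (5.712), so the prediction is an extrapolation.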
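For comparison, here is a sketch of the same idea with a plain lm fit, where the argument is called newdata rather than new_data. The brain weights 0.5 and 1.0 are arbitrary values inside the observed range.

```r
library(ggplot2)  # provides the msleep data set

fit_lm <- lm(bodywt ~ brainwt, data = msleep)

# New inputs must be a data frame with the same column name as the predictor
new_animals <- data.frame(brainwt = c(0.5, 1.0))

# interval = "confidence" adds lower and upper bounds for the fitted mean
pred <- predict(fit_lm, newdata = new_animals, interval = "confidence")
pred
```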
geom_qq and stat_qq_line
Package: ggplot2
Definition: Stands for “quantile-quantile” plot. Q-Q plots identify the quantiles in your sample data and plot them against the quantiles of a theoretical distribution.
Code example:
qq <- ggplot(msleep, aes(sample = sleep_total)) +
geom_qq() +
stat_qq_line()
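Explanation: Here the quantiles of the variable “sleep_total” are plotted against the quantiles of a theoretical normal distribution. stat_qq_line adds a reference line; points that fall close to it suggest the data are approximately normal.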