Data and Code Ledger for DA 101

This is an R Markdown document. Please fill in all example text with your own words and definitions. For each function in your glossary, you should show your code. If you need extra code to do some data wrangling or bring in a new library, please hide it from the rendered html. Each term should have:

  1. What package does it come from?
  2. General definition - what does the function do and for what kind of tasks/situations would you use it?
  3. Worked example, using your own data set, or one of the example data sets from an R package
  4. Text below the example that explains specifically, what the code accomplished. In other words, what was the input and the output? What can you do now that you have run the example that you couldn’t before?

Load useful datasets

Data Wrangling

1. Compare and contrast summary, str, and glimpse

Package: base R (summary), utils (str), dplyr (glimpse)

Definition: summary gives a per-column summary: length and class for character columns, and the minimum, quartiles, median, mean, maximum and number of NAs for numeric columns, but not the data themselves. str shows the structure of the data: the dimensions (83 × 11 for msleep), then each column's type, length and first few values. glimpse shows the same kind of information with the row and column counts at the top and as many values per column as fit on one line.

Code example:

summary(msleep)
##      name              genus               vore              order          
##  Length:83          Length:83          Length:83          Length:83         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  conservation        sleep_total      sleep_rem      sleep_cycle    
##  Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  Mode  :character   Median :10.10   Median :1.500   Median :0.3333  
##                     Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##                     3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##                     Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##                                     NA's   :22      NA's   :51      
##      awake          brainwt            bodywt        
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
##  Median :13.90   Median :0.01240   Median :   1.670  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
##                  NA's   :27
str(msleep)
## tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
##  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
##  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
##  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...
glimpse(msleep)
## Rows: 83
## Columns: 11
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…

Explanation: I ran these three commands on msleep to compare how each inspection function displays the same data, and used the output to write the definitions.


Demonstrate each of the main dplyr verbs

  2. select

Package: dplyr

Definition: Picks out certain selected columns and removes the others

Code example:

msleepFiltered <- msleep %>% select(name, genus, vore)

Explanation: This gets rid of all the other columns except for the name, genus and diet of the animals, if we would like to do so for cleaner visuals.


  3. arrange

Package: dplyr

Definition: Lets you sort rows of a dataset

Code example:

msleepAZ <- arrange(msleep, order)

Explanation: This sorts the rows alphabetically by the taxonomic order of the animal (the order column).


  4. filter

Package: dplyr

Definition: keeps only the rows of a dataset that meet a condition

Code example:

carni <- filter(msleep, vore == "carni")

Explanation: This returns only the animals in our data set that are carnivores.


  5. mutate

Package: dplyr

Definition: lets you add new columns based on existing ones

Code example:

msleep %>% mutate(sleep_avg = mean(sleep_total))
## # A tibble: 83 × 12
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
##  1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
##  2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
##  3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
##  4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
##  5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
##  6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
##  7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
##  8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
##  9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
## 10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
## # … with 73 more rows, and 3 more variables: brainwt <dbl>, bodywt <dbl>,
## #   sleep_avg <dbl>

Explanation: This creates a new variable, sleep_avg, that holds the mean of all sleep totals in every row, and returns a copy of msleep with that column added. Because the result is not assigned to an object, msleep itself is unchanged.

  6. group_by

Package: dplyr

Definition: groups the rows of a data set by the values of one or more columns, so that functions such as summarize then work within each group

Code example:

orders <- msleep %>% group_by(order) %>% summarize(n())

Explanation: This tells R to group the rows by the taxonomic order of the animal. On its own, group_by does not change how the data look; it only records the grouping. The summarize call then collapses each group to a single row holding its count.


  7. summarize

Package: dplyr

Definition: Collapses a data frame to a single row of summary values. When used after group_by, it returns one row per group.

Code example:

orders <- msleep %>% group_by(order) %>% summarize(n())

Explanation: msleep tells the computer what data set to use. group_by tells it to group the rows by order, and summarize(n()) tells it to return one row per order holding the number of animals in that order.
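
The count column above gets the default name n(). As a minimal sketch, assuming dplyr and ggplot2 are installed, the same pipeline can name its summary columns and compute more than one statistic per group. The names order_summary, n_animals and mean_sleep are made up for this illustration.

```r
library(dplyr)

# msleep ships with ggplot2; pull it in by namespace so the block is self-contained
msleep <- ggplot2::msleep

order_summary <- msleep %>%
  group_by(order) %>%                 # one group per taxonomic order
  summarize(
    n_animals  = n(),                 # number of rows in each group
    mean_sleep = mean(sleep_total)    # average total sleep in each group
  ) %>%
  arrange(desc(n_animals))            # largest orders first

order_summary
```

Naming the columns inside summarize makes the result easier to use in later steps, such as plotting or joining.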

8. Remove NAs from a data frame or from a column

drop_na

Package: tidyr

Definition: Removes rows containing NA either from the entire data frame or a specific column.

Code example:

dropna <- drop_na(msleep, conservation)

Explanation: This drops all the animals whose conservation status is missing (NA) from the dataset.


9. Use conditional statements to manipulate a dataframe

ifelse

Package: base R

Definition: returns a value based on whether the conditions of the test expression are met

Code example:

carni_flag <- ifelse(msleep$vore == "carni", "true", "false")

Explanation: ifelse checks each animal in turn and returns a character vector with one entry per row: “true” if the animal is a carnivore and “false” otherwise. Animals whose vore is missing get NA.
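
To change the data frame itself rather than produce a separate vector, ifelse can be used inside mutate. This is a sketch under the assumption that dplyr and ggplot2 are installed; the column name diet_group and the object name msleep_flagged are invented for the example.

```r
library(dplyr)

msleep <- ggplot2::msleep

# Add a two-level label column based on a condition on vore
msleep_flagged <- msleep %>%
  mutate(diet_group = ifelse(vore == "carni", "carnivore", "other"))

# Rows where vore is NA stay NA in the new column
table(msleep_flagged$diet_group, useNA = "ifany")
```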


10. Bring multiple datasets together by stacking them

rbind

Package: base R

Definition: can be used to combine vectors, matrices and data frames by rows

Code example:

newmatrix <- rbind(msleep$sleep_total, msleep$sleep_rem)

Explanation: The sleep_total and sleep_rem vectors are stacked as the two rows of a 2 × 83 matrix.
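
rbind also stacks whole data frames, as long as they have the same columns. A small sketch, assuming ggplot2 is installed for the msleep data; the object names carni_rows, herbi_rows and stacked are illustrative only.

```r
# Convert to a plain data frame so only base R tools are needed below
msleep <- as.data.frame(ggplot2::msleep)

# which() drops the rows where vore is NA
carni_rows <- msleep[which(msleep$vore == "carni"), ]
herbi_rows <- msleep[which(msleep$vore == "herbi"), ]

# Stack the herbivore rows underneath the carnivore rows
stacked <- rbind(carni_rows, herbi_rows)

nrow(stacked) == nrow(carni_rows) + nrow(herbi_rows)
```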


11. Bring multiple datasets together using merge and/or join

merge

Package: base R

Definition: merges dataframes based on rows/columns in common

Code example:

economics <- economics
economics_long <- economics_long
economics_merged <- merge(economics, economics_long, by = "date")

Explanation: The “economics” and “economics_long” data frames are merged into one larger data frame based on their shared “date” column


12. Use Regular Expressions to manipulate data

function

Package: base R

Definition: allows the user to create their own operation

Code example:

fun <- function(x)(x+1)
fun(1)
## [1] 2

Explanation: I use the function command to perform the operation 1 + 1.
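
For the regular-expression part of this item, here is a minimal base R sketch using grepl, sub and gsub on a few animal names taken from msleep. The vector and object names are made up for the example.

```r
animals <- c("Owl monkey", "Mountain beaver", "Greater short-tailed shrew", "Cow")

# grepl: TRUE where the pattern matches ("\\s" means any whitespace character)
has_space <- grepl("\\s", animals)

# sub: replace the first match; here, drop everything from the first space onward
first_word <- sub("\\s.*$", "", animals)

# gsub: replace every match; here, turn each run of non-alphanumerics into "_"
snake_case <- gsub("[^a-z0-9]+", "_", tolower(animals))

has_space
first_word
snake_case
```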


13. Transform and work with datetime values

as.date

Package: date

Definition: converts a character string into a date object, which the date package stores as a Julian date (a count of days since January 1, 1960)

Code example:

library(date)
as.date("April 8, 2022")
## [1] 8Apr2022

Explanation: as.date reads our month-day-year string and prints it in day, month, year form (8Apr2022) by default.
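
Base R has its own as.Date (capital D), which is a different function from as.date in the date package. The sketch below is an alternative that uses only base R, not the method shown above.

```r
# ISO year-month-day strings parse without a format
d <- as.Date("2022-04-08")

# Other layouts need an explicit format string
d_us <- as.Date("04/08/2022", format = "%m/%d/%Y")
d == d_us

# Reformat for display
format(d, "%d/%m/%Y")

# Date arithmetic: number of days between two dates
as.numeric(as.Date("2022-04-15") - d)

# A sequence of dates one month apart
seq(d, by = "month", length.out = 3)
```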


14. Write your own function to automate a task

function

Package: base R

Definition: allows the user to create their own operation

Code example:

func <- function(x) (x*x)
func(1:50)
##  [1]    1    4    9   16   25   36   49   64   81  100  121  144  169  196  225
## [16]  256  289  324  361  400  441  484  529  576  625  676  729  784  841  900
## [31]  961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936 2025
## [46] 2116 2209 2304 2401 2500

Explanation: The function squares its input. Because R arithmetic is vectorized, passing 1:50 squares all 50 numbers in a single call: 1 * 1 equals 1, 2 * 2 equals 4, 3 * 3 equals 9, and so on.


Polished Data Visualization

15. Histogram

Package: ggplot2

Definition: Form of data visualization used to plot the distribution of a continuous variable.

Code example:

hist <- ggplot(data=msleep,aes(x=sleep_total)) + geom_histogram(bins=20)

Explanation: ggplot invokes the ggplot2 package. We are using the “msleep” data set here, and the continuous variable “sleep_total.” bins=20 sets the number of bars (bins) that the range of the variable is divided into.


16. Bar plot

Package: ggplot2

Definition: Form of data visualization used to graph a categorical variable against a numeric variable

Code example:

bargraph <- ggplot(data=msleep, aes(x=vore, y=sleep_total, fill=vore)) + geom_bar(stat="identity")

Explanation: ggplot invokes the ggplot2 package, and data=msleep names the data set we are using. aes maps variables to the axes: the x-axis needs to be the categorical variable, so here we use the animal’s eating habits (vore), and the y-axis needs to be a numeric variable, so we use sleep time. fill=vore tells R to give each bar a different color. geom_bar specifies that the graph is going to be a bar graph, and stat=“identity” tells R to use the y values themselves for the bar heights (the default, stat=“count”, counts rows instead). Because each diet has several animals, their values are stacked, so each bar shows the summed sleep_total for that diet.
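
If the goal is one bar per diet showing the average rather than the stacked total, one option is to summarize first and then plot with geom_col. This is a sketch assuming dplyr and ggplot2 are installed; the names mean_sleep_by_vore and mean_bar are invented here.

```r
library(dplyr)
library(ggplot2)

# One row per diet, holding the mean total sleep
mean_sleep_by_vore <- msleep %>%
  group_by(vore) %>%
  summarize(mean_sleep = mean(sleep_total))

# geom_col draws bars whose heights are the y values themselves
mean_bar <- ggplot(mean_sleep_by_vore, aes(x = vore, y = mean_sleep, fill = vore)) +
  geom_col()

mean_bar
```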


17. Box plot

Package: ggplot2

Definition: Plots the distribution of a numeric variable using quartiles

Code example:

boxplot <- ggplot(data=msleep, mapping=aes(x=vore, y=awake, fill=vore))+geom_boxplot()

Explanation: I am using the msleep dataset again. There are five boxplots here (one for each category) showing the distributions of how long each category of mammal is awake. “fill=vore” tells R to give each box a unique color.


18. Scatter plot

Package: ggplot2

Definition: Graphs one (either discrete or continuous) numeric variable against another

Code example:

scatter <- ggplot(data=msleep,aes(x=sleep_cycle,y=sleep_total)) + geom_point()

Explanation: ggplot invokes the ggplot2 package. data=msleep indicates we are using the “msleep” dataset. Sleep cycle will be our independent variable and total sleep time the dependent variable. geom_point() indicates the graph will be a scatterplot.


19. Line plot

Package: ggplot2

Definition: Graphs a continuous numeric variable over a discrete numeric variable (almost always time)

Code example:

line <- ggplot(data=economics,aes(x=date,y=unemploy)) + geom_line()

Explanation: I used the economics dataset instead of the “msleep” dataset because “economics” has a time variable I can use. The time variable (date) should always be on the x-axis. For the y-axis I used the unemployment statistics. geom_line() specifies that the graph is a line graph.


20. Violin plot

Package: ggplot2

Definition: Plots the distribution of a numeric variable using density curves (wider widths for greater frequency and narrower widths for less frequency)

Code example:

violin <- ggplot(data=msleep, mapping=aes(x=vore, y=awake, fill=vore))+geom_violin()

Explanation: A violin plot is essentially the same as a box plot in terms of structure but with different shapes, so nothing, other than the ending tag, had to be changed from my boxplot example.


21. Your favorite other kind of ggplot, not yet demonstrated

geom_jitter

Package: ggplot2

Definition: Can add individual data points to your boxplots.

Code example:

jitter <- ggplot(data=msleep, mapping=aes(x=vore, y=awake, fill=vore))+ geom_boxplot() + geom_jitter()

Explanation: The original geom_boxplot function is necessary to superimpose the points over a boxplot. Otherwise the points from geom_jitter will be floating.

Statistical and Analytical Tools

22. Use rnorm to generate normally distributed data with a particular sample size, mean, and standard deviation.

Package: stats

Definition: Generates random numbers drawn from a normal distribution with a given sample size, mean and standard deviation

Code example:

mean(msleep$sleep_total)
## [1] 10.43373
sd(msleep$sleep_total)
## [1] 4.450357
rnorm <- rnorm(83,10.43373,4.450357)

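Explanation: This draws 83 random values from a normal distribution with the same mean and standard deviation as “sleep_total”. The values change on every run unless set.seed is called first.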
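
A short sketch of the same idea with named arguments and a fixed seed, so the draw can be reproduced. The seed value and the object name simulated_sleep are arbitrary choices for this example.

```r
set.seed(101)  # any fixed number makes the random draw repeatable

# 83 draws with roughly the mean and sd of sleep_total
simulated_sleep <- rnorm(n = 83, mean = 10.43, sd = 4.45)

length(simulated_sleep)
mean(simulated_sleep)   # close to 10.43, but not exactly equal
sd(simulated_sleep)     # close to 4.45, but not exactly equal
```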


23. Resample with replacement using the resample and replicate functions

Package: resample

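Definition: Resampling repeatedly draws new samples from the observed data to see how much a statistic varies from sample to sample.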

Code example:

permutation <- function(x, nA) {
    idx_a <- sample(1:length(x), nA)
    idx_b <- setdiff(1:length(x), idx_a)
    meandifference <- mean(x[idx_a]) - mean(x[idx_b])
    return(meandifference)
}

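Explanation: The permutation function randomly assigns nA of the values in x to group A and the rest to group B, then returns the difference between the two group means. Because sample is called without replace = TRUE, each value is used exactly once, so this is a permutation (shuffling) step rather than resampling with replacement.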
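
Resampling with replacement can be sketched in base R with sample(replace = TRUE) and replicate. This uses base sample rather than a resample function from any package, and the data are the first ten sleep_total values from msleep; the object names are illustrative.

```r
set.seed(101)

# First ten sleep_total values from msleep
x <- c(12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0)

# One bootstrap sample: same size as x, values may repeat
one_resample <- sample(x, size = length(x), replace = TRUE)

# Repeat 1000 times, keeping the mean of each resample
boot_means <- replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))

# Middle 95 percent of the bootstrap means
quantile(boot_means, c(0.025, 0.975))
```

The spread of boot_means shows how much the sample mean would vary from one sample of this size to another.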

24. Demonstrate how you could calculate several of these key summary statistics:

mean, median, max, min, interquartile range, standard deviation

Package: base R

Definition:
mean: sum of all values divided by n
median: middle value when the data are sorted
max: maximum value
min: minimum value
interquartile range: difference between the 75th and 25th percentiles of a data set
standard deviation: measurement of how far data points are dispersed from the average value

Code example:

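mean_sleep <- mean(msleep$sleep_total)
median_sleep <- median(msleep$sleep_total)
max_sleep <- max(msleep$sleep_total)
min_sleep <- min(msleep$sleep_total)
iqr_sleep <- IQR(msleep$sleep_total)
sd_sleep <- sd(msleep$sleep_total)

Explanation: All the statistical results have been saved as objects, under names that are distinct from the functions that produced them.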
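
sleep_total has no missing values, but columns such as brainwt do, and these functions return NA unless told to skip them. A small base R sketch using the first ten brainwt values from msleep:

```r
# First ten brainwt values from msleep, including the NAs
brainwt <- c(NA, 0.0155, NA, 0.00029, 0.423, NA, NA, NA, 0.07, 0.0982)

mean(brainwt)                  # NA, because of the missing values

# na.rm = TRUE drops the NAs before calculating
mean(brainwt, na.rm = TRUE)
median(brainwt, na.rm = TRUE)
max(brainwt, na.rm = TRUE)
min(brainwt, na.rm = TRUE)
IQR(brainwt, na.rm = TRUE)
sd(brainwt, na.rm = TRUE)
```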


25. Correlation versus correlation test

cor.test and cor_test

Package: stats

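Definition: cor measures the strength of the linear relationship between two variables on a scale of -1 to 1. cor.test reports the same coefficient and also tests whether it differs from 0, giving a p-value and a confidence interval.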

Code example:

cor.test(msleep$bodywt,msleep$brainwt)
## 
##  Pearson's product-moment correlation
## 
## data:  msleep$bodywt and msleep$brainwt
## t = 19.176, df = 54, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8891642 0.9608114
## sample estimates:
##       cor 
## 0.9337822

Explanation: The correlation between the variables of body weight and brain weight is 0.933, pointing to a strong correlation. We would have an extremely low probability of getting this result if the null hypothesis were true, i.e. that the correlation between the variables is 0. Thus we can reject the null hypothesis here.
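
To make the contrast concrete, here is a sketch (assuming ggplot2 is installed for the msleep data) that runs cor and cor.test on the same pair of variables. The object names r and ct are arbitrary.

```r
msleep <- ggplot2::msleep

# cor returns only the coefficient, and NA if any value is missing
cor(msleep$bodywt, msleep$brainwt)

# Tell cor to use only the rows where both values are present
r <- cor(msleep$bodywt, msleep$brainwt, use = "complete.obs")
r

# cor.test drops incomplete pairs itself and adds the hypothesis test
ct <- cor.test(msleep$bodywt, msleep$brainwt)
ct$estimate   # same coefficient as r
ct$p.value    # probability of a result this extreme if the true correlation were 0
ct$conf.int   # 95 percent confidence interval for the correlation
```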


26. Shapiro-wilk test

shapiro.test

Package: stats

Definition: tests the null hypothesis that a sample comes from a normal distribution. The W statistic is close to 1 for normal-looking data, and a small p-value is evidence against normality.

Code example:

shapiro.test(msleep$sleep_total)
## 
##  Shapiro-Wilk normality test
## 
## data:  msleep$sleep_total
## W = 0.97973, p-value = 0.2143
shapiro.test(msleep$brainwt)
## 
##  Shapiro-Wilk normality test
## 
## data:  msleep$brainwt
## W = 0.30082, p-value = 7.277e-15

Explanation: The “sleep_total” variable has a W of 0.98 and a p-value of 0.21, so we fail to reject the null hypothesis that it is normally distributed. The “brainwt” variable has a W of 0.30 and a p-value far below 0.05, so we reject normality. The W score does not tell us the exact shape of the distribution, but the summary output for brainwt (a mean of 0.28 against a median of 0.012) suggests a strong right skew.


27. One sample t-test

t.test

Package: stats

Definition: Tests whether the mean of a population is equal to a hypothesized value (mu, which defaults to 0)

Code example:

t.test(msleep$brainwt)
## 
##  One Sample t-test
## 
## data:  msleep$brainwt
## t = 2.1581, df = 55, p-value = 0.03531
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.02009613 0.54306673
## sample estimates:
## mean of x 
## 0.2815814

Explanation: The sample mean is 0.28, and the 95 percent confidence interval for the true mean runs from 0.02 to 0.54. If the null hypothesis (a true mean of 0) were correct, there would be a 3.5 percent chance of getting a sample mean at least this far from 0. That is enough to reject the null hypothesis at a significance level of 0.05, but not at 0.01.


28. Two sample t-test

t.test

Package: stats

Definition: Compares the means of two distributions

Code example:

primates <- msleep %>% filter(order=="Primates")
rodents <- msleep %>% filter(order=="Rodentia")
t.test(primates$brainwt,rodents$brainwt)
## 
##  Welch Two Sample t-test
## 
## data:  primates$brainwt and rodents$brainwt
## t = 1.7757, df = 8.0005, p-value = 0.1137
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07482099  0.57590721
## sample estimates:
## mean of x mean of y 
## 0.2541111 0.0035680

Explanation: The 95 percent confidence interval for the difference in means runs from -0.07 to 0.58, so it includes 0. The p-value is 0.11, meaning that if the two means were actually equal, we would have an 11 percent chance of seeing a difference at least this large. We fail to reject the null hypothesis.


29. Simple linear regression (bivariate)

linear_reg

Package: parsnip

Definition: creates a straight line based on the principle of “least squares” to model the relationship between two variables

Code example:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✓ broom        0.7.12     ✓ rsample      0.1.1 
## ✓ dials        0.1.0      ✓ tune         0.2.0 
## ✓ infer        1.0.0      ✓ workflows    0.2.6 
## ✓ modeldata    0.1.1      ✓ workflowsets 0.2.1 
## ✓ parsnip      0.2.1      ✓ yardstick    0.0.9 
## ✓ recipes      0.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
regression <- linear_reg() %>% set_engine("lm") %>% fit(bodywt ~ brainwt, data = msleep)

Explanation: “regression” is the saved object for the linear model we have created to observe the change in body weight based on changing brain weight in the data set. The formula uses bare column names because data = msleep already tells R where to find them.


30. Multiple linear regression

linear_reg

Package: parsnip

Definition: creates a straight line based on the principle of “least squares” to model the relationship between one dependent variable and several independent variables

Code example:

multireg <- linear_reg() %>% set_engine("lm") %>% fit(brainwt~bodywt+sleep_total+sleep_cycle, data=msleep)

Explanation: “multireg” is the saved object for the linear model we have created to observe the change in brain weight based on changing body weight, total sleep time, and the length of an individual sleep cycle in the data set.


31. Multicollinearity

ggpairs

Package: GGally

Definition: ggpairs generates a pairwise plot matrix comparing the relationship between the selected variables.

Code example:

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
pairwise <- ggpairs(msleep, columns = c("sleep_total", "sleep_rem"))

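Explanation: A plot matrix is generated showing the relationship between the “sleep_total” and “sleep_rem” variables, including their correlation. A high correlation between two predictors is a sign of multicollinearity.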


32. Validate your model

autoplot

Package: precrec

Definition: draws a default plot for an object, chosen according to the object's class

Code example:

library(precrec)
sscurves <- evalmod(scores = P10N10$scores, labels = P10N10$labels)
autoplot(sscurves)

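Explanation: We graph the “sscurves” object, which holds the ROC and precision-recall curves that evalmod computed from the scores and labels in the P10N10 example data.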


33. Make a histogram of your residuals

geom_histogram()

Package: ggplot2

Definition: graphs distribution of the difference of the regression line and the data points

Code example:

residual_hist <- ggplot(data = data.frame(resid = regression$fit$residuals), aes(x = resid)) + geom_histogram()

Explanation: Residuals are the difference between the expected value from a regression line and the actual value. The frequency of different residual values is graphed in this histogram.


34. Make predictions on new data using a model

predict

Package: stats (the predict generic); parsnip supplies the method for models fit with linear_reg

Definition: predicts the output for a certain input within the range of the dataset

Code example:

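# assumes regression was fit with the formula bodywt ~ brainwt
prediction <- predict(regression, new_data = data.frame(brainwt = 8.4))

Explanation: We predict what the value of “bodywt” will be given a brain weight of 8.4. Since the largest brain weight in msleep is 5.712, this prediction extrapolates beyond the data.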
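
The same idea can be sketched with base R's lm, whose predict method takes the new values through a newdata data frame. This is an alternative to the parsnip model above, assuming ggplot2 is installed for the msleep data; the brain weights in new_animals are made-up inputs inside the range of the data.

```r
msleep <- ggplot2::msleep

# Fit body weight as a linear function of brain weight (rows with NA are dropped)
fit_lm <- lm(bodywt ~ brainwt, data = msleep)

# New data must use the same column name as the predictor in the formula
new_animals <- data.frame(brainwt = c(0.05, 0.5))

# Point predictions of body weight
pred <- predict(fit_lm, newdata = new_animals)
pred

# Predictions with a 95 percent prediction interval
predict(fit_lm, newdata = new_animals, interval = "prediction")
```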


35. Make a Q-Q plot to better understand the distribution of a variable using geom_qq and stat_qq_line

Package: ggplot2

Definition: Stands for “quantile-quantile” plot. Q-Q plots identify the quantiles in your sample data and plot them against the quantiles of a theoretical distribution.

Code example:

qq <- ggplot(msleep, aes(sample = sleep_total)) +
  geom_qq() +
  stat_qq_line()

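Explanation: Here the quantiles of the variable “sleep_total” are plotted against the quantiles of a theoretical normal distribution. stat_qq_line adds the reference line that the points would follow if the variable were normally distributed.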