Biostatistics - Fall 2021 - Week 2: Data Structures and basics in R

Lab Objectives

In today’s lab, the objectives are to

Provide background information about how R manages data
Practice creating and manipulating data objects in R.
Practice importing and exporting csvs in R.
Practice entering data in R.

What is an R Object?

The R language is structured around a set of recognized objects. Objects include a wide range of things, such as functions, vectors, matrices, arrays, and lists.

A function can be thought of as a specialized tool; each function is a bundle of code designed to accomplish a specific task.

Objects such as vectors, data frames, matrices, arrays, and lists can be thought of as containers for storing different types of data.

Today, we will learn about functions, vectors, and data frames, the primary objects that we will work with throughout the semester.

Vectors

A vector is a one-dimensional object that stores a collection of data points. Data points in a vector can be numbers, letters, words, etc., but they must all be of the same type.

For example, we can concatenate (i.e., combine) several different numbers into a vector that we will call x using the assignment operator <-. The assignment operator is like an arrow indicating what we’d like to store where.

In R coding, it is conventional to take the thing on the right hand side of the expression and store it in the thing on the left hand side, like this:

x <- c(3,5,12,38,456)
x

## [1]   3   5  12  38 456

However, the arrow can also point in the other direction.

c('hello','world') -> y
y

## [1] "hello" "world"

Either way, the arrow must point toward the object that you are creating.

Functions

In R, a function is a named set of code that performs a specific task. You can perform that task by typing the function and (where applicable) specifying criteria (arguments) necessary for completing the desired task. Those arguments are typed within the parentheses.

The function str() tells you the structure of a specified object. Here, structure refers to data type. The required argument here is the name of the object you want to know the structure of. Let’s try that on objects x and y, which we created above:

str(x)

##  num [1:5] 3 5 12 38 456

str(y)

##  chr [1:2] "hello" "world"

This tells us that the structure of x is num (numerical), and the structure of y is chr (character). Numerical objects are objects that contain only numbers. Character objects are interpreted by R as text, and you cannot perform mathematical operations on them.
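As a quick check of the num versus chr distinction, here is a minimal sketch using class() and is.numeric(), two base R functions that are not otherwise used in this lab. The objects ages and pets are made up for illustration.

```r
# Hypothetical example vectors (not part of the lab data)
ages <- c(21, 34, 19)
pets <- c("cat", "dog", "fish")

class(ages)        # "numeric"
class(pets)        # "character"

# is.numeric() answers the yes/no question directly
is.numeric(ages)   # TRUE
is.numeric(pets)   # FALSE

# length() reports how many items a vector holds
length(pets)       # 3
```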

Subsetting

In R, subsetting a data object means selecting only a subset of data stored in that object. Vectors (and all other container objects) can be subset using indexing. In R, indexing refers to the position of a data point within the object.

In the case of vectors, you can use indexing to look up a specific item based on where it is in the vector (1st item, 2nd item, 3rd item, etc.). For example, we can look at the 1st item in vector x in this way:

x[1]

## [1] 3

Here, you’re telling R that you want to subset x by typing square brackets after x, and you’re telling R that you want to look at the first item by placing a 1 within those brackets.
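Square brackets can hold more than a single position. The sketch below shows a few common patterns on a made-up vector v (a hypothetical object, used so that x is left unchanged).

```r
# Hypothetical vector for illustration
v <- c(10, 20, 30, 40, 50)

v[2:4]        # items 2 through 4: 20 30 40
v[c(1, 5)]    # items 1 and 5: 10 50
v[-1]         # everything except the first item: 20 30 40 50
v[v > 25]     # only the items greater than 25: 30 40 50
```

The last line uses a condition instead of positions: R keeps every item for which the condition is TRUE.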

Basic Mathematical Operations

For numeric data, R can be used as a calculator. It’s simple enough to just write the expression as you would in a calculator.

4 * 5 + 72

## [1] 92

But we can also work with numeric data that are stored in a vector. For example, we can add the first two values in x a couple different ways:

x[1] + x[2]

## [1] 8

sum(x[1:2])

## [1] 8

The first way (x[1] + x[2]) asks R to add items 1 and 2 together, while the second option (sum(x[1:2])) asks R to compute the sum of items 1 through 2. In this case, they produce the same result.

Notice that this doesn’t work for character (text) data:

y[1] + y[2]

## Error in y[1] + y[2]: non-numeric argument to binary operator
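Arithmetic can also be applied to a whole numeric vector at once. Here is a small sketch, using a made-up vector called counts, of how R applies an operation to every item in the vector.

```r
# Hypothetical numeric vector
counts <- c(3, 5, 12)

counts * 2             # every item doubled: 6 10 24
counts + c(1, 1, 1)    # item-by-item addition: 4 6 13
sum(counts)            # 20
mean(counts)           # 6.666667
```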

Object Manipulation

Objects like the vectors x and y can be appended (i.e., combined) and overwritten.

To combine vectors, use the combine function, c(). To overwrite any object, simply use the same object name but tell R to fill it with something else.

Let’s try this out. We will first look at x, then combine x and y, overwriting the old x, then look at x again to see how it has changed.

x

## [1]   3   5  12  38 456

x <- c(x,y)
x

## [1] "3"     "5"     "12"    "38"    "456"   "hello" "world"

Notice that numbers contained in x are now enclosed in quotes. That is always a giveaway that R is treating those items as characters, even if they appear numerical to you. Use str() to see how R now recognizes the elements in this vector.

str(x)

##  chr [1:7] "3" "5" "12" "38" "456" "hello" "world"

sum(x[1:2])

## Error in sum(x[1:2]): invalid 'type' (character) of argument

Vector x is now interpreted by R as a character-type object, and we can no longer add the first two items together, even though they look like numbers. Why?

As mentioned above, vectors can have numerical or character structure but not both. When you combine vectors with these two different structures, R converts the numeric data to characters because numeric data can be expressed as text, but text can’t be expressed as numbers. When this happens, R can no longer do any computation on those numbers. Hence the error message above.

Remember: any time that R produces an error message, if you don’t understand what went wrong, you can find out by copying and pasting the error message into a web search engine. Try pasting “invalid ‘type’ (character) of argument” into a search engine and see what comes up.
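If numbers have ended up stored as text, one possible way to recover them is the base R function as.numeric(). The sketch below uses made-up objects (mixed, nums, not_a_number) and is only an illustration of the conversion, not a required step in this lab.

```r
# Combining numbers and text turns everything into character data
mixed <- c(3, 5, "hello")
str(mixed)

# Convert the first two items back to numbers, then add them
nums <- as.numeric(mixed[1:2])
sum(nums)    # 8

# Text that does not look like a number becomes NA (missing);
# suppressWarnings() hides the warning R would otherwise print
not_a_number <- suppressWarnings(as.numeric(mixed[3]))
not_a_number # NA
```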

Managing Your Global Environment

The list of objects in your global environment is the set of objects that you have created. When you exit RStudio, it will ask if you want to save your current workspace image. If you say “no,” the objects currently in your global environment will not be there when you next open RStudio, and you will have to create them again. If you say “yes,” all of those objects will still be in your environment next time that you open RStudio.

Note: You can create a project (File > New Project…), which stores a specific global environment, and each time that you switch between projects your global environment will be populated with the objects previously generated for that project. We do not require that you create RStudio projects for this course, but projects are one way to efficiently manage what’s in your global environment if you’re working on different assignments/tasks.

You will find that objects pile up quickly in your global environment, particularly when you are in the exploratory phase of coding. You can clear out all of the objects in your environment by clicking on the broom icon above the list of objects. To remove a specific object from the working environment, use the remove function rm():

rm(x)
x

## Error in eval(expr, envir, enclos): object 'x' not found

Here, we have removed x from our global environment. Notice that it is no longer in the list of objects, and when we try to call it up R lets us know that it doesn’t exist anymore (“object ‘x’ not found”).
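You can also inspect the environment from the console rather than the RStudio pane. This is a minimal sketch using the base R functions ls() and exists(); the object names scratch_value and scratch_label are hypothetical.

```r
# Create two throwaway objects
scratch_value <- 1
scratch_label <- "two"

# ls() lists the names of the objects in the current environment
ls()

# exists() checks for a single object by name (note the quotes)
exists("scratch_value")   # TRUE

# After rm(), the object is gone but the other one remains
rm(scratch_value)
exists("scratch_value")   # FALSE
exists("scratch_label")   # TRUE
```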

Packages

In R, a package is like a tool kit… it is a set of functions with a common theme. To install a package, we use the install.packages() function. Let’s try this with a package called mosaic. We can’t do this in rmarkdown because it’s not in a code chunk, so copy and paste the following code into your console and hit “enter”:

install.packages("mosaic")

Notice that the name of the package must be in quotes. Every time that you start a new session in R, you must tell R which packages you want to use. There are a couple of ways you can do that, but let’s use the require() function:

require(mosaic)

## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.

Notice that when loading the package we did not use quotes. Unlike install.packages(), require() accepts the package name either with or without quotes.
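As mentioned, there is more than one way to load a package. The sketch below shows library(), the other common option, along with requireNamespace() for checking whether a package is installed. It uses the stats package, which ships with R, so that the example runs without installing anything; has_stats is a made-up object name.

```r
# Check whether a package is installed without attaching it.
# stats comes with R, so this should be TRUE on a standard installation.
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats

# library() loads a package much like require() does. The main difference:
# if the package is missing, library() stops with an error,
# while require() returns FALSE and gives a warning.
library(stats)

# The package name may be quoted or unquoted
library("stats")
```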

But what does the mosaic package actually do? Here’s one way to find out:

?mosaic

This calls up a description in the help section of RStudio. You can use this method (question mark followed by the name) to look up information on any package or function that is in your environment, and in the case of functions the description will tell you which arguments are required, which are optional, and what they mean, and it will often include examples.

Many packages come with example data, usually so that users can test-run functions or do a vignette/walk through using example data. For example, the function data(name-of-data-set) will load a data set into the environment.

Each dataset usually includes some kind of narrative in the help associated with the dataset. You can use the same help query to find out more about datasets. Let’s try this with the dataset Births78, a sample dataset that comes with the mosaic package.

?Births78

Now bring the dataset into the global environment:

data(Births78)

We can visualize these data using the plot() function. The basic formula for this is plot(y~x,data), where y refers to the variable you want on the y-axis, x refers to the variable you want on the x-axis, and data is the dataset you are pulling these from. Here, x, y, and data are required, but we can add other arguments to make our plot look pretty:

plot(births~day_of_year,data=Births78,pch=16,col='blue')

Note: the argument “pch” simply defines what type of symbol you want to use in your plot, and “col” defines the color. For more on this, see: https://www.statmethods.net/advgraphs/parameters.html

What do you think accounts for the “two” different patterns in the data?

Let’s plot the data again, but this time color code the dots by day of the week (col=wday). Here, we’re also adding a legend:

plot(births~day_of_year,data=Births78,pch=16,col=wday)
legend('top',horiz=TRUE,inset=c(-0.1,-0.2),legend=levels(Births78$wday),
       pch=16,col=unique(Births78$wday),xpd=TRUE)

Don’t worry about the specifications here, as you won’t be required to learn this level of technical detail for the class.

Seeing the data color-coded in this way, what do you think the two “waves” of data points represent?

Data Frames

Often, datasets are stored as a special type of object in R called a data frame. A data frame is a set of vectors, each of which can be of a different type of data (num, chr, etc.), but all vectors must be of the same length. Each column represents a variable/attribute, and each row represents an observation/individual.

We will use this convention for managing data throughout the course. Let’s ask for the structure of the Births78 data set.

str(Births78)

## 'data.frame':    365 obs. of  8 variables:
##  $ date        : Date, format: "1978-01-01" "1978-01-02" ...
##  $ births      : int  7701 7527 8825 8859 9043 9208 8084 7611 9172 9089 ...
##  $ wday        : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ year        : num  1978 1978 1978 1978 1978 ...
##  $ month       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day_of_year : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_month: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_week : num  1 2 3 4 5 6 7 1 2 3 ...

This time, str() told us not only the structure of the object (“data.frame”), but also the structure of each of the variables contained in that data frame. For example:

The variable date is formatted as a Date, a special structure that allows R to interpret month, year, and day information contained in a string of text that would otherwise be interpreted as a character.
The variable births is formatted as int, which stands for integer. It is similar to the num structure, in that you can perform mathematical operations on an object with int structure, but it is different in that it can only contain whole numbers.
The variable wday is formatted as an Ord.factor, which means that the variable is a factor on the ordinal scale.

We have also been provided with information on how many rows/observations Births78 contains (365), as well as how many columns/variables (8).

We can preview the dataset by clicking on its name in the list of objects, which will pull it up in a separate tab within RStudio. Alternatively, we can simply type the name of the data set at the console prompt.

This isn’t always the easiest way to see the data, however. Often there are too many samples or too many variables to fit nicely on the console output.

We can also just peek at the first few samples or the last few samples using the functions head() and tail(), respectively.

head(Births78)

##         date births wday year month day_of_year day_of_month day_of_week
## 1 1978-01-01   7701  Sun 1978     1           1            1           1
## 2 1978-01-02   7527  Mon 1978     1           2            2           2
## 3 1978-01-03   8825  Tue 1978     1           3            3           3
## 4 1978-01-04   8859  Wed 1978     1           4            4           4
## 5 1978-01-05   9043  Thu 1978     1           5            5           5
## 6 1978-01-06   9208  Fri 1978     1           6            6           6

tail(Births78)

##           date births wday year month day_of_year day_of_month day_of_week
## 360 1978-12-26   8902  Tue 1978    12         360           26           3
## 361 1978-12-27   9907  Wed 1978    12         361           27           4
## 362 1978-12-28  10177  Thu 1978    12         362           28           5
## 363 1978-12-29  10401  Fri 1978    12         363           29           6
## 364 1978-12-30   8474  Sat 1978    12         364           30           7
## 365 1978-12-31   8028  Sun 1978    12         365           31           1
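To see how a data frame is assembled from equal-length vectors, you can build a small one by hand with data.frame(). This is also one way to enter your own data in R. The data frame animals and its values are invented for illustration.

```r
# Hypothetical data: each argument becomes one column (variable),
# and all columns must have the same number of items
animals <- data.frame(
  name   = c("Rex", "Tom", "Goldie"),   # character
  weight = c(30.5, 4.2, 0.1),           # numeric
  legs   = c(4L, 4L, 0L)                # integer (the L marks whole numbers)
)

str(animals)    # structure of the data frame and of each column
nrow(animals)   # number of rows (observations): 3
ncol(animals)   # number of columns (variables): 3
```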

We can also use indexing to subset data frames. This is similar to what we did with vectors, but now we’re working in two dimensions. To subset a data frame, you indicate both row(s) and column(s), in that order (dataframe[row,column]).

For example, maybe you just want to find out what the first date on record was. Date is the first variable, so you would look it up this way:

Births78[1,1]

## [1] "1978-01-01"

Let’s say you want to look at all data for the 50th observation (i.e., row). To do so, specify row but leave column blank:

Births78[50,]

##          date births wday year month day_of_year day_of_month day_of_week
## 50 1978-02-19   7695  Sun 1978     2          50           19           1

To look at all observations for a given variable, specify column but leave row blank:

Births78[,2]

##   [1]  7701  7527  8825  8859  9043  9208  8084  7611  9172  9089  9210  9259
##  [13]  9138  8299  7771  9458  9339  9120  9226  9305  7954  7560  9252  9416
##  [25]  9090  9387  8983  7946  7527  9184  9152  9159  9218  9167  8065  7804
##  [37]  9225  9328  9139  9247  9527  8144  7950  8966  9859  9285  9103  9238
##  [49]  8167  7695  9021  9252  9335  9268  9552  8313  7881  9262  9705  9132
##  [61]  9304  9431  8008  7791  9294  9573  9212  9218  9583  8144  7870  9022
##  [73]  9525  9284  9327  9480  7965  7729  9135  9663  9307  9159  9157  7874
##  [85]  7589  9100  9293  9195  8902  9318  8069  7691  9114  9439  8852  8969
##  [97]  9077  7890  7445  8870  9023  8606  8724  9012  7527  7193  8702  9205
## [109]  8720  8582  8892  7787  7304  9017  9077  9019  8839  9047  7750  7135
## [121]  8900  9422  9051  8672  9101  7718  7388  8987  9307  9273  8903  8975
## [133]  7762  7382  9195  9200  8913  9044  9000  8064  7570  9089  9210  9196
## [145]  9180  9514  8005  7781  7780  9630  9600  9435  9303  7971  7399  9127
## [157]  9606  9328  9075  9362  8040  7581  9201  9264  9216  9175  9350  8233
## [169]  7777  9543  9672  9266  9405  9598  8122  8091  9348  9857  9701  9630
## [181] 10080  8209  7976  9284  8433  9675 10184 10241  8773  8102  9877  9852
## [193]  9705  9984 10438  8859  8416 10026 10357 10015 10386 10332  9062  8563
## [205]  9960 10349 10091 10192 10307  8677  8486  9890 10145  9824 10128 10051
## [217]  8738  8442 10206 10442 10142 10284 10162  8951  8532 10127 10502 10053
## [229] 10377 10355  8904  8477  9967 10229  9900 10152 10173  8782  8453  9998
## [241] 10387 10063  9849 10114  8580  8355  8481 10023 10703 10292 10371  9023
## [253]  8630 10154 10425 10149 10265 10265  9170  8711 10304 10711 10488 10499
## [265] 10349  8735  8647 10414 10498 10344 10175 10368  8648  8686  9927 10378
## [277]  9928  9949 10052  8605  8377  9765 10351  9873  9824  9755  8554  7873
## [289]  9531  9938  9388  9502  9625  8411  7936  9425  9576  9328  9501  9537
## [301]  8415  8155  9457  9333  9321  9245  9774  8246  8011  9507  9769  9501
## [313]  9609  9652  8352  7967  9606 10014  9536  9568  9835  8432  7868  9592
## [325]  9950  9548  7915  9037  8275  8068  9825  9814  9438  9396  9592  8528
## [337]  8196  9767  9881  9402  9480  9398  8335  8093  9686 10063  9509  9524
## [349]  9951  8507  8172 10196 10605  9998  9398  9008  7939  7964  7846  8902
## [361]  9907 10177 10401  8474  8028

Another way to look up specific values is to use variable names. A quick way to find out what those are is with the names() function:

names(Births78)

## [1] "date"         "births"       "wday"         "year"         "month"       
## [6] "day_of_year"  "day_of_month" "day_of_week"

You can then look up all values for a particular variable….

Births78$wday

##   [1] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
##  [19] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
##  [37] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
##  [55] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
##  [73] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
##  [91] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue
## [109] Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
## [127] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [145] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [163] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [181] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
## [199] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [217] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue
## [235] Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat
## [253] Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed
## [271] Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun
## [289] Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## [307] Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon
## [325] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri
## [343] Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri Sat Sun Mon Tue
## [361] Wed Thu Fri Sat Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

… or use indexing to look up a specific value.

Births78$wday[10]

## [1] Tue
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

Here, we’re using indexing to subset a specific column (wday). Because each column in a dataframe is considered a vector, we only need to specify position along that vector.
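The different ways of subsetting a data frame can be mixed. Here is a brief sketch on a small made-up data frame called weights (a hypothetical object, so the example runs without the mosaic package), showing that positions, column names, and conditions all work inside the square brackets.

```r
# Hypothetical data frame for illustration
weights <- data.frame(
  name   = c("Rex", "Tom", "Goldie"),
  weight = c(30.5, 4.2, 0.1)
)

weights[2, 2]            # row 2, column 2: 4.2
weights[2, "weight"]     # same cell, using the column name: 4.2
weights$weight[2]        # same cell, using $ and then a position: 4.2

# Keep only the rows where weight is greater than 1 (all columns)
heavy <- weights[weights$weight > 1, ]
heavy
```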

Loading and Saving Data From Outside R

R is adept at data exchange, meaning that there are a lot of utilities for importing and exporting data in all sorts of formats. For example, see this list for importing from other statistical and statistical-like software formats:

https://www.statmethods.net/input/importingdata.html

and this one for exporting to those same formats:

https://www.statmethods.net/input/exportingdata.html

Today, we’ll get practice loading a .csv (comma separated values) file, and saving to a .csv file.

First, create a folder on your computer for this class.
On Canvas, download the classdata.2020.csv file and upload it into your newly created data folder. Make sure that your csv is in the same folder as your .Rmd file.
Now use the read.csv() function to import the file as a dataframe into R. R interprets letters literally. Any misspelling or case mismatches will result in an error. Make sure you type in exactly what that file name is inside the parentheses.

classdata.2020 <- read.csv("classdata.2020.csv")

To see if you successfully imported the file you can simply type the object name:

classdata.2020

##    gender height wingspan shoe.size hair.color eye.color random.number bed.time
## 1       F     63     64.0      41.0      black     brown             4     2330
## 2       M     67     70.0      39.0      brown     brown            17     2400
## 3       F     66     69.0      41.0      brown     brown             5     2300
## 4       M     75     78.0      45.0      brown     green            21     2200
## 5       M     71     73.0      45.0      black     black            24     2200
## 6       F     62     70.0      37.0      black     black             1     2300
## 7       M     69     65.0      41.0      brown      blue            22     2230
## 8       M     70     69.0      44.0      black     brown             8     2230
## 9       F     65     67.0      40.0      blond      blue             8     2300
## 10      F     65     63.0      39.0      blond     blond             7     2000
## 11      M     69     70.0      42.0      black      blue            21      600
## 12      F     70     70.0      41.0      brown     brown           582     2200
## 13      F     72     72.0      40.0      brown     hazel             9     2330
## 14      F     68     68.0      40.0      brown     brown             4     2100
## 15      F     64     64.0      37.0      brown      blue            24     2230
## 16      F     66     68.0      40.0      brown     brown             7     2330
## 17      F     69     69.2      41.0      brown     brown           477     2300
## 18      M     69     66.0      42.0      blond      blue            23     2200
## 19      M     70     73.0      44.0      brown      blue            10      200
## 20      M     71     73.0      47.5      brown     green            49     2300
## 21      F     67     61.0      38.5      blond     hazel             2     2200
## 22      M     75     74.5      46.0      brown     hazel            59     2300
## 23      M     74     74.0      46.0      black     brown            13     2200
## 24      F     68     54.0      40.0      brown     brown             5     2300
## 25      F     63     53.5      39.0      black     brown             9      100
## 26      M     70     64.0      44.5      brown     brown            19     2230
## 27      F     61     61.5      37.0      brown     brown             6     2030
## 28      M     69     69.0      42.0      black     black            88     2300
## 29      F     60     59.0      36.0      brown     brown             4     2300
## 30      F     60     59.0      36.0      brown     brown             4     2300
## 31      M     67     69.0      42.0      black     brown            15     2300
## 32      F     67     66.0      40.0      brown     green            13      900
## 33      M     71     71.0      41.0      brown      blue            17     2300
## 34      F     70     70.0      40.0      brown     green             5     2330
## 35      F     67     67.5      41.0      black     brown             7     2330
##    wake.time hair.cut.cost dinner.drink recitation.number
## 1        500            50        water                R1
## 2        830            28        water                R1
## 3        900            60        water                R1
## 4        600            35        water                R1
## 5        600             0        water                R1
## 6        600            15        water                R1
## 7        645            29        water                R1
## 8        800            15        water                R1
## 9        700            50        water                R1
## 10       600           200        water                R1
## 11       620            30      Ice Tea                R1
## 12       730            30      seltzer                R1
## 13       700            25        water                R1
## 14       600            25        water                R1
## 15       620            30        water                R1
## 16       700            50        water                R1
## 17       600            70        water                R2
## 18       700            30        water                R2
## 19       800             0     prosecco                R2
## 20       830            22        water                R2
## 21       600             0         milk                R2
## 22       900            30        water                R2
## 23       700            20        water                R2
## 24       500            40         milk                R2
## 25       800             0        water                R2
## 26       700            17        water                R2
## 27       700            12        water                R2
## 28       600            21        water                R2
## 29       700            30        water                R2
## 30       700            30        water                R2
## 31       730            15        water                R2
## 32       630           100       stella                R2
## 33       630            16        water                R2
## 34       530            40         milk                R2
## 35       730            16        water                R2

If you just want to take a peek at the file, you can use the function head() or tail(), which allows you to view the first or last six rows of the object, respectively.

head(classdata.2020)

##   gender height wingspan shoe.size hair.color eye.color random.number bed.time
## 1      F     63       64        41      black     brown             4     2330
## 2      M     67       70        39      brown     brown            17     2400
## 3      F     66       69        41      brown     brown             5     2300
## 4      M     75       78        45      brown     green            21     2200
## 5      M     71       73        45      black     black            24     2200
## 6      F     62       70        37      black     black             1     2300
##   wake.time hair.cut.cost dinner.drink recitation.number
## 1       500            50        water                R1
## 2       830            28        water                R1
## 3       900            60        water                R1
## 4       600            35        water                R1
## 5       600             0        water                R1
## 6       600            15        water                R1

tail(classdata.2020)

##    gender height wingspan shoe.size hair.color eye.color random.number bed.time
## 30      F     60     59.0        36      brown     brown             4     2300
## 31      M     67     69.0        42      black     brown            15     2300
## 32      F     67     66.0        40      brown     green            13      900
## 33      M     71     71.0        41      brown      blue            17     2300
## 34      F     70     70.0        40      brown     green             5     2330
## 35      F     67     67.5        41      black     brown             7     2330
##    wake.time hair.cut.cost dinner.drink recitation.number
## 30       700            30        water                R2
## 31       730            15        water                R2
## 32       630           100       stella                R2
## 33       630            16        water                R2
## 34       530            40         milk                R2
## 35       730            16        water                R2

The function str() gives you several details about the dataframe. It tells you how many rows (observations, abbreviated obs.) and how many columns (variables) are in your dataframe. It also tells you the type of each variable: for example, character (chr, typically categorical text), integer (int, whole numbers), numeric (num), or sometimes factor (a type designed for categorical data).

str(classdata.2020)

## 'data.frame':    35 obs. of  12 variables:
##  $ gender           : chr  "F" "M" "F" "M" ...
##  $ height           : int  63 67 66 75 71 62 69 70 65 65 ...
##  $ wingspan         : num  64 70 69 78 73 70 65 69 67 63 ...
##  $ shoe.size        : num  41 39 41 45 45 37 41 44 40 39 ...
##  $ hair.color       : chr  "black" "brown" "brown" "brown" ...
##  $ eye.color        : chr  "brown" "brown" "brown" "green" ...
##  $ random.number    : int  4 17 5 21 24 1 22 8 8 7 ...
##  $ bed.time         : int  2330 2400 2300 2200 2200 2300 2230 2230 2300 2000 ...
##  $ wake.time        : int  500 830 900 600 600 600 645 800 700 600 ...
##  $ hair.cut.cost    : int  50 28 60 35 0 15 29 15 50 200 ...
##  $ dinner.drink     : chr  "water" "water" "water" "water" ...
##  $ recitation.number: chr  "R1" "R1" "R1" "R1" ...
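Categorical variables stored as chr can be summarized or converted to a factor. The following is a minimal sketch on a short made-up vector (hair, hair_counts, and hair_f are hypothetical names), using the base R functions table(), factor(), and levels().

```r
# Hypothetical categorical data stored as text
hair <- c("black", "brown", "brown", "blond", "brown")

# table() counts how many times each category appears
hair_counts <- table(hair)
hair_counts

# factor() converts text into a categorical variable with a fixed set of levels
hair_f <- factor(hair)
levels(hair_f)    # "black" "blond" "brown"
str(hair_f)
```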

The function names() tells you the names of the columns. These are the variables in your dataset.

names(classdata.2020)

##  [1] "gender"            "height"            "wingspan"         
##  [4] "shoe.size"         "hair.color"        "eye.color"        
##  [7] "random.number"     "bed.time"          "wake.time"        
## [10] "hair.cut.cost"     "dinner.drink"      "recitation.number"

To view or specify a column in your dataframe, you can use the $ symbol.

classdata.2020$hair.color

##  [1] "black" "brown" "brown" "brown" "black" "black" "brown" "black" "blond"
## [10] "blond" "black" "brown" "brown" "brown" "brown" "brown" "brown" "blond"
## [19] "brown" "brown" "blond" "brown" "black" "brown" "black" "brown" "brown"
## [28] "black" "brown" "brown" "black" "brown" "brown" "brown" "black"

If I know that hair.color is column 5, I can also subset that column by its number.

classdata.2020[5]

##    hair.color
## 1       black
## 2       brown
## 3       brown
## 4       brown
## 5       black
## 6       black
## 7       brown
## 8       black
## 9       blond
## 10      blond
## 11      black
## 12      brown
## 13      brown
## 14      brown
## 15      brown
## 16      brown
## 17      brown
## 18      blond
## 19      brown
## 20      brown
## 21      blond
## 22      brown
## 23      black
## 24      brown
## 25      black
## 26      brown
## 27      brown
## 28      black
## 29      brown
## 30      brown
## 31      black
## 32      brown
## 33      brown
## 34      brown
## 35      black

If I am interested in viewing or specifying a row in my dataframe, I can simply add a comma after the number:

classdata.2020[5,]

##   gender height wingspan shoe.size hair.color eye.color random.number bed.time
## 5      M     71       73        45      black     black            24     2200
##   wake.time hair.cut.cost dinner.drink recitation.number
## 5       600             0        water                R1
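The position of the comma changes what kind of object you get back. The sketch below, on a made-up data frame called swatches, compares the forms using class(). The double-bracket form [[ ]] is not used elsewhere in this lab and is shown only for comparison.

```r
# Hypothetical data frame; stringsAsFactors = FALSE keeps text as character
# in older versions of R as well
swatches <- data.frame(
  id     = 1:3,
  colour = c("red", "green", "blue"),
  stringsAsFactors = FALSE
)

class(swatches[2])      # "data.frame": a one-column data frame
class(swatches[, 2])    # "character": the column as a plain vector
class(swatches[[2]])    # "character": also the column as a vector
class(swatches[2, ])    # "data.frame": a one-row data frame

# The vector forms give the same result as using $
identical(swatches[[2]], swatches$colour)   # TRUE
```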

What if I want to see the distribution of hair cut costs for Biostats class of 2020? I can use the hist function to view that distribution.

hist(classdata.2020$hair.cut.cost)

Notice how the graph labels are not that intuitive. We are able to change the x and y axis labels and our plot title with something called arguments. Arguments are specifications that are used within functions and are often optional. In this example, we can use the xlab and ylab arguments within our hist() function to change our respective axis labels for our hair cut cost histogram and we can use the argument main to change our histogram title.

hist(classdata.2020$hair.cut.cost, xlab="Haircut cost in US Dollars ($)", ylab ="Frequency of haircut cost", main="Biostats haircut cost distribution in 2020")

Notice the changes? Pretty neat, huh? When changing the names of your plot title and axes, make sure that your text descriptor is in quotation marks, or else it will result in an error! If you are interested in viewing all the other available arguments in the hist() function, simply type ?hist into your console.
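Saving a data frame to a .csv file works in the opposite direction, with write.csv(). The sketch below is an illustration under a few assumptions: the data frame scores and the file name scores.csv are made up, and the file is written to R's temporary folder (tempdir()) so that nothing in your class folder is overwritten.

```r
# Hypothetical data frame to export
scores <- data.frame(
  student = c("A", "B", "C"),
  score   = c(90, 85, 77)
)

# Build a file path inside R's temporary folder
out_file <- file.path(tempdir(), "scores.csv")

# row.names = FALSE leaves out the row numbers, which otherwise
# get written as an extra first column
write.csv(scores, out_file, row.names = FALSE)

# Read the file back in to confirm that the export worked
scores_back <- read.csv(out_file)
str(scores_back)
```

To save into your class folder instead, you could replace out_file with a file name in quotes, just as in read.csv().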

Assignment: Collect and enter data

Create a vector named “vector1” that includes the following: 4,5,6,2,3,1 and 8. What is the sum of vector1? What is the command to bring up the third value of vector1? Enter the code in the code box below, and provide the answers in comment form below the code chunk. (1pt)

vector1 <- c(4,5,6,2,3,1,8)
vector1

## [1] 4 5 6 2 3 1 8

sum(vector1[1:7])

## [1] 29

The sum of vector1 is 29, which was found by using the sum() function.

vector1[3]

## [1] 6

The command that brings up the third value of vector1 is vector1[3], which gives the number 6.

How would you determine the mean, maximum, and minimum of vector1? (Hint: this may be best learned through Google, e.g. type “mean function R” into your search bar.) Write the commands in the code box below. (1pt)

mean(vector1[1:7])

## [1] 4.142857

The mean of vector1 is approximately 4.14.

max(vector1[1:7])

## [1] 8

The maximum value for vector1 is 8.

min(vector1[1:7])

## [1] 1

The minimum value for vector1 is 1.
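Because these functions operate on every element of a vector by default, the [1:7] index is optional here. A brief sketch with the same values shows the shorter calls, along with range() and summary(), which report several of these statistics at once.

```r
vector1 <- c(4, 5, 6, 2, 3, 1, 8)

# sum() and mean() use every element of the vector by default
sum(vector1)
mean(vector1)

# range() returns the minimum and maximum together
range(vector1)

# summary() reports the minimum, quartiles, median, mean, and maximum
summary(vector1)
```

Indexing is only needed when a calculation should use part of the vector, for example sum(vector1[1:3]) for the first three values.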

Pick a continuous variable from the classdata.2020 object (something different than what is used in the above examples). Plot the histogram of the variable. Describe the distribution. (Hint: use the str() function to determine the data types of your variables.) (1pt)

str(classdata.2020)

## 'data.frame':    35 obs. of  12 variables:
##  $ gender           : chr  "F" "M" "F" "M" ...
##  $ height           : int  63 67 66 75 71 62 69 70 65 65 ...
##  $ wingspan         : num  64 70 69 78 73 70 65 69 67 63 ...
##  $ shoe.size        : num  41 39 41 45 45 37 41 44 40 39 ...
##  $ hair.color       : chr  "black" "brown" "brown" "brown" ...
##  $ eye.color        : chr  "brown" "brown" "brown" "green" ...
##  $ random.number    : int  4 17 5 21 24 1 22 8 8 7 ...
##  $ bed.time         : int  2330 2400 2300 2200 2200 2300 2230 2230 2300 2000 ...
##  $ wake.time        : int  500 830 900 600 600 600 645 800 700 600 ...
##  $ hair.cut.cost    : int  50 28 60 35 0 15 29 15 50 200 ...
##  $ dinner.drink     : chr  "water" "water" "water" "water" ...
##  $ recitation.number: chr  "R1" "R1" "R1" "R1" ...

hist(classdata.2020$wake.time)

Based on the histogram, it seems that more students wake between 500 and 700 hours. This can be seen in the peaks in the data around 530, 600, and 630. The histogram also shows that fewer students wake in the 700 to 900 range.
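The data types reported by str() can also be checked programmatically. The sketch below is one possible approach, using a small made-up data frame (toy, with invented values) in place of classdata.2020: sapply() applies is.numeric() to every column and returns TRUE for the columns stored as numbers.

```r
# Hypothetical data frame mixing character and numeric columns
toy <- data.frame(
  gender = c("F", "M", "F"),
  height = c(63L, 67L, 66L),
  wingspan = c(64, 70, 69),
  stringsAsFactors = FALSE
)

# TRUE for columns stored as numbers (int or num), FALSE otherwise
is_num <- sapply(toy, is.numeric)
is_num

# Names of the columns that could be passed to hist()
names(toy)[is_num]
```

A column being stored as a number does not by itself make it continuous, so it is still worth checking what each variable measures before plotting it.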

Now, using the code that you used to create the histogram in question 3, give your plot some descriptive information by changing the title and axis labels (2pts).

hist(classdata.2020$wake.time, xlab="Wake time (24-hour clock)", ylab="Frequency of wake times", main="Biostats wake time distribution in 2020")

Enter code below that allows you to visualize a specific row in your classdata.2020 object. If you are unsure how to do this, revisit the subsetting section in this document. (2pt)

classdata.2020[5,]

##   gender height wingspan shoe.size hair.color eye.color random.number bed.time
## 5      M     71       73        45      black     black            24     2200
##   wake.time hair.cut.cost dinner.drink recitation.number
## 5       600             0        water                R1

Now let’s look at the heart rate data from lecture.

HRdata <- read.csv("HRdata.csv")
boxplot(HR ~ Sex, data = HRdata)

What are the variables in the boxplot above, and what are their types (i.e. numerical, categorical, discrete, continuous)? (2pts)

The variables are the sex of each person in the class and their corresponding heart rate. Sex is categorical, whereas heart rate is numerical and continuous.

Interpret the figure. What are some trends that you notice? (1pt)

It seems as though males generally have a lower resting heart rate than females. Male heart rates seem to center around 60-70 BPM, whereas female heart rates seem to center around 70-80 BPM. There also seems to be more variability in the female heart rates than in the male heart rates. However, the overall range of the male heart rates is much larger than that of the female heart rates.
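In the boxplot() call used for this question, the formula HR ~ Sex reads as "HR grouped by Sex". Since HRdata.csv is not reproduced here, the sketch below uses a small invented data frame (hr_toy; the column names are copied from the call, the values are made up) to show the same call and how the median drawn inside each box could be computed with tapply().

```r
# Hypothetical heart rate data; values are invented for illustration
hr_toy <- data.frame(
  HR = c(62, 66, 70, 58, 74, 78, 72, 80),
  Sex = c("M", "M", "M", "M", "F", "F", "F", "F"),
  stringsAsFactors = FALSE
)

# One box per level of Sex, summarizing the HR values in that group
boxplot(HR ~ Sex, data = hr_toy, xlab = "Sex", ylab = "Heart rate (BPM)")

# Median HR within each group (the thick line inside each box)
medians <- tapply(hr_toy$HR, hr_toy$Sex, median)
medians
```

Replacing median with IQR or range in the tapply() call gives numerical summaries of spread that can back up a visual comparison of the boxes.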

Lab 1 Submission Requirements due 3 September, 2021 @ 11:59pm on Canvas

Upload your knitted R Markdown file. Both PDF and HTML formats are acceptable; however, PDFs require extra steps. To knit to PDF, you will need to install additional software that helps to format documents. Follow the instructions below for your computer to be able to knit to PDF. Otherwise, knitting directly to HTML will be the easiest option.

For Windows: You can install MiKTeX onto your computer before knitting to PDF. https://miktex.org/howto/install-miktex

For Mac: You can install MacTeX: https://tug.org/mactex/mactex-download.html