Welcome to Principles of Biology 2

This manual is used to guide your coding journey as you learn how to use R programming

Week 3: Intro to R Programming

Setting up R and RStudio

Let’s make sure you have R set up and ready to go! Here is some code to help you through it. You must have R and Rstudio set up before you can move on.

If you have not previously done so, go to this webpage and download R: http://cran.r-project.org/

Then go to this webpage and download Rstudio: http://www.rstudio.com/products/RStuido

Once you have done this, open RStudio.

Setting up R Markdown

Now we are going to start by getting R markdown all set up. R markdown is a cool and useful tool that allows you to write code, submit your code, and also upload your results all at one time, as a web browser document! It makes it easier to grade, spot mistakes, and allows you to save your submitted assignments. It can take a bit to learn, but I promise it is worth it!

Here are some helpful resources in aiding you to understand R Markdown.

Video Tutorial: https://rmarkdown.rstudio.com/authoring_quick_tour.html

Help Page: https://rmarkdown.rstudio.com/lesson-15.HTML

Here is a PDF version that I find really helpful! https://posit.co/wp-content/uploads/2022/10/rmarkdown-1.pdf

#Run the following code in your console (directly in the code running area) to download the package that allows you to create R markdown files. This code will download the package if you do not have it, and skip the download if you have already done so previously.

if (!require("rmarkdown")) {
  install.packages("rmarkdown", dependencies = TRUE)
  library(rmarkdown)
}

if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}

if (!require("tinytex")) {
  install.packages("tinytex")
  library(tinytex)
  tinytex::install_tinytex()  # Install TinyTeX distribution
}

Click on the icon that looks like a white page with a little “+” on it on the top left hand corner of the program, and select “R Markdown”. This choice of file type is very important. From this point on we will be working in the “Code Editor”. This is the only way to save your work! Do not write code directly in the console.

You can erase everything after line 6. But make sure you leave the header (lines 1-5) on there!

Replace the “Untitled1” title with “YOURLASTNAME POB2 LAB - SECTION XX”.
Click the little disk to save the file, and name it the same thing. Be very selective of where you save the file, as all of your data will need to be stored in the same location. I recommend creating a folder on your desktop for this course, saving this file there, and exclusively using that folder moving forward. And remember to save your work often.
Place the following code in the file right under the header, but remove the leading hash tags. I have to put those in for the code to display to you. Any code that you write has to go between these two lines, or it will not run or display to me. Anything you write outside of this is interpreted as plain text instead of code. This means that outside of this region, you can write notes, or write me messages just like you would in a word document. You do not need hashtags. It can allow you to keep really neat files (see the help PDF for organization tricks).

This is what we do to create a new “chunk” of code.

#```{r,message=F, warning=FALSE}


#```

In the case that you want to download a package for use for the first time,you can use the following code:

The very first time you use a package, get a new computer, or update R, you will need to reinstall your packages to the R program. The general code for installing a package is install.packages(). An example can be seen below. Notice that you do need quotation marks around the package name at this stage.

Please note that the text in GREEN will need to be changed by you to list the proper package name.

install.packages("ggplot2")

After you install a package, or restart R, you will need to load the library. Note that this must be done each time you open R, not just after installing the package. This is done with the general code library(). Notice you do not need the quotations around the package name at this stage.

        #Insert package name here.
library(ggplot2)

Self-check Objective 1: Have a code chunk set up in which you load all the assigned packages. Name this code chunk “Loading Packages”. To give the chunk a title, write “# Loading Packages” above the code chunk. The hashtag tells it to create a title for the chunk. Try to include “tidyverse”, “dplyr”, and “ggplot2” for this self-check.

If you forget to read in the library of a package before you run a command that requires that package, you will get an error similar to this one: “Error in ggplot(): could not find function”ggplot”.

When you get an error that states it cannot find the function, you want to check two things (1) you loaded the library for the function and (2) there are no typos in the function name. If you are not sure which packages/library holds the function, you can find out by using the command

# type a question mark followed by the function command you want to check
 ?ggplot

This will print the help page for the function. At the very top of the page you will see the function and package written as “function {package name}”. This page also includes helpful information on how to format the code for the function and examples of how to write it.

Self-check Objective 2: In a new code chunk labeled “Function Search”, search the command function for “read.csv” and report its description as well as its proper formatting as plain text below your code chunk.

Formatting and Reading in Data from a CSV File

This is the next thing that we will be doing! You will be reading in some example data in CSV format, and using that data to learn how to edit data in R.

We often give you data that is in “wide” format. R handles data that is in “long” format more efficiently for the analyses that you do.

For more information on wide vs long data, check this out>

Make sure you data is in long format before you try to read it into R. This means that each column of your dataset should contain a single variable, and one variable should not be spread out over multiple columns. If the data is in the proper format, we use this code to read in the data: Rdataname=read.csv("filename.csv", header =T). An example is shown below.

#You can name the data anything you'd like. But you are going to reference it many times as you move forward. 
#So it should be easy to type and intuitive. In this case, I named mine "ExDat" because it is the example dataset. 
      #We use "read.csv" because it is a CSV file type. There are other options, but this is the one we use. 
ExDat=read.csv("ExDataFile.csv", header =T)
                #This file name must match EXACTLY as it appears in your folder. 
                                # header=T tells R that the very first row of your excel CSV file actually is your variable names, and not a data point.

In order to read in your data, the downloaded data MUST be in the same folder in which your RMarkdown is saved. Go to BlackBoard, and download the “ExDataFile.csv” file. You want to move this from your downloads into the folder in which you have your RMarkdown file (this one) saved or you will get an error that looks like this:

Error in file(file, “rt”) : cannot open the connection In addition: Warning message: In file(file, “rt”) : cannot open file ‘ExDataFile.csv’: No such file or directory

Some common misteps here are misusing the header=T part of the code or forgetting the “.csv” at the end of the file name. So watch out for those potential issues. If your code has run properly, you should see the dataframe listed in your environment (top right hand corner of the screen), and the following code should let you preview your data:

#please note that you will have to change "ExDat" to whatever you named your specific dataframe
head(ExDat)

##   Categorical Subcategory Numerical1 Numerical2 Integer
## 1           A           1       0.08       0.25       1
## 2           B           2       0.10       0.41       2
## 3           C           1       0.10       0.66       1
## 4           D           2       0.53       0.04       4
## 5           A           1       0.00       0.69       2
## 6           B           1       0.54       0.73       4

Sometimes the way that we enter data in excel can be misleading for R. Remember, R is simply a computer program. It interprets the data and code exactly as it is written, sometimes to a fault. So before you move on, you want to check and make sure that the data is all in the proper format (per column). For example, certain statistical tests and graphs require either categorical or continuous data. So you want to be sure that your data is being interpreted correctly. In order to check the “class” assigned to each variable, you can use the code str(), which prints the “structure” of the dataframe.

#please note that you will have to change "ExDat" to whatever you named your specific dataframe
str(ExDat)

## 'data.frame':    41 obs. of  5 variables:
##  $ Categorical: chr  "A" "B" "C" "D" ...
##  $ Subcategory: int  1 2 1 2 1 1 2 1 2 1 ...
##  $ Numerical1 : num  0.08 0.1 0.1 0.53 0 0.54 0.6 0.23 0.4 0.67 ...
##  $ Numerical2 : num  0.25 0.41 0.66 0.04 0.69 0.73 0.73 0.87 0.38 0.72 ...
##  $ Integer    : int  1 2 1 4 2 4 5 2 5 3 ...

Self-check Objective 3: In a new code chunk labeled “Structure Identification”, identify the data structure of each variable.

At this point, the Subcategory column is being read as an “int” or integer. We want it to be a categorical variable. In order to change from one class to another, we can use the general code format of DFname$Variable=as.class(DFname$Variable) where you enter your dataframe name as the “DFname”, the variable you want to change as the “Variable”, and you put the type of data you want it to be classified as in the “as.class” portion. An example can be seen below:

#I want to change the variables "category" and "subcategory" in the dataframe "ExDat" to a factor (categorical) based variable. This code tells it to look at that variable, change it to a factor, and overwrite the values in the original dataframe with the new factor based values. 
ExDat$Categorical=as.factor(ExDat$Categorical)
ExDat$Subcategory=as.factor(ExDat$Subcategory)

If you want to go to/from other data types, you can start to type as. and the list of class options will pop-up for you to choose from. If you are not sure which one to use, you can always double check by using the ? before the function to read the help summary. Some commonly used options are: 1. as.factor() 2. as.numeric() 3. as.integer() 4. as.character()

For more information on R variable types and how to choose the right one, check this out

After recategorizing a variable, you want to double check that it worked by looking at the structure of your data again.

#please note that you will have to change "ExDat" to whatever you named your specific dataframe
str(ExDat)

## 'data.frame':    41 obs. of  5 variables:
##  $ Categorical: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
##  $ Subcategory: Factor w/ 2 levels "1","2": 1 2 1 2 1 1 2 1 2 1 ...
##  $ Numerical1 : num  0.08 0.1 0.1 0.53 0 0.54 0.6 0.23 0.4 0.67 ...
##  $ Numerical2 : num  0.25 0.41 0.66 0.04 0.69 0.73 0.73 0.87 0.38 0.72 ...
##  $ Integer    : int  1 2 1 4 2 4 5 2 5 3 ...

You can see that the data is now set up with “Category” and “Subcategory” as factor based variables, with “Numerical1” and “Numerical2” as “num” or numerical variables. Integer data remains an “int” or integer. We are now ready to move on.

Self-check Objective 4: In a new code chunk labeled “Structure Reclassified” include the code showing that you were able to restructure the variable as instructed above.

Producing Summary Statistics

There are a number of reasons we produce summary statistics. The first is simply that it gives us a quick view into how our data might look. The second is that when we run statistical tests, they often produce p-values and test statistics that aren’t particularly intuitive. We needs descriptive statistics of the data in order to explain the results to someone in a way they will understand.

One way to get a quick view of summary statistics is by using the summary() command. Using this command, you will get a results print out showing summary statistics for each column in your dataframe. For continuous data (numerical and integer) you will get the Mean, Median, Min, Max, and Quantiles for each column. For categorical data, you will get the number of observations within each group.

#please note that you will have to change "ExDat" to whatever you named your specific dataframe
summary(ExDat)

##  Categorical Subcategory   Numerical1       Numerical2        Integer     
##  A:11        1:24        Min.   :0.0000   Min.   :0.0400   Min.   :1.000  
##  B:10        2:17        1st Qu.:0.2300   1st Qu.:0.2900   1st Qu.:2.000  
##  C:10                    Median :0.5300   Median :0.5300   Median :3.000  
##  D:10                    Mean   :0.4961   Mean   :0.5332   Mean   :3.049  
##                          3rd Qu.:0.7200   3rd Qu.:0.7300   3rd Qu.:4.000  
##                          Max.   :0.9800   Max.   :0.9900   Max.   :5.000

However, when you have large dataframes, sometimes this is way more detail that you want or need. Sometimes you just want summary statistics for a specific column. To do that, you can specify a variable: summary(DFname$Variable) or class(DFname$Variable)

#please note that you will have to change "ExDat" to whatever you named your specific dataframe and Numerical1 to whatever variable you want descriptive statistics for
summary(ExDat$Numerical1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2300  0.5300  0.4961  0.7200  0.9800

#This can also be done using the "class" command
class(ExDat$Numerical1)

## [1] "numeric"

You will notice that this provided a single mean for the whole column of “Numerical1”. But what if I want to know the mean for each Category? In that case, we use a specialized command called SummarySE() that is housed within the Rmics package. You will need to install and load the package as we discussed before. I recommend that you add it to your list of packages to be loaded each time you open R.

#Remember, you only need to install the package the first time you use it, if you get a new computer, or if you update R. 
install.packages("Rmisc")

#Load the library. 
#You will need to do this each and every time you open R. 
library(Rmisc)

You can now setup the code. This can be read as “Within the dataset ExDat, I want descriptive statistics on the variable we measured (Numerical1) grouped by my categorical variable (Categorical), and do it by removing all missing data (NAs in the dataset).

This prints the category name, the number of observations within each category, the mean for your measurevar (“Numerical1”), the standard deviation, standard error, and confidence interval values for each grouping (Category).

#summarySE command will tell it to get summary statistics
          #Put your dataframe name after "data="
                      #the measurevar is the numerical variable you want summary statistics for
                                                #groupvars is the variable that you want your group means divided by (one mean for each category)
                                                                        #remove any missing data from the dataset in order to run
summarySE(data=ExDat, measurevar="Numerical1", groupvars="Categorical", na.rm=T)

##   Categorical  N Numerical1        sd         se        ci
## 1           A 11  0.4063636 0.3079699 0.09285642 0.2068970
## 2           B 10  0.4490000 0.2264435 0.07160773 0.1619879
## 3           C 10  0.6640000 0.3077950 0.09733333 0.2201833
## 4           D 10  0.4740000 0.2985223 0.09440104 0.2135500

When you type it this way, it prints the answer for you immediately, but the results are not saved as a dataframe or in a variable. We sometimes want to save the results as a data.table so that you can reference the values for statistical tests and graphics. To do so, set it equal to something before you run it.

#I am saving the results as "sSE.results"
sSE.results=summarySE(data=ExDat, measurevar="Numerical1", groupvars="Categorical", na.rm=T)

When you do this, it does not automatically print the results. You have to “call” them to view them. You can do that by simply typing the name of the assingment you gave it. You will also see the results in your environment now (top right hand corner of R consule). This is how you know that you can reference it in future commands.

sSE.results

##   Categorical  N Numerical1        sd         se        ci
## 1           A 11  0.4063636 0.3079699 0.09285642 0.2068970
## 2           B 10  0.4490000 0.2264435 0.07160773 0.1619879
## 3           C 10  0.6640000 0.3077950 0.09733333 0.2201833
## 4           D 10  0.4740000 0.2985223 0.09440104 0.2135500

When we ask you to write results summaries, you will reference these numbers to compare the means of different comparison groups. You need these values in addition to the results (T value, F value ,p-value) provided by statistical tests.

You can mathematically compare the means given within the table as well. If I wanted to report the diffence between Numerical2 between Subcategories, I would use the following code. Test yourself! See if you can interpret the line of code below without any annotations provided to you.

sSE.results2=summarySE(data=ExDat, measurevar="Numerical2", groupvars="Subcategory", na.rm=T)
sSE.results2

##   Subcategory  N Numerical2        sd         se        ci
## 1           1 24  0.5300000 0.2842840 0.05802923 0.1200426
## 2           2 17  0.5376471 0.2978366 0.07223598 0.1531334

Self-check Objective 5: In a new code chunk labeled “Summary Statistics”, use the summarySE function as described. Underneath the code chunk, in sentence format, describe what this sSE.results2 calculation is doing.

From here, I can compare the specific means given within the sSE.results2 table. In order to calculate the difference between two groups, you subtract value 2 from value 1, and divide by value 2. Try to think about this logically first. If I todl you that you got an 85 on the first test, and a 93 on the second test, and I wanted to know how much you improved. You would do (93-85)/85. If you do that in R: (93-85)/85, you get 0.094. This would be equivalent to a 9.4% increase in score, which makes sense.

You can follow the same logic here. I want to know the difference in Numerical1 between Subcategory 1 and Subcategory 2. Within my “sSE.results2” table, the mean of Subcategory 1 is in Row 1, Column 3. The notation of that is [1,3]. The Subcategory 2 mean is in Row 2, Column 3 of the table, which is noted as [2,3]. You can reference those locations in your code to subtract the values.

#Take the value in row 1, column 3 of sSE.results2 (mean of Subcategory 1)
                    #Take the value in row 2, column 3 of sSE.results2 (mean of Subcategory 2)
                                      #divide by the value in row 2, column 3 of sSE.results2 (mean of Subcategory 2)
(sSE.results2[1,3]-sSE.results2[2,3])/sSE.results2[2,3]

## [1] -0.01422319

This provides you with the difference in the values on a scale of 0-1. Now you multiply by 100 to get a “percent difference” between the means of the group. So the difference between my groups would be 1.4%. This is equal to the 0.14 * 100. The negative means that subcategory 1 was lower than subcategory 2. You can also see the direction by simply looking at your table to see which mean is lower. So in this case, Numerical1 was 1.4% lower in subcategory 1 than subcategory 2.

Reading in Publicly Available Data

R offers seamless access to publicly available datasets for analysis and visualization. We can use these datasets to practice with! One such classic dataset is the mtcars dataset, which provides data on various attributes of motor cars, such as miles per gallon (mpg), horsepower (hp), and weight (wt), making it an excellent resource for statistical modeling and regression analysis.

Many other datasets can be found here: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

The mtcars dataset is preloaded in base R, so you don’t need to download it. However, if you’d like to load and work with it programmatically, you can access it as follows:

# Load the mtcars dataset
data(mtcars)

# View the first few rows of the dataset
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If you downloaded it correctly, it should show you the performance data for a variety of car types. We will be using this and other publicly available datasets throughout the semester, and you will be using one in your wrap up assignment today!

Wrap Up Assignment

Graded Objective:

Create a new coding chunk with the header “Week 3: Wrap Up Assignment”.
Access the publicly available dataset named “ChickWeight”
Report the Structure of the data in a new chunk labeled “Structure” using THREE hashtags to create the subheader.

Underneath this chunk in plain text, explain the structure of each variable.
Change the “Time” variable from numerical to factorial and rerun the structure command to show the change has occurred.

Calculate summary statistics in a new code chunk named “Summary Statistics” using THREE hashtags

Calculate the summary statistics for the weight variable using the “summary” command.
Using SummarySE command, find the average weight per diet category, be sure to remove any missing data. Print the results to a table and describe them in text below the code chunk. Be sure to use the properly formatted statements to describe your results.

How to submit your work:

You are now ready to save the assignment to a PDF Document! Let me walk you through that. This is how you will be turning in your projects this semester.

Go to the top of the window and hit the “Knit” button, making sure it is knitting to PDF

Upload your PDF file to blackboard as your submission file.

Week 5: Introduction to Graphing and Probability Distribution

Let’s all remind ourselves of the proper way to format data for analysis and graphing in R! Data is often recorded in “long” format. Please download the “BodyFat2” excel file off of BlackBoard.

Download the “BodyFat2” excel file.
Transform the data into “long” format so that it can be used in R. Think carfully about what your columns should be in this newly formatted file. Talk within your groups to decide what it should be, and we will discuss as a group before you move forward.
Save the newly formatted data as a CSV file and upload it to your R environment using the “read.csv” command.
Read in the file and name the dataframe “bf”. If you do not remember how to read a file in R go back to the directions in Week 3.

Once your data file is loaded, it should include two columns… one that indicates “Sex” and one that indicates “BodyFatPercentages”. Now your data is ready to use in statistical analysis and for plotting. Today, we are going to focus on plotting using the package ggplot.

ggplot is an incredibly powerful graphics tool. While it can take a bit of work to learn the code, once you learn it, there are very few graphics you can’t make using the same base code.

For more help on the logic of building a graph

This is also helpful

#You will need to install the package if you have not already 
install.packages("ggplot2")

#don't forget to read in the appropriate library!
library(ggplot2)

The first line of code always does the same thing: ggplot(DFname, aes(x=Var, y=Var)). This line uses the command “ggplot” to produce a plot from a specific dataframe (DFname), and you define the X and Y axes of the plot. “aes” means aesthetics - ” deals with the principles of beauty and artistic taste”. When you see “aes”, you are telling ggplot how you want the code to look.

The second line of the code describes the type of plot you are going to make. When you use the code geom_bar() you are telling ggplot that you want it to produce a barplot. When you put geom_bar(stat="summary") you are indicating what the bars will show. The heights of the bars commonly represent one of two things: (1) either a count of cases in each group, or (2) the values in a column of the data frame, such as a representation of mean. By default, geom_bar uses stat=“bin”. This makes the height of each bar equal to the number of cases in each group (#1 above)and will likely produce an error. If you want the heights of the bars to represent values (such as means) in the data, use stat=“summary” and map a value to the y aesthetic. A barplot is one MANY plots you can make with ggplot. The different types of geom_ options include:

Geom Graphic Types

All lines of code following these two are variable depending on what you want the graph to look like. The first two lines are mandatory. You must tell it what data to use, and what type of plot to make. Those lines alone will produce a graph. Everything afterward is to change the appearance of the graph or provide additional information on the plot.

ggplot(bf, aes(x=Sex, y=BodyFatPercentages)) +
  geom_bar(stat="summary")

This would work perfectly fine if you only wanted to make a barplot with no additional information. However, in this lab, we also want you to include indications of error on your barplots. The best way to add that data to your plot is to use the SummarySE package we went over a couple of weeks ago.

Use the SummarySE package to calculate the mean for each sex, and get a measurement of standard error for each sex. Save these results as “sSE.results”. Don’t forget to load the package that runs the SummarySE command first!

library(Rmisc)
sSE.results=summarySE(bf, measurevar="BodyFatPercentages", groupvars="Sex", na.rm = TRUE)

Now go ahead and view your results! You should have a mean, and other summary statistics produced for each sex.

sSE.results

##      Sex  N BodyFatPercentages       sd       se       ci
## 1 Female 10             22.290 5.319660 1.682224 3.805455
## 2   Male  8             15.375 6.368169 2.251488 5.323922

Now we will use this output to create a new ggplot graphic so that we can incorporate the measurements of error onto our plot. Start by reproducing the above plot using your summarySE results rather than the bf dataframe.

ggplot(sSE.results, aes(x=Sex, y=BodyFatPercentages)) +
  geom_bar(stat="summary")

In your code, the third line is used to add error bars to the plot. We ask you to base your error bars on the standard error. If you remember, you found standard error as part of your summarySE output. Lets take a look at mine again:

sSE.results

##      Sex  N BodyFatPercentages       sd       se       ci
## 1 Female 10             22.290 5.319660 1.682224 3.805455
## 2   Male  8             15.375 6.368169 2.251488 5.323922

You can see that I have a column that is labeled “se” and one that is labeled “BodyFatPercentages” and another called “Sex”. Remember to always look at the object you are pulling your data from to ensure that you call the correct variables.

The general code to add errorbars to a graphic is geom_errorbar(aes(ymin=NumVar - se, ymax=NumVar + se), width=0.5). This code tells it to add an errorbar to each bar (each category) that is a certain value below (ymin=NumVar - se) and above (ymax=NumVar + se) the mean for each bar. When you use summarySE, the “se” variable is automatically called that. So you will not have to change it, but you will have to replace “NumVar” with your numerical variable. Think carfully about what variable you want to use here!

If you chose the correct variable, you should have error bars that look like this:

ggplot(sSE.results, aes(x=Sex, y=BodyFatPercentages)) +
  geom_bar(stat="summary") +
  geom_errorbar(aes(ymin=BodyFatPercentages-se, ymax=BodyFatPercentages+se), width=0.5)

The “width=0.5” part of the command tells R how wide to make the errorbars visually. Try changing it from 0.5 to 0.2 and see what happens!

The xlab() and ylab() commands rename your x and y axis to something that makes sense. Because of how we feed data into R, sometimes the way we label variables doesnt make much sense to someone else that isnt familiar with the data. So here, we can rename them! Whatever you put into the parentheses will rename your axis labels, but will not alter the data itself in any way. In this example, I have renamed “BodyFatPercentages” to “Body Fat (%)”. I have also changed the theme of the graphic to change its appearance. You can find more information of ggplot themes here: https://ggplot2.tidyverse.org/reference/ggtheme.html

ggplot(sSE.results, aes(x=Sex, y=BodyFatPercentages)) +
  geom_bar(stat="summary") +
  geom_errorbar(aes(ymin=BodyFatPercentages-se, ymax=BodyFatPercentages+se), width=0.2) +
  theme_classic() +
  ylab("Body Fat (%)")

If you get stuck with ggplot, GOOGLE IT! There are millions of resources available. It is an incredibly popular tool, and someone before has very likely had the same issue. If you google an error, or a goal that you want to meet, there is likely code online to show you how to do it.

Wrap Up Assignment

Graded Objective:

Create a new coding chunk with the header “Week 5: Wrap Up Assignment”.
Access the publicly available dataset named “airquality”. This data represents daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973. The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). The included variables are:

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.

Check the structure of your data. If month is numerical in any way, change it so that it is structured as a factor.
Use this information to get summary statistics representing the average ozone in parts per billion for each month.
Create a ggplot barplot using the summary statistics results with error bars at a size of 0.3.
Rename the y-axis to “Ozone (ppb)”.
Change the theme of your ggplot to anything you’d like.
Print your final plot including all components into your PDF file.

How to submit your work:

You are now ready to save the assignment to a PDF Document! Let me walk you through that. This is how you will be turning in your projects this semester.

Go to the top of the window and hit the “Knit” button, making sure it is knitting to PDF. - If you are having trouble knitting as a PDF docuemnt, you can instead knit to HTML. After it produces your HTML webpage, open the file in the software “preview”, then go to File -> Print -> PDF.

Upload your PDF file to blackboard as your submission file.

Week 7: Cladograms in R

Setting the packages and libraries

We will use the package “phangorn” to create some phylogenetic trees using parsimony techniques. So please install the package phangorn and call the library:

#install.packages("ape")
library(ape)
#install.packages("phangorn")
library(phangorn)

If you want to learn more about the package you can type the following:

??`phangorn-package`
vignette("Trees", package="phangorn")

Reading the sequences from a FASTA format

Now let’s get started with the analyses. Please make sure to download the file containing the DNA sequences from blackboard. The file name is “10Sequences.fasta”. Please save this file in the same working directory where you saved your Rmarkdown script.

The following code will read the fasta file into a readable format in R:

plantseq10 <-phangorn::read.phyDat("10Sequences.fasta", format="fasta")

If you want to check that the file was read correctly and obtain some basic description of the file you just call the object that you created above

#this prints the characteristics of the file and DNA sequences
plantseq10

## 10 sequences with 662 character and 294 different site patterns.
## The states are a c g t

Question 1: How many DNA sequences are in the FASTA file, and how many sites are in the sequences?

Making a phylogenetic tree using Maximum parsimony (MP)

To calculate a phylogenetic tree using the maximum parsimony criterion we will use the function “pratchet”. This particular command has lots of possible parameters we can change:

maxit: maximum number of iterations

minit: minimum number of iterations

k: number of rounds of no improvement after which the ratchet is stopped

rearrangements: SPR or NNI rearrangements

trace: how much information to print after each iteration

all: whether to return all equally parsimonius trees, or just one

The parsimony ratchet is an approach that was developed by KC Nixon in 1999. This approach is a more efficient way to find better trees than by just NNI or SPR rearrangements alone. phangorn implements it this way:

Create a bootstrapped dataset from the original dataset (more on bootstrapping later).
Take the current best tree (or starting tree, for the first go-round) and perform tree rearrangements using the bootstrapped dataset, saving the best bootstrap tree.
Use the best bootstrap tree and perform tree rearrangements using the original dataset. Compare the parsimony score (the smallest number of changes necessary to describe the data for a given tree) of the bootstrapped tree and the original best tree. Whichever tree is best (ie, has the lowest parsimony score) is then saved as the new “best” tree.

This process is repeated until either the algorithm reaches the max number of iterations or the k number is reached.

For our analysis, we set the minimum iteration of the parsimony ratchet (minit) to 100 iterations, the default number for k is 10:

treeRatchet  <- phangorn::pratchet(plantseq10, trace = 0, minit=100)

#We assign branch lengths to the tree equal to the number of changes
treeRatchet <- phangorn::acctran(treeRatchet,plantseq10)

#obtaining description of the tree
treeRatchet

## 
## Phylogenetic tree with 10 tips and 8 internal nodes.
## 
## Tip labels:
##   Kalanchoe_marmorata      , Aeonium_arboreum         , Crassula_muscosa         , Crassula_falcata         , Kalanchoe_pinnata        , Kalanchoe_delagoensis    , ...
## Node labels:
##   1, 0.69, 1, 0.98, 1, 1, ...
## 
## Unrooted; includes branch length(s).

Question 2: How many tips and nodes are in the tree?

Now let’s plot the tree:

plot(treeRatchet, main="Parsimony tree") #plotting the tree

Calculating the parsimony score

#the following gives you the overall parsimony score of the best tree
phangorn::parsimony(treeRatchet, plantseq10)

## [1] 458

Question 3: What was the parsimony score for the tree constructed using DNA sequences?

Making the tree with bootstrapping

It would be nice to have a way to measure the robustness of each individual branch and clade, not just the overall tree. In phylogenetics, the measurement of choice for determining the “strength” of a branch is the bootstrap value.

Bootstrapping is a resampling method that attempts to use the original data as a way to judge the strength of phylogenetic inference. For each bootstrap replicate, the program will randomly select a number of sites to create a pseudoalignment. (For example, if the original alignment has 100 bases, the bootstrap algorithm will randomly choose 100 bases to create a new alignment. Some bases might be chosen multiple times, while other bases don’t get sampled). After generating the pseudoalignment, a new tree is built and the relationships are stored.

After all the bootstrapping replicates have been analyzed, each branch of the actual tree will be labeled with a value that reflects how frequently that branch was seen among the replicate trees. Higher values mean the branch has more support. In general, researchers consider a branch with a bootstrap value > 0.5 as well-supported.

Luckily, the “pratchet” command automatically does bootstrapping, so we already have the bootstrap information saved. We can see the bootstrap values on the trees using the “plotBS” command:

phangorn::plotBS(treeRatchet, type = "phylogram", main = 'Parsimony with Bootstrap values')

Question 4: What do the numerical values associated with the branches of the tree mean?

If you want to save your tree to open it in the software FigTree you need to add the tree into an object that we will call “my10seqTree”, and then use the function “write.tree” to save the tree:

#adding tree with bootstrap values into an object
my10seqTree <- phangorn::plotBS(treeRatchet, type = "phylogram", main = 'Parsimony with Bootstrap values')

#saving the tree in newick format:
ape::write.tree(my10seqTree, file = '10seqTree.tre')

If you want to print a prettier tree you can now open the tree using FigTree. Try to open your saved tree in FigTree. When do you this, FigTree will ask you what are the numbers in the nodes, just say “Bootstrap” (see figure below):

Figure 1. Opening Warning of FigTree to label Bootstrap values

To make sure that the bootstrap values are diplayed in FigTree, got to the right side and click “node labels”

Figure 2.Select Node labels

Now,open the arrow and in the “Display” option select “Bootstrap”. This should change the node labels to represent the bootstrap values

Figure 3.Select Bootstrap

Question 5: What do these groupings indicate about the relatedness of the DNA samples? Please describe the relationships among the different plant species based on the tree with bootstrap values

Question 6: How does this DNA tree compare to the morphological phylogeny using cladistics that you developed for the same 10 species in part B?

Class Activity:Please perform the same analyses but using the 20sequences.fasta file

Read the 20sequence FASTA file

#reading the file
plantseq20 <-phangorn::read.phyDat("20Sequences.fasta", format="fasta")
#checking the file
plantseq20

## 20 sequences with 828 character and 448 different site patterns.
## The states are a c g t

Making a phylogenetic tree using Maximum parsimony (MP) for the 20 DNA sequences

treeRatchet2  <- phangorn::pratchet(plantseq20, trace = 0, minit=100)

#We assign branch lengths to the tree equal to the number of changes
treeRatchet2 <- phangorn::acctran(treeRatchet2,plantseq20)

#obtaining description of the tree
treeRatchet2

## 
## Phylogenetic tree with 20 tips and 18 internal nodes.
## 
## Tip labels:
##   Aeonium_arboreum         , Aeonium_decorum          , Adromischus_maculatus    , Adromischus_cristatus    , Kalanchoe_schizophylla   , Kalanchoe_pinnata        , ...
## Node labels:
##   1, 1, 1, 1, 0.73, 0.81, ...
## 
## Unrooted; includes branch length(s).

Calculating the parsimony score for the 20 sequence tree

#the following gives you the overall parsimony score of the best tree
phangorn::parsimony(treeRatchet2, plantseq20)

## [1] 641

Ploting tree with bootstrap for the 20 sequence analysis

phangorn::plotBS(treeRatchet2, type = "phylogram", main = 'Parsimony with Bootstrap values')

If you want to save your tree you need to add the tree into an object that we will call “my20seqTree”, and then use the function “write.tree” to save the tree:

#adding tree with bootstrap values into an object
my20seqTree <- phangorn::plotBS(treeRatchet2, type = "phylogram", main = 'Parsimony with Bootstrap values')

#saving the tree in newick format:
ape::write.tree(my20seqTree, file = '10seqTree.tre')

Week 8: T-Tests in R

Introduction

In this exercise, we will perform a t-test to compare the mean plant heights between the Control and Treatment groups. The goal is to determine whether the difference in plant heights between these two groups is statistically significant.

Step 1: Enter Your Data in Excel

Open Excel and create a new spreadsheet.
In Column A, label it as “Group”. Enter the group (either “Control” or “Treatment”) for each data point.
In Column B, label it as “Height”. Enter the plant height measurements in centimeters.

Step 2: Save Your Data as a CSV File

After entering your data, save your spreadsheet as a CSV file:
- Click on File > Save As.
Select CSV (Comma delimited) (*.csv) from the “Save as type” dropdown.
Save the file in a folder called “T.Test POB Lab” and name it something like plant_height_data.csv.

Step 3: Load Your Data into R

Set your working directory in RStudio to the folder where you saved the CSV file. You can do this by navigating to Session > Set Working Directory > Choose Directory.

Then, run the following R code to load the data:

# Load the data from the CSV file into R
plant_data <- read.csv("plant_height_data.csv")

# View the first few rows of the data to confirm it's loaded correctly
head(plant_data)

##     Group Height
## 1 Control     25
## 2 Control     30
## 3 Control     28
## 4 Control     22
## 5 Control     27
## 6 Control     26

Step 4: Perform the T-Test

Now that your data is loaded, we can perform the t-test to compare the Control and Treatment groups:

# Perform the two-sample t-test comparing the Control and Treatment groups
t_test_result <- t.test(Height ~ Group, data = plant_data, var.equal = TRUE)

# Display the results of the t-test
print(t_test_result)

## 
##  Two Sample t-test
## 
## data:  Height by Group
## t = -10.858, df = 18, p-value = 2.478e-09
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
##  -15.51526 -10.48474
## sample estimates:
##   mean in group Control mean in group Treatment 
##                    26.5                    39.5

The output will include the t-value, degrees of freedom (df), p-value, and the confidence interval for the difference in means. Here’s how to interpret the results:

t-value: The larger the absolute t-value, the greater the difference between group means, suggesting a stronger effect. Degrees of Freedom (df): The higher the degrees of freedom, the more reliable the estimate of the difference. p-value: If p-value ≤ 0.05, the difference is statistically significant. If p-value > 0.05, the difference is not statistically significant. Confidence Interval: If the confidence interval does not include 0, it indicates a significant difference between the groups.

Step 5: Visualize the Data

It’s always helpful to visualize the data to understand the distribution. Below, we will create a box plot to compare the height distributions between the two groups.

# Load ggplot2 library for visualization
library(ggplot2)

# Create a box plot to visualize the height differences
ggplot(plant_data, aes(x = Group, y = Height, fill = Group)) +
  geom_boxplot() +
  labs(x = "Group", y = "Height (cm)") +
  theme_classic() +
  scale_fill_manual(values = c("Control" = "darkred", "Treatment" = "darkblue"))

Graded Objective:

Create a new coding chunk with the header “Week 8: Wrap Up Assignment”.
Perform T-tests:

Use R to conduct a t-test on your plant growth data for the three variables: Dry Biomass, Height, and Number of Buds.
For each variable, compare the means of two groups (e.g., control vs. treatment).

Create a Bar Graph with Error Bars:

Using the summarySE function from the Rmisc package, compute the standard error of the mean and create a bar graph.
The graph should include error bars that represent the standard error.

Figure Caption:

Write a figure caption that includes:
The sample size for each group.
A brief description of the error bars (i.e., standard error of the mean).
An indication of whether the difference is statistically significant based on the p-value from your t-test (e.g., p < 0.05 indicates statistical significance).

Results Section:

Write a results section summarizing the outcomes of each t-test, including:
The t-value, degrees of freedom (df), and p-value for each test.
Example of reporting results: “There was a significant difference in dry biomass between the control and treatment groups (t = 2.46, df = 18, p = 0.027).” Follow this format for each variable.

Print your final plot including all components into your PDF file.

Example Figure Caption and Results Section:

Figure Caption:

“Bar graph showing the mean (± SE) dry biomass for control and treatment groups. The sample size was n = 10 for each group. Error bars represent the standard error of the mean. A statistically significant difference between the groups was observed (p = 0.027).”

Results Section:

“There was a significant difference in dry biomass between the control and treatment groups (t = 2.46, df = 18, p = 0.027). Similarly, height measurements revealed no significant difference between the groups (t = 1.12, df = 18, p = 0.274). The number of buds was significantly greater in the treatment group (t = 3.85, df = 18, p = 0.002).”

How to submit your work:

You are now ready to save the assignment to a PDF Document! Let me walk you through that. This is how you will be turning in your projects this semester.

Go to the top of the window and hit the “Knit” button, making sure it is knitting to PDF. - If you are having trouble knitting as a PDF docuemnt, you can instead knit to HTML. After it produces your HTML webpage, open the file in the software “preview”, then go to File -> Print -> PDF.

Submit your R Markdown PDF and the written report (including figure caption and results section).

UNDER BUILDING

Biology Department

2025-03-03

Welcome to Principles of Biology 2

Week 3: Intro to R Programming

Setting up R and RStudio

Setting up R Markdown

Formatting and Reading in Data from a CSV File

Producing Summary Statistics

Reading in Publicly Available Data

Wrap Up Assignment

How to submit your work:

Week 5: Introduction to Graphing and Probability Distribution

Wrap Up Assignment

How to submit your work:

Week 7: Cladograms in R

Setting the packages and libraries

Reading the sequences from a FASTA format

Making a phylogenetic tree using Maximum parsimony (MP)

Calculating the parsimony score

Making the tree with bootstrapping

Class Activity:Please perform the same analyses but using the 20sequences.fasta file

Read the 20sequence FASTA file

Making a phylogenetic tree using Maximum parsimony (MP) for the 20 DNA sequences

Calculating the parsimony score for the 20 sequence tree

Ploting tree with bootstrap for the 20 sequence analysis

Week 8: T-Tests in R

Introduction

Step 1: Enter Your Data in Excel

Step 2: Save Your Data as a CSV File

Step 3: Load Your Data into R

Step 4: Perform the T-Test

Step 5: Visualize the Data

Graded Objective:

How to submit your work: