This manual is used to guide your coding journey as you learn how to use R programming
If you have not previously done so, go to this webpage and download R: http://cran.r-project.org/
Then go to this webpage and download Rstudio: http://www.rstudio.com/products/RStuido
Once you have done this, open RStudio.
Here are some helpful resources in aiding you to understand R Markdown.
Video Tutorial: https://rmarkdown.rstudio.com/authoring_quick_tour.html
Help Page: https://rmarkdown.rstudio.com/lesson-15.HTML
Here is a PDF version that I find really helpful! https://posit.co/wp-content/uploads/2022/10/rmarkdown-1.pdf
#Run the following code in your console (directly in the code running area) to download the package that allows you to create R markdown files. This code will download the package if you do not have it, and skip the download if you have already done so previously.
if (!require("rmarkdown")) {
install.packages("rmarkdown", dependencies = TRUE)
library(rmarkdown)
}
if (!require("knitr")) {
install.packages("knitr")
library(knitr)
}
if (!require("tinytex")) {
install.packages("tinytex")
library(tinytex)
tinytex::install_tinytex() # Install TinyTeX distribution
}Replace the “Untitled1” title with “YOURLASTNAME POB2 LAB - SECTION XX”.
Click the little disk to save the file, and name it the same thing. Be very selective of where you save the file, as all of your data will need to be stored in the same location. I recommend creating a folder on your desktop for this course, saving this file there, and exclusively using that folder moving forward. And remember to save your work often.
Place the following code in the file right under the header, but remove the leading hash tags. I have to put those in for the code to display to you. Any code that you write has to go between these two lines, or it will not run or display to me. Anything you write outside of this is interpreted as plain text instead of code. This means that outside of this region, you can write notes, or write me messages just like you would in a word document. You do not need hashtags. It can allow you to keep really neat files (see the help PDF for organization tricks).
The very first time you use a package, get a new computer, or update R, you will need to reinstall your packages to the R program. The general code for installing a package is
install.packages(). An example can be seen below. Notice that you do need quotation marks around the package name at this stage.
Please note that the text in GREEN will need to be changed by you to list the proper package name.
After you install a package, or restart R, you will need to load the library. Note that this must be done each time you open R, not just after installing the package. This is done with the general code
library(). Notice you do not need the quotations around the package name at this stage.
Self-check Objective 1: Have a code chunk set up in which you load all the assigned packages. Name this code chunk “Loading Packages”. To give the chunk a title, write “# Loading Packages” above the code chunk. The hashtag tells it to create a title for the chunk. Try to include “tidyverse”, “dplyr”, and “ggplot2” for this self-check.
If you forget to read in the library of a package before you run a command that requires that package, you will get an error similar to this one: “Error in ggplot(): could not find function”ggplot”.
When you get an error that states it cannot find the function, you want to check two things (1) you loaded the library for the function and (2) there are no typos in the function name. If you are not sure which packages/library holds the function, you can find out by using the command
This will print the help page for the function. At the very top of the page you will see the function and package written as “function {package name}”. This page also includes helpful information on how to format the code for the function and examples of how to write it.
Self-check Objective 2: In a new code chunk labeled “Function Search”, search the command function for “read.csv” and report its description as well as its proper formatting as plain text below your code chunk.
This is the next thing that we will be doing! You will be reading in some example data in CSV format, and using that data to learn how to edit data in R.
We often give you data that is in “wide” format. R handles data that is in “long” format more efficiently for the analyses that you do.
For more information on wide vs long data, check this out>
Make sure you data is in long format before you try to read it into R. This means that each column of your dataset should contain a single variable, and one variable should not be spread out over multiple columns. If the data is in the proper format, we use this code to read in the data:
Rdataname=read.csv("filename.csv", header =T). An example is shown below.
#You can name the data anything you'd like. But you are going to reference it many times as you move forward.
#So it should be easy to type and intuitive. In this case, I named mine "ExDat" because it is the example dataset.
#We use "read.csv" because it is a CSV file type. There are other options, but this is the one we use.
ExDat=read.csv("ExDataFile.csv", header =T)
#This file name must match EXACTLY as it appears in your folder.
# header=T tells R that the very first row of your excel CSV file actually is your variable names, and not a data point.In order to read in your data, the downloaded data MUST be in the same folder in which your RMarkdown is saved. Go to BlackBoard, and download the “ExDataFile.csv” file. You want to move this from your downloads into the folder in which you have your RMarkdown file (this one) saved or you will get an error that looks like this:
Error in file(file, “rt”) : cannot open the connection In addition: Warning message: In file(file, “rt”) : cannot open file ‘ExDataFile.csv’: No such file or directory
Some common misteps here are misusing the
header=Tpart of the code or forgetting the “.csv” at the end of the file name. So watch out for those potential issues. If your code has run properly, you should see the dataframe listed in your environment (top right hand corner of the screen), and the following code should let you preview your data:
#please note that you will have to change "ExDat" to whatever you named your specific dataframe
head(ExDat)## Categorical Subcategory Numerical1 Numerical2 Integer
## 1 A 1 0.08 0.25 1
## 2 B 2 0.10 0.41 2
## 3 C 1 0.10 0.66 1
## 4 D 2 0.53 0.04 4
## 5 A 1 0.00 0.69 2
## 6 B 1 0.54 0.73 4
Sometimes the way that we enter data in excel can be misleading for R. Remember, R is simply a computer program. It interprets the data and code exactly as it is written, sometimes to a fault. So before you move on, you want to check and make sure that the data is all in the proper format (per column). For example, certain statistical tests and graphs require either categorical or continuous data. So you want to be sure that your data is being interpreted correctly. In order to check the “class” assigned to each variable, you can use the code
str(), which prints the “structure” of the dataframe.
#please note that you will have to change "ExDat" to whatever you named your specific dataframe
str(ExDat)## 'data.frame': 41 obs. of 5 variables:
## $ Categorical: chr "A" "B" "C" "D" ...
## $ Subcategory: int 1 2 1 2 1 1 2 1 2 1 ...
## $ Numerical1 : num 0.08 0.1 0.1 0.53 0 0.54 0.6 0.23 0.4 0.67 ...
## $ Numerical2 : num 0.25 0.41 0.66 0.04 0.69 0.73 0.73 0.87 0.38 0.72 ...
## $ Integer : int 1 2 1 4 2 4 5 2 5 3 ...
Self-check Objective 3: In a new code chunk labeled “Structure Identification”, identify the data structure of each variable.
At this point, the Subcategory column is being read as an “int” or integer. We want it to be a categorical variable. In order to change from one class to another, we can use the general code format of
DFname$Variable=as.class(DFname$Variable)where you enter your dataframe name as the “DFname”, the variable you want to change as the “Variable”, and you put the type of data you want it to be classified as in the “as.class” portion. An example can be seen below:
#I want to change the variables "category" and "subcategory" in the dataframe "ExDat" to a factor (categorical) based variable. This code tells it to look at that variable, change it to a factor, and overwrite the values in the original dataframe with the new factor based values.
ExDat$Categorical=as.factor(ExDat$Categorical)
ExDat$Subcategory=as.factor(ExDat$Subcategory)If you want to go to/from other data types, you can start to type
as. and the list of class options will pop-up for you to
choose from. If you are not sure which one to use, you can always double
check by using the ? before the function to read the help
summary. Some commonly used options are: 1. as.factor() 2. as.numeric()
3. as.integer() 4. as.character()
For more information on R variable types and how to choose the right one, check this out
After recategorizing a variable, you want to double check that it worked by looking at the structure of your data again.
#please note that you will have to change "ExDat" to whatever you named your specific dataframe
str(ExDat)## 'data.frame': 41 obs. of 5 variables:
## $ Categorical: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
## $ Subcategory: Factor w/ 2 levels "1","2": 1 2 1 2 1 1 2 1 2 1 ...
## $ Numerical1 : num 0.08 0.1 0.1 0.53 0 0.54 0.6 0.23 0.4 0.67 ...
## $ Numerical2 : num 0.25 0.41 0.66 0.04 0.69 0.73 0.73 0.87 0.38 0.72 ...
## $ Integer : int 1 2 1 4 2 4 5 2 5 3 ...
You can see that the data is now set up with “Category” and “Subcategory” as factor based variables, with “Numerical1” and “Numerical2” as “num” or numerical variables. Integer data remains an “int” or integer. We are now ready to move on.
Self-check Objective 4: In a new code chunk labeled “Structure Reclassified” include the code showing that you were able to restructure the variable as instructed above.
There are a number of reasons we produce summary statistics. The first is simply that it gives us a quick view into how our data might look. The second is that when we run statistical tests, they often produce p-values and test statistics that aren’t particularly intuitive. We needs descriptive statistics of the data in order to explain the results to someone in a way they will understand.
One way to get a quick view of summary statistics is by using the
summary()command. Using this command, you will get a results print out showing summary statistics for each column in your dataframe. For continuous data (numerical and integer) you will get the Mean, Median, Min, Max, and Quantiles for each column. For categorical data, you will get the number of observations within each group.
#please note that you will have to change "ExDat" to whatever you named your specific dataframe
summary(ExDat)## Categorical Subcategory Numerical1 Numerical2 Integer
## A:11 1:24 Min. :0.0000 Min. :0.0400 Min. :1.000
## B:10 2:17 1st Qu.:0.2300 1st Qu.:0.2900 1st Qu.:2.000
## C:10 Median :0.5300 Median :0.5300 Median :3.000
## D:10 Mean :0.4961 Mean :0.5332 Mean :3.049
## 3rd Qu.:0.7200 3rd Qu.:0.7300 3rd Qu.:4.000
## Max. :0.9800 Max. :0.9900 Max. :5.000
However, when you have large dataframes, sometimes this is way more detail that you want or need. Sometimes you just want summary statistics for a specific column. To do that, you can specify a variable:
summary(DFname$Variable)orclass(DFname$Variable)
#please note that you will have to change "ExDat" to whatever you named your specific dataframe and Numerical1 to whatever variable you want descriptive statistics for
summary(ExDat$Numerical1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2300 0.5300 0.4961 0.7200 0.9800
## [1] "numeric"
You will notice that this provided a single mean for the whole column of “Numerical1”. But what if I want to know the mean for each Category? In that case, we use a specialized command called
SummarySE()that is housed within the Rmics package. You will need to install and load the package as we discussed before. I recommend that you add it to your list of packages to be loaded each time you open R.
#Remember, you only need to install the package the first time you use it, if you get a new computer, or if you update R.
install.packages("Rmisc")You can now setup the code. This can be read as “Within the dataset ExDat, I want descriptive statistics on the variable we measured (Numerical1) grouped by my categorical variable (Categorical), and do it by removing all missing data (NAs in the dataset).
This prints the category name, the number of observations within each category, the mean for your measurevar (“Numerical1”), the standard deviation, standard error, and confidence interval values for each grouping (Category).
#summarySE command will tell it to get summary statistics
#Put your dataframe name after "data="
#the measurevar is the numerical variable you want summary statistics for
#groupvars is the variable that you want your group means divided by (one mean for each category)
#remove any missing data from the dataset in order to run
summarySE(data=ExDat, measurevar="Numerical1", groupvars="Categorical", na.rm=T)## Categorical N Numerical1 sd se ci
## 1 A 11 0.4063636 0.3079699 0.09285642 0.2068970
## 2 B 10 0.4490000 0.2264435 0.07160773 0.1619879
## 3 C 10 0.6640000 0.3077950 0.09733333 0.2201833
## 4 D 10 0.4740000 0.2985223 0.09440104 0.2135500
When you type it this way, it prints the answer for you immediately, but the results are not saved as a dataframe or in a variable. We sometimes want to save the results as a data.table so that you can reference the values for statistical tests and graphics. To do so, set it equal to something before you run it.
#I am saving the results as "sSE.results"
sSE.results=summarySE(data=ExDat, measurevar="Numerical1", groupvars="Categorical", na.rm=T)When you do this, it does not automatically print the results. You have to “call” them to view them. You can do that by simply typing the name of the assingment you gave it. You will also see the results in your environment now (top right hand corner of R consule). This is how you know that you can reference it in future commands.
## Categorical N Numerical1 sd se ci
## 1 A 11 0.4063636 0.3079699 0.09285642 0.2068970
## 2 B 10 0.4490000 0.2264435 0.07160773 0.1619879
## 3 C 10 0.6640000 0.3077950 0.09733333 0.2201833
## 4 D 10 0.4740000 0.2985223 0.09440104 0.2135500
When we ask you to write results summaries, you will reference these numbers to compare the means of different comparison groups. You need these values in addition to the results (T value, F value ,p-value) provided by statistical tests.
You can mathematically compare the means given within the table as well. If I wanted to report the diffence between Numerical2 between Subcategories, I would use the following code. Test yourself! See if you can interpret the line of code below without any annotations provided to you.
sSE.results2=summarySE(data=ExDat, measurevar="Numerical2", groupvars="Subcategory", na.rm=T)
sSE.results2## Subcategory N Numerical2 sd se ci
## 1 1 24 0.5300000 0.2842840 0.05802923 0.1200426
## 2 2 17 0.5376471 0.2978366 0.07223598 0.1531334
Self-check Objective 5: In a new code chunk labeled “Summary Statistics”, use the summarySE function as described. Underneath the code chunk, in sentence format, describe what this sSE.results2 calculation is doing.
From here, I can compare the specific means given within the sSE.results2 table. In order to calculate the difference between two groups, you subtract value 2 from value 1, and divide by value 2. Try to think about this logically first. If I todl you that you got an 85 on the first test, and a 93 on the second test, and I wanted to know how much you improved. You would do (93-85)/85. If you do that in R:
(93-85)/85, you get 0.094. This would be equivalent to a 9.4% increase in score, which makes sense.
You can follow the same logic here. I want to know the difference in Numerical1 between Subcategory 1 and Subcategory 2. Within my “sSE.results2” table, the mean of Subcategory 1 is in Row 1, Column 3. The notation of that is [1,3]. The Subcategory 2 mean is in Row 2, Column 3 of the table, which is noted as [2,3]. You can reference those locations in your code to subtract the values.
#Take the value in row 1, column 3 of sSE.results2 (mean of Subcategory 1)
#Take the value in row 2, column 3 of sSE.results2 (mean of Subcategory 2)
#divide by the value in row 2, column 3 of sSE.results2 (mean of Subcategory 2)
(sSE.results2[1,3]-sSE.results2[2,3])/sSE.results2[2,3]## [1] -0.01422319
This provides you with the difference in the values on a scale of 0-1. Now you multiply by 100 to get a “percent difference” between the means of the group. So the difference between my groups would be 1.4%. This is equal to the
0.14 * 100. The negative means that subcategory 1 was lower than subcategory 2. You can also see the direction by simply looking at your table to see which mean is lower. So in this case, Numerical1 was 1.4% lower in subcategory 1 than subcategory 2.
R offers seamless access to publicly available datasets for analysis and visualization. We can use these datasets to practice with! One such classic dataset is the mtcars dataset, which provides data on various attributes of motor cars, such as miles per gallon (mpg), horsepower (hp), and weight (wt), making it an excellent resource for statistical modeling and regression analysis.
Many other datasets can be found here: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
The mtcars dataset is preloaded in base R, so you don’t need to download it. However, if you’d like to load and work with it programmatically, you can access it as follows:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
If you downloaded it correctly, it should show you the performance data for a variety of car types. We will be using this and other publicly available datasets throughout the semester, and you will be using one in your wrap up assignment today!
Graded Objective:
Let’s all remind ourselves of the proper way to format data for analysis and graphing in R! Data is often recorded in “long” format. Please download the “BodyFat2” excel file off of BlackBoard.
Once your data file is loaded, it should include two columns… one that indicates “Sex” and one that indicates “BodyFatPercentages”. Now your data is ready to use in statistical analysis and for plotting. Today, we are going to focus on plotting using the package ggplot.
ggplot is an incredibly powerful graphics tool. While it can take a bit of work to learn the code, once you learn it, there are very few graphics you can’t make using the same base code.
For more help on the logic of building a graph
The first line of code always does the same thing:
ggplot(DFname, aes(x=Var, y=Var)). This line uses the command “ggplot” to produce a plot from a specific dataframe (DFname), and you define the X and Y axes of the plot. “aes” means aesthetics - ” deals with the principles of beauty and artistic taste”. When you see “aes”, you are telling ggplot how you want the code to look.
The second line of the code describes the type of plot you are going to make. When you use the code
geom_bar()you are telling ggplot that you want it to produce a barplot. When you putgeom_bar(stat="summary")you are indicating what the bars will show. The heights of the bars commonly represent one of two things: (1) either a count of cases in each group, or (2) the values in a column of the data frame, such as a representation of mean. By default, geom_bar uses stat=“bin”. This makes the height of each bar equal to the number of cases in each group (#1 above)and will likely produce an error. If you want the heights of the bars to represent values (such as means) in the data, use stat=“summary” and map a value to the y aesthetic. A barplot is one MANY plots you can make with ggplot. The different types ofgeom_options include:
All lines of code following these two are variable depending on what you want the graph to look like. The first two lines are mandatory. You must tell it what data to use, and what type of plot to make. Those lines alone will produce a graph. Everything afterward is to change the appearance of the graph or provide additional information on the plot.
This would work perfectly fine if you only wanted to make a barplot with no additional information. However, in this lab, we also want you to include indications of error on your barplots. The best way to add that data to your plot is to use the SummarySE package we went over a couple of weeks ago.
Use the SummarySE package to calculate the mean for each sex, and get a measurement of standard error for each sex. Save these results as “sSE.results”. Don’t forget to load the package that runs the SummarySE command first!
library(Rmisc)
sSE.results=summarySE(bf, measurevar="BodyFatPercentages", groupvars="Sex", na.rm = TRUE)Now go ahead and view your results! You should have a mean, and other summary statistics produced for each sex.
## Sex N BodyFatPercentages sd se ci
## 1 Female 10 22.290 5.319660 1.682224 3.805455
## 2 Male 8 15.375 6.368169 2.251488 5.323922
Now we will use this output to create a new ggplot graphic so that we can incorporate the measurements of error onto our plot. Start by reproducing the above plot using your summarySE results rather than the bf dataframe.
In your code, the third line is used to add error bars to the plot. We ask you to base your error bars on the standard error. If you remember, you found standard error as part of your
summarySEoutput. Lets take a look at mine again:
## Sex N BodyFatPercentages sd se ci
## 1 Female 10 22.290 5.319660 1.682224 3.805455
## 2 Male 8 15.375 6.368169 2.251488 5.323922
You can see that I have a column that is labeled “se” and one that is labeled “BodyFatPercentages” and another called “Sex”. Remember to always look at the object you are pulling your data from to ensure that you call the correct variables.
The general code to add errorbars to a graphic is
geom_errorbar(aes(ymin=NumVar - se, ymax=NumVar + se), width=0.5). This code tells it to add an errorbar to each bar (each category) that is a certain value below (ymin=NumVar - se) and above (ymax=NumVar + se) the mean for each bar. When you use summarySE, the “se” variable is automatically called that. So you will not have to change it, but you will have to replace “NumVar” with your numerical variable. Think carfully about what variable you want to use here!
If you chose the correct variable, you should have error bars that look like this:
ggplot(sSE.results, aes(x=Sex, y=BodyFatPercentages)) +
geom_bar(stat="summary") +
geom_errorbar(aes(ymin=BodyFatPercentages-se, ymax=BodyFatPercentages+se), width=0.5)The “width=0.5” part of the command tells R how wide to make the errorbars visually. Try changing it from 0.5 to 0.2 and see what happens!
The
xlab()andylab()commands rename your x and y axis to something that makes sense. Because of how we feed data into R, sometimes the way we label variables doesnt make much sense to someone else that isnt familiar with the data. So here, we can rename them! Whatever you put into the parentheses will rename your axis labels, but will not alter the data itself in any way. In this example, I have renamed “BodyFatPercentages” to “Body Fat (%)”. I have also changed the theme of the graphic to change its appearance. You can find more information of ggplot themes here: https://ggplot2.tidyverse.org/reference/ggtheme.html
ggplot(sSE.results, aes(x=Sex, y=BodyFatPercentages)) +
geom_bar(stat="summary") +
geom_errorbar(aes(ymin=BodyFatPercentages-se, ymax=BodyFatPercentages+se), width=0.2) +
theme_classic() +
ylab("Body Fat (%)") If you get stuck with ggplot, GOOGLE IT! There are millions of resources available. It is an incredibly popular tool, and someone before has very likely had the same issue. If you google an error, or a goal that you want to meet, there is likely code online to show you how to do it.
Graded Objective:
We will use the package “phangorn” to create some phylogenetic trees using parsimony techniques. So please install the package phangorn and call the library:
If you want to learn more about the package you can type the following:
Now let’s get started with the analyses. Please make sure to download the file containing the DNA sequences from blackboard. The file name is “10Sequences.fasta”. Please save this file in the same working directory where you saved your Rmarkdown script.
The following code will read the fasta file into a readable format in R:
If you want to check that the file was read correctly and obtain some basic description of the file you just call the object that you created above
## 10 sequences with 662 character and 294 different site patterns.
## The states are a c g t
Question 1: How many DNA sequences are in the FASTA file, and how many sites are in the sequences?
To calculate a phylogenetic tree using the maximum parsimony criterion we will use the function “pratchet”. This particular command has lots of possible parameters we can change:
maxit: maximum number of iterations
minit: minimum number of iterations
k: number of rounds of no improvement after which the ratchet is stopped
rearrangements: SPR or NNI rearrangements
trace: how much information to print after each iteration
all: whether to return all equally parsimonius trees, or just one
The parsimony ratchet is an approach that was developed by KC Nixon in 1999. This approach is a more efficient way to find better trees than by just NNI or SPR rearrangements alone. phangorn implements it this way:
Create a bootstrapped dataset from the original dataset (more on bootstrapping later).
Take the current best tree (or starting tree, for the first go-round) and perform tree rearrangements using the bootstrapped dataset, saving the best bootstrap tree.
Use the best bootstrap tree and perform tree rearrangements using the original dataset. Compare the parsimony score (the smallest number of changes necessary to describe the data for a given tree) of the bootstrapped tree and the original best tree. Whichever tree is best (ie, has the lowest parsimony score) is then saved as the new “best” tree.
This process is repeated until either the algorithm reaches the max number of iterations or the k number is reached.
For our analysis, we set the minimum iteration of the parsimony ratchet (minit) to 100 iterations, the default number for k is 10:
treeRatchet <- phangorn::pratchet(plantseq10, trace = 0, minit=100)
#We assign branch lengths to the tree equal to the number of changes
treeRatchet <- phangorn::acctran(treeRatchet,plantseq10)
#obtaining description of the tree
treeRatchet##
## Phylogenetic tree with 10 tips and 8 internal nodes.
##
## Tip labels:
## Kalanchoe_marmorata , Aeonium_arboreum , Crassula_muscosa , Crassula_falcata , Kalanchoe_pinnata , Kalanchoe_delagoensis , ...
## Node labels:
## 1, 0.69, 1, 0.98, 1, 1, ...
##
## Unrooted; includes branch length(s).
Question 2: How many tips and nodes are in the tree?
Now let’s plot the tree:
#the following gives you the overall parsimony score of the best tree
phangorn::parsimony(treeRatchet, plantseq10)## [1] 458
Question 3: What was the parsimony score for the tree constructed using DNA sequences?
It would be nice to have a way to measure the robustness of each individual branch and clade, not just the overall tree. In phylogenetics, the measurement of choice for determining the “strength” of a branch is the bootstrap value.
Bootstrapping is a resampling method that attempts to use the original data as a way to judge the strength of phylogenetic inference. For each bootstrap replicate, the program will randomly select a number of sites to create a pseudoalignment. (For example, if the original alignment has 100 bases, the bootstrap algorithm will randomly choose 100 bases to create a new alignment. Some bases might be chosen multiple times, while other bases don’t get sampled). After generating the pseudoalignment, a new tree is built and the relationships are stored.
After all the bootstrapping replicates have been analyzed, each branch of the actual tree will be labeled with a value that reflects how frequently that branch was seen among the replicate trees. Higher values mean the branch has more support. In general, researchers consider a branch with a bootstrap value > 0.5 as well-supported.
Luckily, the “pratchet” command automatically does bootstrapping, so we already have the bootstrap information saved. We can see the bootstrap values on the trees using the “plotBS” command:
Question 4: What do the numerical values associated with the branches of the tree mean?
If you want to save your tree to open it in the software FigTree you need to add the tree into an object that we will call “my10seqTree”, and then use the function “write.tree” to save the tree:
#adding tree with bootstrap values into an object
my10seqTree <- phangorn::plotBS(treeRatchet, type = "phylogram", main = 'Parsimony with Bootstrap values')Figure 1. Opening Warning of FigTree to label Bootstrap values
To make sure that the bootstrap values are diplayed in FigTree, got to the right side and click “node labels”
Figure 2.Select Node labels
Now,open the arrow and in the “Display” option select “Bootstrap”. This should change the node labels to represent the bootstrap values
Figure 3.Select Bootstrap
Question 5: What do these groupings indicate about the relatedness of the DNA samples? Please describe the relationships among the different plant species based on the tree with bootstrap values
Question 6: How does this DNA tree compare to the morphological phylogeny using cladistics that you developed for the same 10 species in part B?
#reading the file
plantseq20 <-phangorn::read.phyDat("20Sequences.fasta", format="fasta")
#checking the file
plantseq20## 20 sequences with 828 character and 448 different site patterns.
## The states are a c g t
treeRatchet2 <- phangorn::pratchet(plantseq20, trace = 0, minit=100)
#We assign branch lengths to the tree equal to the number of changes
treeRatchet2 <- phangorn::acctran(treeRatchet2,plantseq20)
#obtaining description of the tree
treeRatchet2##
## Phylogenetic tree with 20 tips and 18 internal nodes.
##
## Tip labels:
## Aeonium_arboreum , Aeonium_decorum , Adromischus_maculatus , Adromischus_cristatus , Kalanchoe_schizophylla , Kalanchoe_pinnata , ...
## Node labels:
## 1, 1, 1, 1, 0.73, 0.81, ...
##
## Unrooted; includes branch length(s).
#the following gives you the overall parsimony score of the best tree
phangorn::parsimony(treeRatchet2, plantseq20)## [1] 641
If you want to save your tree you need to add the tree into an object that we will call “my20seqTree”, and then use the function “write.tree” to save the tree:
#adding tree with bootstrap values into an object
my20seqTree <- phangorn::plotBS(treeRatchet2, type = "phylogram", main = 'Parsimony with Bootstrap values')In this exercise, we will perform a t-test to compare the mean plant heights between the Control and Treatment groups. The goal is to determine whether the difference in plant heights between these two groups is statistically significant.
Set your working directory in RStudio to the folder where you saved the CSV file. You can do this by navigating to Session > Set Working Directory > Choose Directory.
Then, run the following R code to load the data:
# Load the data from the CSV file into R
plant_data <- read.csv("plant_height_data.csv")
# View the first few rows of the data to confirm it's loaded correctly
head(plant_data)## Group Height
## 1 Control 25
## 2 Control 30
## 3 Control 28
## 4 Control 22
## 5 Control 27
## 6 Control 26
Now that your data is loaded, we can perform the t-test to compare the Control and Treatment groups:
# Perform the two-sample t-test comparing the Control and Treatment groups
t_test_result <- t.test(Height ~ Group, data = plant_data, var.equal = TRUE)
# Display the results of the t-test
print(t_test_result)##
## Two Sample t-test
##
## data: Height by Group
## t = -10.858, df = 18, p-value = 2.478e-09
## alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
## 95 percent confidence interval:
## -15.51526 -10.48474
## sample estimates:
## mean in group Control mean in group Treatment
## 26.5 39.5
The output will include the t-value, degrees of freedom (df), p-value, and the confidence interval for the difference in means. Here’s how to interpret the results:
t-value: The larger the absolute t-value, the greater the difference between group means, suggesting a stronger effect. Degrees of Freedom (df): The higher the degrees of freedom, the more reliable the estimate of the difference. p-value: If p-value ≤ 0.05, the difference is statistically significant. If p-value > 0.05, the difference is not statistically significant. Confidence Interval: If the confidence interval does not include 0, it indicates a significant difference between the groups.
It’s always helpful to visualize the data to understand the distribution. Below, we will create a box plot to compare the height distributions between the two groups.
# Load ggplot2 library for visualization
library(ggplot2)
# Create a box plot to visualize the height differences
ggplot(plant_data, aes(x = Group, y = Height, fill = Group)) +
geom_boxplot() +
labs(x = "Group", y = "Height (cm)") +
theme_classic() +
scale_fill_manual(values = c("Control" = "darkred", "Treatment" = "darkblue"))Example Figure Caption and Results Section:
Figure Caption:
“Bar graph showing the mean (± SE) dry biomass for control and treatment groups. The sample size was n = 10 for each group. Error bars represent the standard error of the mean. A statistically significant difference between the groups was observed (p = 0.027).”
Results Section:
“There was a significant difference in dry biomass between the control and treatment groups (t = 2.46, df = 18, p = 0.027). Similarly, height measurements revealed no significant difference between the groups (t = 1.12, df = 18, p = 0.274). The number of buds was significantly greater in the treatment group (t = 3.85, df = 18, p = 0.002).”