Introduction

Garlic mustard (Alliaria petiolata) is a plant native to Europe that is invasive in North America. This means that it alters its environment enough to impair the functioning of native ecosystems. A. petiolata was introduced to North America in the 1800s to replace real garlic, which couldn’t grow in the climate of eastern North America. Since then it has become a serious pest in many U.S. states and Canadian provinces.

Researchers and land managers are very interested in why garlic mustard is not invasive in its native range (Europe) but is highly invasive in its introduced range (North America). One leading hypothesis is called EICA (Evolution of Increased Competitive Ability), which posits that the plants brought to North America came without their native insect predators, such as butterfly and moth larvae. In its native range, garlic mustard produces nitrogen-rich toxins to deter those predators. If the plants in North America no longer need those toxins, the EICA hypothesis suggests they should evolve to shunt the extra energy away from making toxins and into growing bigger, which lets them outcompete native vegetation. Although this hypothesis is well regarded, there haven’t been continent-wide tests of it.

A few of Dr. Andy McCall’s colleagues (Biology) got together to set up such a test in Europe and North America. Specifically, they were interested in whether garlic mustard plants in North America were truly bigger and more fit (had the potential for more offspring) than plants in Europe. They asked scientists across hundreds of populations to take measurements on plants in 1m by 5m plots; these data were then uploaded and collated into the data you are about to use!

Ethical Considerations

Garlic mustard’s invasiveness is a serious economic and social issue. Eradication efforts that cost millions of dollars have been underway for decades, but the plant continues to plague natural areas. There are many stakeholders in this study, from conservationists who’d like to preserve natural areas, to farmers trying to keep land clear of invasive plants, to the ecologists running the study. The interests of these groups sometimes overlap, but they can also be at odds. An approach that might seem good to a farmer may be objectionable to a conservationist, or vice versa. Even though these data are not about people, treating them with care is important because of all the people these plants affect.

Data Explanation and Exploration

Loading Data and Libraries

First, we need to get some data! Luckily, we have access to a great dataset that is part of a worldwide study on invasive species. To understand how these data were collected, please look at the Metadata document in the GitHub repo.

Make sure the GarlicMustardData.csv file is downloaded and in the same folder (directory) as this RMarkdown file, otherwise none of the next steps will work! It’s best to start a new RStudio “Project”, as mentioned in the general lab guide.

Before continuing with this lab, you should stop here and read a few short sections of our textbook that introduce the concepts of code libraries, packages, and functions, as well as data imports. Please take a look at the section on the tidyverse as well as the Getting Started with Data Import. These are short readings that contain some crucial R skills!

Now, we have to load the packages we need into your current session. Install the tidyverse package the way that the textbook shows you. Once that’s finished, you can use the function library, which takes as its argument the name of a package. So, to load our packages for this session, you would enter the following: library(tidyverse). Enter that into the codeblock below and run it.

── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ tibble 3.1.6 ✓ purrr 0.3.4
✓ tidyr 1.1.4 ✓ stringr 1.4.0
✓ readr 2.1.1 ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()

Remember, tidyverse is actually a group of libraries that make data analysis in R much easier. In this case, we’re importing tidyverse because of two specific libraries we’d like to use: dplyr and ggplot2. We’ll learn a lot more about both of those libraries in the next two weeks!

Now you can import the data from your working directory and call it GarlicMustardData. The code should look like this: GarlicMustardData <- read_csv("GarlicMustardData.csv"). Type that into that empty codeblock below.

Rows: 404 Columns: 37
── Column specification ────────────────────────────────────────────────────────────────
Delimiter: “,”
chr (3): Pop_Code, Region, Collection_Date
dbl (34): Latitude, Longitude, Altitude, Pop_Size, Pct_Canopy_Cover, RosCount, Adult…

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

It is often a good idea to take a global look at your data after importing it. One way to do this is to use the summary function, which takes a data set name as its argument: summary(GarlicMustardData). It returns a list of the variables in the data set and information about those variables. For example, for numerical data it returns values like this:

Pop_Code Region Collection_Date Latitude
Length:404 Length:404 Length:404 Min. :33.11
Class :character Class :character Class :character 1st Qu.:40.81
Mode :character Mode :character Mode :character Median :43.72
Mean :44.86
3rd Qu.:48.40
Max. :57.02

Longitude Altitude Pop_Size Pct_Canopy_Cover RosCount
Min. :-123.406 Min. : 0.0 Min. : 3 Min. : 0.00 Min. : 0.0
1st Qu.: -79.653 1st Qu.: 40.0 1st Qu.: 25 1st Qu.: 50.00 1st Qu.: 15.5
Median : -71.759 Median : 164.0 Median : 100 Median : 70.00 Median : 72.0
Mean : -40.642 Mean : 226.8 Mean : 4601 Mean : 65.09 Mean : 227.2
3rd Qu.: 8.908 3rd Qu.: 298.2 3rd Qu.: 450 3rd Qu.: 80.00 3rd Qu.: 236.2
Max. : 42.015 Max. :1711.5 Max. :745000 Max. :100.00 Max. :6538.0
NA’s :11 NA’s :13
AdultCount RosDens AdultDens TotalDens AvgRosWidth
Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.3643
1st Qu.: 28 1st Qu.: 3.20 1st Qu.: 6.65 1st Qu.: 21.00 1st Qu.: 4.5068
Median : 61 Median : 16.58 Median : 14.80 Median : 42.14 Median : 6.8226
Mean : 118 Mean : 48.33 Mean : 26.89 Mean : 75.22 Mean : 7.9126
3rd Qu.: 142 3rd Qu.: 49.65 3rd Qu.: 30.05 3rd Qu.: 88.65 3rd Qu.: 9.4900
Max. :1447 Max. :1307.60 Max. :413.43 Max. :1357.40 Max. :46.2563
NA’s :41
AvgAdultHeight AvgNLeaves AvgNFruits Herb bio1
Min. : 5.293 Min. : 0.000 Min. : 0.00 Min. :0.0000 Min. : 3.540
1st Qu.: 55.600 1st Qu.: 7.814 1st Qu.: 11.62 1st Qu.:0.1012 1st Qu.: 8.638
Median : 71.333 Median : 10.538 Median : 20.21 Median :0.2468 Median : 9.511
Mean : 71.845 Mean : 13.931 Mean : 31.53 Mean :0.3280 Mean : 9.671
3rd Qu.: 86.865 3rd Qu.: 15.767 3rd Qu.: 39.88 3rd Qu.:0.5041 3rd Qu.:10.771
Max. :148.600 Max. :173.500 Max. :421.00 Max. :1.0000 Max. :16.787
NA’s :23 NA’s :34 NA’s :19 NA’s :40
bio2 bio3 bio4 bio5 bio6
Min. : 6.213 Min. :0.2075 Min. :0.01286 Min. :18.46 Min. :-17.732
1st Qu.: 8.553 1st Qu.:0.2964 1st Qu.:0.02391 1st Qu.:23.33 1st Qu.: -8.265
Median :10.218 Median :0.3175 Median :0.02933 Median :26.57 Median : -6.327
Mean :10.139 Mean :0.3180 Mean :0.02740 Mean :26.24 Mean : -5.981
3rd Qu.:11.437 3rd Qu.:0.3315 3rd Qu.:0.03088 3rd Qu.:28.98 3rd Qu.: -3.465
Max. :15.799 Max. :0.4253 Max. :0.04070 Max. :34.37 Max. : 4.429

  bio7            bio8             bio9             bio10           bio11

Min. :16.95 Min. :-1.519 Min. :-8.9392 Min. :13.74 Min. :-9.6822
1st Qu.:27.29 1st Qu.:13.465 1st Qu.:-1.2444 1st Qu.:17.16 1st Qu.:-1.8492
Median :34.29 Median :17.038 Median : 0.4213 Median :19.50 Median :-0.2288
Mean :32.22 Mean :15.631 Mean : 2.6439 Mean :19.37 Mean :-0.4633
3rd Qu.:36.66 3rd Qu.:19.256 3rd Qu.: 3.9786 3rd Qu.:21.58 3rd Qu.: 1.1212
Max. :45.38 Max. :23.993 Max. :23.7206 Max. :25.85 Max. : 9.0717

 bio12            bio13           bio14            bio15             bio16

Min. : 409.0 Min. :12.00 Min. : 0.000 Min. :0.06064 Min. :141.6
1st Qu.: 752.0 1st Qu.:19.23 1st Qu.: 8.292 1st Qu.:0.11296 1st Qu.:230.9
Median : 959.0 Median :24.63 Median :11.788 Median :0.17024 Median :303.5
Mean : 925.8 Mean :23.71 Mean :12.325 Mean :0.21134 Mean :287.7
3rd Qu.:1123.0 3rd Qu.:26.70 3rd Qu.:17.116 3rd Qu.:0.27428 3rd Qu.:324.5
Max. :1530.4 Max. :55.70 Max. :22.647 Max. :1.04230 Max. :674.0

 bio17              bio18              bio19

Min. : 0.4583 Min. : 0.5635 Min. : 46.8
1st Qu.:125.7550 1st Qu.:225.2529 1st Qu.:153.1
Median :172.6889 Median :277.3571 Median :194.6
Mean :178.2972 Mean :262.7103 Mean :198.7
3rd Qu.:250.4504 3rd Qu.:308.4265 3rd Qu.:260.8
Max. :304.0691 Max. :389.3464 Max. :637.3

Besides the minimum and maximum values for a variable, it returns the median and the mean. It also returns the 1st quartile, which is the value below which about a quarter of the data fall (roughly the middle value between the smallest number and the median). The 2nd quartile is the median of the data, and the 3rd quartile is the value below which about three quarters of the data fall (roughly the middle value between the median and the largest number). So looking at the quartiles tells you something about the spread of the data. For columns with missing values, summary also reports the number of NA’s.
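To make the quartile idea concrete, here is a minimal sketch using base R's quantile function. The heights vector holds made-up numbers for illustration only; it is not taken from the garlic mustard data.

```r
# Toy vector of plant heights (made-up values, not from the dataset)
heights <- c(40, 55, 60, 71, 80, 86, 95, 110)

# Ask for the 1st, 2nd, and 3rd quartiles
quartiles <- quantile(heights, probs = c(0.25, 0.5, 0.75))
quartiles

# The 2nd quartile is the same number as the median
all.equal(quartiles[["50%"]], median(heights))

# Share of values at or below the 1st and 3rd quartiles
mean(heights <= quartiles[["25%"]])
mean(heights <= quartiles[["75%"]])
```

With these toy values, a quarter of the numbers sit at or below the 1st quartile and three quarters sit at or below the 3rd quartile, which is what the summary output is describing for each numeric column.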

Data Preparation and Summarization

Now we’re ready to start manipulating our data sets.

Let’s create a new vector holding the names of just three variables. In the codeblock below, enter the following: myvars <- c("Latitude", "Longitude", "Altitude"). Then ask R to make a new data table called GarlicMustardGeo and fill it with only those columns of data whose names appear in the vector myvars. Note that myvars is in square brackets, not parentheses: GarlicMustardGeo <- GarlicMustardData[myvars]. At this point, we should check that we really do have a reduced data table with only those three variables. You can do this using the summary function as before: summary(GarlicMustardGeo). Enter all of these on separate lines into the next codeblock.

Latitude Longitude Altitude
Min. :33.11 Min. :-123.406 Min. : 0.0
1st Qu.:40.81 1st Qu.: -79.653 1st Qu.: 40.0
Median :43.72 Median : -71.759 Median : 164.0
Mean :44.86 Mean : -40.642 Mean : 226.8
3rd Qu.:48.40 3rd Qu.: 8.908 3rd Qu.: 298.2
Max. :57.02 Max. : 42.015 Max. :1711.5

So, did you have data only on latitude, longitude, and altitude?

Now, let’s see if we can select non-adjoining variables (columns) from the original data set and then put them into a new data set, GarlicMustard_subset. We will do this using select from the dplyr package: GarlicMustard_subset <- select(GarlicMustardData,1,5:8). Try looking at the help for select to see how it works.
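As a rough illustration of how select picks columns, the sketch below uses a small made-up data frame (toy is a hypothetical stand-in, not the lab data). It shows that selecting by position, by name, and through a character vector can all return the same columns.

```r
library(dplyr)

# Made-up data frame standing in for the real dataset
toy <- data.frame(
  Pop_Code  = c("A", "B", "C"),
  Region    = c("Europe", "NorthAm", "NorthAm"),
  Latitude  = c(51.2, 43.7, 40.1),
  Longitude = c(8.9, -71.8, -79.6),
  Altitude  = c(298, 164, 40)
)

# Select by position: column 1 plus columns 4 through 5
by_position <- select(toy, 1, 4:5)

# Select the same columns by name
by_name <- select(toy, Pop_Code, Longitude, Altitude)

# A character vector of names can be passed through all_of()
myvars <- c("Pop_Code", "Longitude", "Altitude")
by_vector <- select(toy, all_of(myvars))

# All three approaches give the same result here
identical(by_position, by_name)
identical(by_name, by_vector)
```

Selecting by name is usually easier to read and keeps working if the column order of the source file changes, whereas positions such as 5:8 depend on that order.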

Look at the new data using the summary function as above. Did it work?

Pop_Code Longitude Altitude Pop_Size
Length:404 Min. :-123.406 Min. : 0.0 Min. : 3
Class :character 1st Qu.: -79.653 1st Qu.: 40.0 1st Qu.: 25
Mode :character Median : -71.759 Median : 164.0 Median : 100
Mean : -40.642 Mean : 226.8 Mean : 4601
3rd Qu.: 8.908 3rd Qu.: 298.2 3rd Qu.: 450
Max. : 42.015 Max. :1711.5 Max. :745000
NA’s :11
Pct_Canopy_Cover
Min. : 0.00
1st Qu.: 50.00
Median : 70.00
Mean : 65.09
3rd Qu.: 80.00
Max. :100.00
NA’s :13

Now, use the filter function to select all rows with a total density of 4 or greater and an altitude of 100 or greater:

GM_filtered <- GarlicMustardData %>%
  filter(TotalDens>=4 & Altitude>=100)
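One detail worth knowing about filter is that it keeps only rows where the condition is TRUE, so rows with a missing value in a tested column are dropped as well. The sketch below shows this on a made-up data frame (toy and its values are hypothetical).

```r
library(dplyr)

# Made-up data: population D has a missing altitude
toy <- data.frame(
  Pop_Code  = c("A", "B", "C", "D"),
  TotalDens = c(2, 10, 50, 30),
  Altitude  = c(150, 90, 300, NA)
)

# A comma between conditions means "and", just like &
kept <- toy %>%
  filter(TotalDens >= 4, Altitude >= 100)

# Only population C passes both tests; D is dropped because its altitude is NA
kept$Pop_Code
```

This is one reason the row count and the NA counts in a summary can both shrink after filtering.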

Check to see if this worked by using the summary function.

Pop_Code Region Collection_Date Latitude
Length:245 Length:245 Length:245 Min. :33.11
Class :character Class :character Class :character 1st Qu.:40.51
Mode :character Mode :character Mode :character Median :41.97
Mean :43.74
3rd Qu.:47.40
Max. :57.02

Longitude Altitude Pop_Size Pct_Canopy_Cover RosCount
Min. :-123.35 Min. : 102.0 Min. : 3 Min. : 0.00 Min. : 0.0
1st Qu.: -84.58 1st Qu.: 191.0 1st Qu.: 30 1st Qu.:60.00 1st Qu.: 12.0
Median : -73.83 Median : 265.7 Median : 100 Median :70.00 Median : 69.0
Mean : -42.88 Mean : 347.0 Mean : 7207 Mean :67.17 Mean : 240.2
3rd Qu.: 14.43 3rd Qu.: 380.0 3rd Qu.: 750 3rd Qu.:83.50 3rd Qu.: 264.0
Max. : 42.02 Max. :1711.5 Max. :745000 Max. :99.00 Max. :6538.0
NA’s :4 NA’s :10
AdultCount RosDens AdultDens TotalDens AvgRosWidth
Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 4.20 Min. : 0.3643
1st Qu.: 23.0 1st Qu.: 2.60 1st Qu.: 5.20 1st Qu.: 19.00 1st Qu.: 4.6422
Median : 54.0 Median : 14.50 Median : 12.00 Median : 40.00 Median : 6.8000
Mean :107.7 Mean : 50.62 Mean : 24.75 Mean : 75.36 Mean : 7.9430
3rd Qu.:122.0 3rd Qu.: 55.20 3rd Qu.: 27.60 3rd Qu.: 88.60 3rd Qu.: 9.5243
Max. :952.0 Max. :1307.60 Max. :190.40 Max. :1357.40 Max. :46.2563
NA’s :26
AvgAdultHeight AvgNLeaves AvgNFruits Herb bio1
Min. : 9.00 Min. : 0.000 Min. : 0.00 Min. :0.0000 Min. : 3.742
1st Qu.: 56.63 1st Qu.: 7.584 1st Qu.: 12.78 1st Qu.:0.1083 1st Qu.: 8.511
Median : 74.44 Median :10.628 Median : 21.60 Median :0.2558 Median : 9.555
Mean : 73.22 Mean :14.524 Mean : 34.05 Mean :0.3417 Mean : 9.601
3rd Qu.: 88.18 3rd Qu.:18.654 3rd Qu.: 48.16 3rd Qu.:0.5359 3rd Qu.:10.611
Max. :148.60 Max. :71.611 Max. :269.71 Max. :1.0000 Max. :16.787
NA’s :16 NA’s :19 NA’s :14 NA’s :22
bio2 bio3 bio4 bio5 bio6
Min. : 6.377 Min. :0.2075 Min. :0.01314 Min. :18.46 Min. :-17.210
1st Qu.: 9.471 1st Qu.:0.2960 1st Qu.:0.02523 1st Qu.:24.50 1st Qu.: -8.874
Median :10.611 Median :0.3159 Median :0.03004 Median :27.07 Median : -6.826
Mean :10.613 Mean :0.3155 Mean :0.02876 Mean :26.87 Mean : -6.967
3rd Qu.:11.881 3rd Qu.:0.3268 3rd Qu.:0.03143 3rd Qu.:29.16 3rd Qu.: -5.242
Max. :15.799 Max. :0.4143 Max. :0.04070 Max. :34.37 Max. : 4.429

  bio7            bio8             bio9             bio10           bio11

Min. :17.16 Min. : 1.989 Min. :-8.9392 Min. :13.74 Min. :-9.6251
1st Qu.:30.16 1st Qu.:15.566 1st Qu.:-1.9976 1st Qu.:17.59 1st Qu.:-2.5187
Median :35.83 Median :17.651 Median :-0.4653 Median :19.74 Median :-0.7929
Mean :33.84 Mean :16.845 Mean : 0.7556 Mean :19.72 Mean :-1.0962
3rd Qu.:37.33 3rd Qu.:19.376 3rd Qu.: 1.1841 3rd Qu.:21.58 3rd Qu.: 0.1578
Max. :45.38 Max. :23.993 Max. :23.7206 Max. :25.85 Max. : 9.0717

 bio12            bio13           bio14           bio15             bio16

Min. : 409.0 Min. :12.00 Min. : 0.00 Min. :0.06064 Min. :141.6
1st Qu.: 752.0 1st Qu.:21.16 1st Qu.: 6.67 1st Qu.:0.13892 1st Qu.:257.0
Median : 959.0 Median :24.87 Median :11.06 Median :0.19990 Median :304.6
Mean : 912.7 Mean :24.06 Mean :11.61 Mean :0.24540 Mean :291.7
3rd Qu.:1101.0 3rd Qu.:26.68 3rd Qu.:15.69 3rd Qu.:0.33071 3rd Qu.:331.4
Max. :1394.0 Max. :42.21 Max. :22.65 Max. :1.04230 Max. :485.7

 bio17              bio18              bio19

Min. : 0.4583 Min. : 0.5635 Min. : 46.8
1st Qu.:104.0951 1st Qu.:234.9915 1st Qu.:117.6
Median :170.9110 Median :289.1817 Median :177.6
Mean :168.5065 Mean :272.6435 Mean :180.7
3rd Qu.:226.3498 3rd Qu.:313.0674 3rd Qu.:231.3
Max. :304.0691 Max. :389.3464 Max. :484.0

This is great, but let's suppose that you are interested in visualizing your data and testing specific scientific predictions using your new skills.

STOP HERE

Great job getting to this point on your own! Now we’re going to spend some time getting accustomed to the visualization library we will use all semester: ggplot2. Start by looking over Chapter 3 of our textbook. Follow along with me as I show some ggplot2 examples using our Garlic Mustard data. If you want, you can enter them in separate codeblocks below. Here’s one to get you started:

ggplot(data = GM_filtered) + geom_point(mapping = aes(x = Longitude, y = Latitude))

Once we’ve gone through the ggplot examples, you’re ready to go through the rest of the lab on your own!

Testing predictions and viewing results

Let’s suppose your boss asks you if A. petiolata conforms to the EICA hypothesis predictions. In particular, she wants to know if plants in North America are truly healthier and bigger than plants in Europe.

To begin, can you plot the distribution of sizes for both North American and European plants?

Visualizing data and comparing values

First, make a subset that only has plants from North America:

NorthAmericaData <- GarlicMustardData %>%
    filter(Region == "NorthAm")

Type the above into the codeblock below, and pay attention to how dplyr’s filter() verb is creating a data subset:

How can you check to see if you only have the data from North America? If the code above ran properly, a new variable called NorthAmericaData should appear in the “Environment” window in the top right quadrant of your RStudio screen. Click on that to see a view of the data.

Did your filter function work?

Now, use the hist function, which takes a vector of numbers as its first argument, to make a histogram (a frequency distribution) of the average adult height of plants in North America. To do this, you need to select the column of data that will be used as the input. In this case it is `AvgAdultHeight`, and you select it with the `$` operator: `hist(NorthAmericaData$AvgAdultHeight)`.

Look at the help file for hist. Can you see how to add additional parameters to your plot (e.g., how to add color to the bars, or make the x-axis label (xlab) read in plain English)? Anything else that looks interesting?

hist(NorthAmericaData$AvgAdultHeight, main = "Average Adult Height Of Garlic Mustard Plants", xlab = "Average Adult Height", ylab = "Number of Specimens", col = "green", border = "black")

Now, let’s try using ggplot2 to make a better histogram of adult height.

To do this, we will need to start by setting up our plotting canvas, and by choosing the correct geom to tell it what kind of plot we want to make. For a histogram, we can use geom_histogram(). For example, to make the same plot we just did using hist(), we could type:

ggplot(NorthAmericaData, aes(AvgAdultHeight)) +
    geom_histogram()

Way better, right!? We will continue to practice and improve our ggplot abilities throughout this course, as it is the modern way of making great data visualizations in R.
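By default geom_histogram() splits the data into 30 bins, and the plot has no title. The sketch below shows one way to set a bin width and add labels with labs(). The data frame toy and its values are made up for illustration, and the bin width of 10 is an arbitrary choice, not a recommendation for the real data.

```r
library(ggplot2)

# Made-up heights, including one missing value
toy <- data.frame(
  AvgAdultHeight = c(42, 55, 58, 63, 67, 70, 71, 76, 84, 90, NA)
)

p <- ggplot(toy, aes(x = AvgAdultHeight)) +
  # binwidth sets how wide each bar is; na.rm = TRUE silences the missing-value warning
  geom_histogram(binwidth = 10, fill = "darkgreen", color = "black", na.rm = TRUE) +
  # labs() adds a title and plain-English axis labels
  labs(
    title = "Toy example: average adult height",
    x = "Average adult height",
    y = "Count"
  )

p
```

Swapping toy for NorthAmericaData or EuropeanData and changing the title would give one labeled histogram per region.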

YOU TRY IT:

Now, try making a histogram of adult height for European plants. You might want to first sketch out the steps needed, the order they come in, and then match them with the commands you have just learned before you jump into the problem. What do you expect your histogram will show?

EuropeanData <- GarlicMustardData %>% filter(Region == "Europe")

ggplot(EuropeanData, aes(AvgAdultHeight)) + geom_histogram()

What is the purpose of showing a histogram of the data?

I expect my histogram to show the distribution of average adult heights of garlic mustard plants in Europe. The purpose of this is to aid in visualization and make it easier to quickly compare average adult heights of plants in North America and Europe.

Worksheet #1: Go back to your histograms above (Europe and North America) and make sure they look correct and readable. Add a title to each graph, or include a caption (by typing just below the codeblock/graph) so it is obvious which one is which.

Worksheet #2: Remember your boss? Oh yeah!… she asked you to evaluate the EICA hypothesis. Can you give her some information based on the histograms? How confident are you in your response? Why or why not? Write your answer below.

Plants in N. America do not appear to be larger on average than plants in Europe. I am very confident of that based on the data. However, the data I have been working with do not show whether they are healthier or not. While size is positively correlated with health, I am less confident that these plants are not healthier than their European equivalents.

Maybe you would feel more confident with a number that summarizes those plant heights, like the mean. Let’s use the mean function, which takes as its argument a vector of numbers.

Try finding the mean plant height (AvgAdultHeight) of plants in North America.

Remember: to select a particular column in a data set, you need to use the $ operator, as you did when you made the histograms.

Worksheet #3: Please write your command to calculate the mean height of plants in North America:

mean(NorthAmericaData$AvgAdultHeight)

Now, run it!

Worksheet #4: What happened? Why do you think this happened?

There are missing values, so the operation returned NA (not available).

Here’s a possible solution when you have missing values:

mean(NorthAmericaData$AvgAdultHeight, na.rm = TRUE)
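To see why na.rm matters, here is a tiny sketch on a made-up vector (the numbers are not from the dataset).

```r
# Made-up heights with one missing measurement
heights <- c(60, 72, NA, 81)

mean(heights)                # NA: one unknown value makes the average unknown
mean(heights, na.rm = TRUE)  # 71: the NA is dropped before averaging
sum(is.na(heights))          # 1: how many values were missing
```

Counting the missing values with is.na is a useful habit, because a mean based on only a few non-missing rows deserves less confidence.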

Now you should probably do the same thing for the plants from Europe.

Worksheet #5: What are the mean Adult Heights for each region?


Europe: 77.53073

North America: 67.11545

**Worksheet #6: So, do the histograms and means support EICA's predictions about plant height? Why or why not? Write your answers below.**

## No, they don't support the prediction that American garlic mustard plants are larger and healthier than their European counterparts, because North American plants are actually smaller on average.


**Worksheet #7: Are there other variables of the plant in this dataset besides height that might measure plant 'size' or 'health'? Why or why not? If you think there are measures that represent these things, please list them below. If not, what would you measure instead? Write your answers below.**

## AvgRosWidth, AvgNFruits and AvgNLeaves. A larger, healthier plant is likely to have a wider rosette, more fruits or more leaves.


**Try testing the difference between the regions by using another variable that indicates plant size or plant health and by using histograms and means:**

> ggplot(EuropeanData, aes(AvgNLeaves)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
Removed 8 rows containing non-finite values (stat_bin). 
> ggplot(NorthAmericaData, aes(AvgNLeaves)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
Removed 26 rows containing non-finite values (stat_bin). 
> mean(NorthAmericaData$AvgNLeaves, na.rm = TRUE)
[1] 12.53682
> mean(EuropeanData$AvgNLeaves, na.rm = TRUE)
[1] 15.62682
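The per-region means above were computed one subset at a time. As an alternative sketch, dplyr's group_by and summarise can compute them in a single step. The data frame toy and its values are made up for illustration; they are not the lab results.

```r
library(dplyr)

# Made-up data with one missing leaf count
toy <- data.frame(
  Region     = c("Europe", "Europe", "Europe", "NorthAm", "NorthAm", "NorthAm"),
  AvgNLeaves = c(14, 18, NA, 10, 12, 17)
)

region_means <- toy %>%
  group_by(Region) %>%
  summarise(
    mean_leaves = mean(AvgNLeaves, na.rm = TRUE),  # mean ignoring missing values
    n_measured  = sum(!is.na(AvgNLeaves))          # how many values went into the mean
  )

region_means
```

Replacing toy with GarlicMustardData would give one row per value of Region, which makes it easy to compare regions without creating a separate subset for each.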


### Visualizing relationships between variables

Your boss also suspects that precipitation, rather than where the plants are found, might affect plant size.

Can you plot the relationship between annual precipitation and plant height for all of the plants in the dataset?

First, you will need to make a plot. A **scatterplot** is a great way to visualize a possible relationship or association between 2 variables.  Like before when we used hist, there is a base R way to make a quick plot, but it isn't very pretty.

plot(data$x, data$y)


Here, the `x variable` is the independent variable and the `y variable` is the dependent variable (the one that the scientists measured on the plant).  `data` refers to the name of the specific data set you are using (you would fill this in with the name of *your* dataset).

Again, we can use ggplot to make a much better plot. The syntax is going to be similar to when we made a histogram, but we will need a different geom to tell R to make a scatterplot. Since scatterplots use points, we can use `geom_point()`. Below is a generic example; you will need to fill in the names of your dataset and the columns you want to plot to make it work.

ggplot(data, aes(x=colname1, y=colname2)) + geom_point()

Worksheet #8. Please write your command to plot the relationship between annual precipitation (hint: look at the bioclim variables) and average adult plant height and produce the plot:

ggplot(GarlicMustardData, aes(x=bio12, y=AvgAdultHeight)) + geom_point()

Worksheet #9: Does there look like there is a relationship? If so, does it make sense to you? Why or why not? Write your answer below:

It looks like there is a weak negative relationship. A regression line drawn through the scatterplot would have a negative slope, but the correlation coefficient is low.
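A judgment about the strength of a relationship can be checked by actually computing the correlation coefficient with base R's cor function. The sketch below uses made-up vectors (precip and height are hypothetical, not the lab data); the use = "complete.obs" argument drops pairs in which either value is missing.

```r
# Made-up precipitation and height values; the last pair has a missing precipitation
precip <- c(500, 700, 900, 1100, 1300, NA)
height <- c(80, 78, 70, 74, 65, 60)

# Pearson correlation using only the complete pairs
r <- cor(precip, height, use = "complete.obs")
r
```

For the real data, the equivalent call would pass GarlicMustardData$bio12 and GarlicMustardData$AvgAdultHeight. A value near 0 would indicate a weak linear relationship, and the sign would indicate its direction.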

Finally, another common plot used to compare groups and look at the spread of data is the boxplot (example below). A boxplot shows the median (the bold middle line in each box) and the 25th and 75th percentiles (also known as the 1st and 3rd quartiles), which form the lower and upper edges of the box. The whiskers extend from the box edges to the smallest and largest values that are no further than 1.5*IQR from the box (IQR is the interquartile range, the distance between the 1st and 3rd quartiles). Data beyond the ends of the whiskers are called “outlier” points and are plotted individually, which makes them easier to see. You can read more and see many examples of boxplots to try here.
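The 1.5*IQR rule described above can be worked through by hand. This is a minimal sketch on a made-up vector (heights is hypothetical), using base R's quantile function to find the box edges and the limits beyond which points count as outliers.

```r
# Made-up heights with one unusually tall value
heights <- c(40, 55, 60, 71, 80, 86, 95, 110, 190)

# Box edges: 1st and 3rd quartiles
q1 <- unname(quantile(heights, 0.25))
q3 <- unname(quantile(heights, 0.75))
iqr <- q3 - q1

# Whiskers cannot reach past these limits
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Values outside the limits would be drawn as individual outlier points
outliers <- heights[heights < lower_limit | heights > upper_limit]
outliers
```

Here the box runs from 60 to 95, so the IQR is 35 and the upper limit is 147.5; the value 190 falls beyond it and is the only outlier.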

YOU TRY IT:

Looking at the help for geom_boxplot online, see if you can figure out how to create a graph like the one below, but using your garlic mustard data. Choose one measure of plant health or growth, and another variable that you think could influence plant health/growth or might lead to differences in measured health/growth. As usual, your dependent variable should be on the y-axis, and your independent variable on the x-axis. For this kind of plot, do you think it matters if the variables you choose are continuous (numbers) or discrete (categories)?

Worksheet #10: Create a boxplot with appropriate variables on the x- and y-axis. Please include a few sentences describing what your dependent and independent variables are, why you chose them, and if you think the plot suggests that there are differences or not. Why?

ggplot(GarlicMustardData, aes(x=bio1, y=AvgNFruits)) + geom_boxplot()

My hypothesis was that plants that grow in an area with higher average annual temperatures would grow more fruits. Honestly I can’t tell anything from the boxplot because the window is not calibrated correctly. However, I’m not sure how to fix it right now.
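One likely reason a boxplot like the one above is hard to read is that geom_boxplot() draws one box per group on the x-axis, and a continuous x variable gives it no groups to separate. The sketch below shows two possible fixes on a made-up data frame (toy, its values, and the bin boundaries are all hypothetical): putting a categorical variable on the x-axis, or cutting the continuous variable into bins first.

```r
library(ggplot2)

# Made-up data standing in for the real dataset
toy <- data.frame(
  Region     = rep(c("Europe", "NorthAm"), each = 5),
  bio1       = c(8.1, 9.4, 10.2, 11.0, 12.3, 6.5, 7.9, 9.1, 10.6, 13.2),
  AvgNFruits = c(12, 20, 25, 31, 48, 9, 15, 22, 27, 40)
)

# Option 1: a categorical variable on the x-axis gives one box per category
p_region <- ggplot(toy, aes(x = Region, y = AvgNFruits)) +
  geom_boxplot()

# Option 2: bin the continuous variable, then plot one box per bin
# (the break points 6, 9, 11, 14 are arbitrary choices for this toy data)
toy$temp_bin <- cut(toy$bio1, breaks = c(6, 9, 11, 14))
p_binned <- ggplot(toy, aes(x = temp_bin, y = AvgNFruits)) +
  geom_boxplot()

p_region
p_binned
```

Either version keeps the dependent variable on the y-axis, as the worksheet asks, while giving the boxplot discrete groups to compare.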

Note: When you are finished coding and want to close RStudio, it may ask if you want to save your R workspace. This is usually not a good idea, because it saves the variables you created in this session and automatically loads them into the next session, which can lead to unintended problems later. It is safest to say “NO”.

DO make sure to save your RMarkdown file and practice using the “Knit” button above as you go. When you’re done, you can knit the HTML file and post both it and the RMarkdown file to your GitHub repo.

Conclusion

It was pretty surprising to find that American garlic mustard plants are actually smaller than the European garlic mustard plants. They also had fewer leaves than the European garlic mustard plants. That was the opposite of what I had expected to see. I think this may be because outside of their natural habitat, overcrowding causes the American garlic mustard plants to have less room to grow. This hypothesis definitely needs more study.