Lab 3: Graphics Report Solutions

  1. For each of the following pairs of graphs, identify features that communicate better on one version than the other.

    1. Survivorship as a function of sex for passengers of the RMS Titanic.

    The graph on the left is a mosaic plot that uses distinct colors for survived and not survived, making the groups easy to tell apart. The graph on the right uses a larger font for its labels, which improves readability. (Answers need not match exactly; accept anything similar.)

    1. Ear length in male humans as a function of age.

    The graph on the left has shorter axis labels, which makes it easier to read. The graph on the right has larger data points, which show the distribution of the data more clearly. (Answers need not match exactly; accept anything similar.)

  1. Let’s use the data from “countries.csv” to practice making some graphs.

    1. Read the data from the file “countries.csv” in the Data folder.
    countries <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\countries.csv", stringsAsFactors = TRUE)
    1. Make sure that you have run library(ggplot2). Why is this necessary for the remainder of this question?
    library(ggplot2)
    ## Warning: package 'ggplot2' was built under R version 4.4.2

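    The ggplot2 package provides ggplot() and the related functions we use to build and customize graphs so that they convey the message we intend. Loading it with library(ggplot2) makes those functions available in the current R session, so it’s necessary before running any of the ggplot commands in the rest of this question.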

    1. Make a histogram to show the frequency distribution of values for measles_immunization_oneyearolds, a numerical variable. (This variable gives the percentage of 1-year-olds that have been vaccinated against measles.) Describe the pattern that you see.
    ggplot(countries, aes(x = measles_immunization_oneyearolds)) + geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    ## Warning: Removed 2 rows containing non-finite outside the scale range
    ## (`stat_bin()`).

    The distribution is left (negatively) skewed: most countries have high immunization percentages, with a tail extending toward lower values.
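
    The `stat_bin()` message above appears because no bin width was chosen. A minimal sketch of setting one explicitly is shown here. It assumes a small made-up data frame (toy_countries, hypothetical) in place of countries.csv, and binwidth = 5 is only an illustrative choice, not a required value.

    ```r
    library(ggplot2)

    # Hypothetical stand-in for the countries data (values are made up).
    toy_countries <- data.frame(
      measles_immunization_oneyearolds = c(99, 98, 97, 96, 95, 93, 92, 90, 85, 78, 64, NA)
    )

    # binwidth = 5 groups values into 5-percentage-point bins (illustrative choice).
    # na.rm = TRUE drops missing values silently instead of with a warning.
    p <- ggplot(toy_countries, aes(x = measles_immunization_oneyearolds)) +
      geom_histogram(binwidth = 5, na.rm = TRUE)
    print(p)
    ```

    Trying a few bin widths is a reasonable way to check that the skew is not an artifact of the binning.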

    1. Make a bar graph to show the numbers of countries in each of the continents. (The categorical variable continent indicates the continent to which countries belong.)
    ggplot(countries, aes(x = continent)) + geom_bar(stat = "count")

    1. Draw a scatterplot that shows the relationship between the two numerical variables life_expectancy_at_birth_male and life_expectancy_at_birth_female. What can you conclude based on the scatterplot?
    ggplot(countries, aes(x = life_expectancy_at_birth_male, y = life_expectancy_at_birth_female)) + geom_point()
    ## Warning: Removed 13 rows containing missing values or values outside the scale range
    ## (`geom_point()`).

    There is a strong positive correlation/association between male life expectancy at birth and female life expectancy at birth.
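
    The strength of an association like this can also be checked numerically with a correlation coefficient. The sketch below assumes a small made-up data frame (toy_life, hypothetical) rather than the real countries data, so the printed value says nothing about the actual data set.

    ```r
    # Hypothetical values standing in for the two life-expectancy columns.
    toy_life <- data.frame(
      male   = c(55, 60, 62, 68, 71, 74, 78, NA),
      female = c(58, 63, 66, 72, 76, 79, 83, 80)
    )

    # use = "complete.obs" drops rows where either value is missing,
    # similar to the rows geom_point() removes with a warning.
    r <- cor(toy_life$male, toy_life$female, use = "complete.obs")
    print(round(r, 3))
    ```

    With the real data, the same call would use countries$life_expectancy_at_birth_male and countries$life_expectancy_at_birth_female.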

  2. The ecological footprint is a widely-used measure, developed at UBC, of the impact a person has on the planet. It measures the area of land (in hectares) required to generate the food, shelter, and other resources used by a typical person and required to dispose of that person’s wastes. Larger values of the ecological footprint indicate that the typical person from that country uses more resources. The countries data set has two variables for many countries showing the ecological footprint of an average person in each country. ecological_footprint_2000 and ecological_footprint_2012 show the ecological footprints for the years 2000 and 2012, respectively.

    1. Plot the relationship between the ecological footprint of 2000 and of 2012.
    ggplot(countries, aes(x = ecological_footprint_2000, y = ecological_footprint_2012)) + geom_point()
    ## Warning: Removed 150 rows containing missing values or values outside the scale range
    ## (`geom_point()`).

    1. Describe the relationship between the footprints for the two years. Does the value of ecological footprint of 2000 seem to predict anything about its value in 2012?

    The plot shows a positive association, but possible outliers keep it from looking strong. I would not say that the ecological footprint in 2000 predicts much about its value in 2012, because of the variation toward the right end of the data.

    1. From this graph, does the ecological footprint tend to go up or down in the years between 2000 and 2012? Did the countries with high or low ecological footprint change the most over this time? (Hint: you can add a one-to-one line to your graph by adding + geom_abline(intercept = 0, slope = 1) to your ggplot command. This will make it easier to see when your points are above or below the line of equivalence.)
    ggplot(countries, aes(x = ecological_footprint_2000, y = ecological_footprint_2012)) + geom_point() + geom_abline(intercept = 0, slope = 1)
    ## Warning: Removed 150 rows containing missing values or values outside the scale range
    ## (`geom_point()`).

    Most points seem to be close to or above the one-to-one line, suggesting that the ecological footprint tends to have either remained similar or increased between 2000 and 2012. The points farther from the one-to-one line (in either direction) represent the countries with larger changes in their ecological footprint. Based on the graph, it seems that countries with a lower ecological footprint in 2000 tend to show more variation, both increasing and decreasing. In contrast, countries with a higher ecological footprint in 2000 tend to show smaller changes over this period, as most of these points remain closer to the one-to-one line.
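
    Reading points against the one-to-one line amounts to looking at the difference between the two years, which can also be computed directly. This is a rough sketch using a made-up data frame (toy_footprint, with hypothetical columns fp_2000 and fp_2012); with the real data the columns would be ecological_footprint_2000 and ecological_footprint_2012.

    ```r
    # Hypothetical footprints (hectares per person) for five made-up countries.
    toy_footprint <- data.frame(
      country = c("A", "B", "C", "D", "E"),
      fp_2000 = c(1.0, 2.5, 4.0, 6.0, 9.0),
      fp_2012 = c(1.4, 2.4, 4.6, 6.1, 8.2)
    )

    # Positive change = point above the one-to-one line (footprint increased).
    toy_footprint$change <- toy_footprint$fp_2012 - toy_footprint$fp_2000

    # Count how many countries went up and how many went down.
    n_up   <- sum(toy_footprint$change > 0)
    n_down <- sum(toy_footprint$change < 0)

    # Order countries by the size of their change, largest first.
    by_change <- toy_footprint[order(abs(toy_footprint$change), decreasing = TRUE), ]
    print(by_change)
    ```

    Sorting by the absolute change makes it easier to see whether the largest changes sit at the low or the high end of the 2000 footprints.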

  3. Use the countries data again. Plot the relationship between continent and female life expectancy at birth. Describe the patterns that you see.

    ggplot(countries, aes(x = continent, y = life_expectancy_at_birth_female)) + geom_boxplot()
    ## Warning: Removed 13 rows containing non-finite outside the scale range
    ## (`stat_boxplot()`).

    This box plot shows the life expectancy at birth for females across different continents. Africa has the lowest median life expectancy with a wide range, indicating significant variability among countries, and it also includes lower outliers. Asia displays a higher median than Africa but with a similarly large range, suggesting varied life expectancies across countries within the continent. Europe and North America have higher and more consistent life expectancies, with narrower interquartile ranges, indicating less variability. Oceania has a similar range and median to Europe and North America but includes a few lower outliers. South America shows a somewhat lower median than Europe and North America, with a moderate range and one outlier. Overall, Europe and North America have the highest and most stable life expectancies, while Africa and Asia show greater variability and generally lower life expectancies. (Answers can be similar, but the plot must be a boxplot.)
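
    The medians and spreads described above can be read off the boxplot, but they can also be tabulated per group. The following sketch assumes a tiny made-up data frame (toy_by_continent, hypothetical values) with the same column names as the countries data.

    ```r
    # Hypothetical data: two continents, three made-up countries each.
    toy_by_continent <- data.frame(
      continent = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
      life_expectancy_at_birth_female = c(52, 61, NA, 80, 83, 84)
    )

    # Median and interquartile range per continent, ignoring missing values.
    medians <- tapply(toy_by_continent$life_expectancy_at_birth_female,
                      toy_by_continent$continent, median, na.rm = TRUE)
    iqrs <- tapply(toy_by_continent$life_expectancy_at_birth_female,
                   toy_by_continent$continent, IQR, na.rm = TRUE)

    print(medians)
    print(iqrs)
    ```

    A table like this is a useful companion to the plot when comparing continents whose boxes overlap.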

  4. Muchhala (2006) measured the length of the tongues of eleven different species of South American bats, as well as the length of their palates (to get an indication of the size of their mouths). All of these bats use their tongues to feed on nectar from flowers. Data from the article are given in the file “BatTongues.csv”. In this file, both Tongue Length and Palate Length are given in millimeters.

    1. Import the data and inspect it using summary(). You can call the data set whatever you like, but in one of the later steps we’ll assume it is called bat_tongues. Each value for tongue length and palate length is a species mean, calculated from a sample of individuals per species.
    bat_tounges <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\BatTongues.csv", stringsAsFactors = TRUE)
    1. Draw a scatter plot to show the association between palate_length and tongue_length, with tongue_length as the response variable. Describe the association: is it positive or negative? Is it strong or weak?
    ggplot(bat_tounges, aes(x = palate_length, y = tongue_length)) + geom_point()

    The scatter plot displays the relationship between palate length (x-axis) and tongue length (y-axis). The association is positive but weak: there is a slight tendency for longer palates to correspond with longer tongues, but the points are widely scattered and show no clear, consistent trend.

    1. All of the data points that went into this graph have been double checked and verified. With that in mind, what conclusion can you draw from the outlier on the scatter plot?

    In nature, biological variation is bound to exist. The individual or species represented by this data point might naturally have a disproportionately long tongue relative to palate length, highlighting an exceptional case.

    1. Let’s figure out which species is the outlier. To do this, we’ll use a useful function called filter() from the package dplyr. Use library to load dplyr to your R session.
    library(dplyr)
    ## 
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    1. The function filter() gives us the row (or rows) of a data frame that has a certain property. Looking at the graph, we can tell that the point we are interested in has a very long tongue length, at least over 80 mm long! The following command will pull out the rows of the data frame bat_tongues that have tongue length greater than 80 mm: filter(bat_tongues, tongue_length > 80).
    filter(bat_tounges, tongue_length > 80)
    ##            species palate_length tongue_length
    ## 1 Anoura fistulata          12.4          85.2

    The species that is an outlier is Anoura fistulata.
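
    The same row can be found without dplyr and without reading a threshold off the graph. This is a sketch in base R, assuming a made-up data frame (toy_bats); only the Anoura fistulata row matches the output above, and the other two rows are hypothetical placeholders.

    ```r
    # Two hypothetical species plus the one row shown in the output above.
    toy_bats <- data.frame(
      species       = c("Hypothetical sp. 1", "Hypothetical sp. 2", "Anoura fistulata"),
      palate_length = c(10.0, 13.0, 12.4),
      tongue_length = c(30.0, 40.0, 85.2)
    )

    # Row with the longest tongue, with no hard-coded cutoff.
    longest <- toy_bats[which.max(toy_bats$tongue_length), ]

    # Base R counterpart of filter(toy_bats, tongue_length > 80).
    over_80 <- subset(toy_bats, tongue_length > 80)

    print(longest)
    print(over_80)
    ```

    which.max() returns only the first maximum, so subset() or filter() is the better choice when several rows might exceed the cutoff.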

  5. Import the data set collected on your class from the first lab. (We’ll return to some of the other variables later in the term.)

    student_data_lab1 <- read.csv("C:\\BI412L\\Lab 3 Graphics\\BI412L Student Data.csv", stringsAsFactors = TRUE)
    1. Plot the relationship between the gender and height. Do you detect any outliers? Do all the data look as though the same units were used when students recorded them?
    ggplot(student_data_lab1, aes(x = Sex, y = Height_cm)) + geom_boxplot()

    Based on the boxplot alone, there seem to be no outliers (no individual points are drawn beyond the whiskers), and it appears that everyone used the same units, since all values on the y-axis fall within a reasonable range.
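
    Whether a boxplot would flag any points can also be checked numerically. The sketch below uses base R’s boxplot.stats(), which applies a 1.5 × IQR rule much like geom_boxplot() (the two compute the hinges slightly differently, so borderline points may not agree). The heights are made up, and one value is deliberately entered as if it were recorded in inches to show what a unit mix-up would look like.

    ```r
    # Hypothetical heights in cm; the last value mimics a height recorded in inches.
    toy_heights <- c(158, 162, 165, 168, 170, 172, 175, 180, 66)

    # $out holds the values lying beyond 1.5 x IQR from the hinges.
    flagged <- boxplot.stats(toy_heights)$out
    print(flagged)
    ```

    An empty result from the real Height_cm column would agree with the reading of the boxplot above.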

    1. Draw a histogram of the distribution of head circumference.
    ggplot(student_data_lab1, aes(x = Head_Circumference_cm)) + geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    Note: Students may have picked a different binwidth, and that is okay; you may mark them correct.

  6. Pick one of the plots you made using R today. What could be improved about this graph to make it a more effective presentation of the data?

    Answers may vary. As long as the answer seems reasonable (i.e., the student has justified it), you may mark it correct.