Lab 5: Graphics Report Solutions

You may use ggplot() or R base functions to create visualizations (not restricted to one way).

  1. You were provided a data file that includes data that we collected as a class on information about ourselves during the Lab 1 report. The data file asked for the measurements in centimeters. Open that file in R. The file in Moodle was titled “BI412L Student Data”.

    library(ggplot2)
    ## Warning: package 'ggplot2' was built under R version 4.4.2
    StudentData <- read.csv("C:\\BI412L\\Lab 5 Describing Data\\BI412L Student Data.csv", stringsAsFactors = TRUE)

    a. Plot the distribution of heights in the class. Describe the shape of the distribution. Is it symmetric or strongly skewed? Is it unimodal or bimodal? Key point: A distribution is skewed if it is asymmetric. A distribution is skewed right if there is a long tail to the right, and skewed left if there is a long tail to the left.

    ggplot(StudentData, aes(x = Height_cm)) + geom_histogram(binwidth = 4.5)

    hist(StudentData$Height_cm, main = "Histogram of Height_cm", xlab = "Height (cm)")

    The histogram of Height_cm generated by the hist() function appears bimodal, with one peak from 150 to 155 cm and a second from 160 to 165 cm.
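
    Whether a histogram looks bimodal can depend on the bin width, so it helps to check more than one. The sketch below uses made-up heights (the vector is hypothetical, not the class data) and hist(..., plot = FALSE) to return bin counts without drawing anything:

```r
# Hypothetical heights in cm, chosen for illustration only.
heights <- c(151, 152, 153, 154, 154, 158, 161, 162, 163, 164, 164, 170)

# Narrow bins (5 cm wide): two separate peaks are visible in the counts.
narrow <- hist(heights, breaks = seq(150, 170, by = 5), plot = FALSE)

# Wide bins (10 cm wide): the same data look flat.
wide <- hist(heights, breaks = seq(150, 170, by = 10), plot = FALSE)

narrow$counts
wide$counts
```

    With these values the 5 cm bins give counts of 5, 1, 5, 1, while the 10 cm bins give 6 and 6, so the two peaks disappear when the bins are too wide.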

    b. Are there any large outliers that look as though a student used the wrong units for their height measurement? (I.e., are there any that are more plausibly a height given in inches rather than the requested centimeters?) If so, and if this is not likely to be an accurate description of an individual in your class, use filter() from the package dplyr to create a new data set without those rows.

    You may make a boxplot to see the distribution and any outliers more clearly. (This is optional; if you can tell from the histogram whether outliers are present, that is sufficient.)

    ggplot(StudentData, aes(x = 1 , y = Height_cm)) + geom_boxplot()

    There appear to be no outliers, so we can proceed with the same data.
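
    If an outlier had been present, the filtering step could look something like the sketch below. It uses a made-up data frame (toy is hypothetical, and the value 65 stands in for a height entered in inches) and base R instead of dplyr so that it runs without extra packages. boxplot.stats() applies the 1.5 × IQR rule used by boxplot(); geom_boxplot() uses a very similar rule.

```r
# Hypothetical heights (cm); 65 mimics a height reported in inches.
toy <- data.frame(
  Height_cm = c(150, 152, 155, 158, 160, 162, 163, 165, 168, 172, 65)
)

# Values lying more than 1.5 * IQR beyond the hinges are flagged.
flagged <- boxplot.stats(toy$Height_cm)$out

# Keep only the rows that were not flagged.
# With dplyr loaded, the equivalent would be:
#   filter(toy, !(Height_cm %in% flagged))
toy_filtered <- toy[!(toy$Height_cm %in% flagged), , drop = FALSE]

flagged
nrow(toy_filtered)
```

    A flagged value should still be checked by eye before it is removed, since the rule only identifies unusual values and cannot tell whether they are errors.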

    c. Use R to calculate the mean height of all students in the class, using the filtered data.

    mean(StudentData$Height_cm)
    ## [1] 162.1837

    The average height of recorded students in class is about 162.18 cm.

    d. Use sd() to calculate the standard deviation of height, using the filtered data.

    sd(StudentData$Height_cm)
    ## [1] 7.99469

    The standard deviation of the recorded students’ height in class is about 7.99 cm.

  2. The file “caffeine.csv” contains data on the amount of caffeine in a 16 oz. cup of coffee obtained from various vendors. For context, doses of caffeine over 25 mg are enough to increase anxiety in some people, and doses over 300 to 360 mg are enough to significantly increase heart rate in most people. A can of Red Bull contains 80 mg of caffeine.

    caffeine <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\caffeine.csv", stringsAsFactors = TRUE)

    a. What is the mean amount of caffeine in 16 oz. coffees?

    mean(caffeine$caffeine_mg_16oz)
    ## [1] 188.0643

    The mean amount of caffeine in 16 oz. coffees is 188.0643 mg.

    b. What is the 95% confidence interval for the mean?

    t.test(caffeine$caffeine_mg_16oz)$conf.int
    ## [1] 167.5237 208.6049
    ## attr(,"conf.level")
    ## [1] 0.95

    The 95% confidence interval for the mean is (167.5237, 208.6049) mg.
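
    For reference, the interval reported by t.test() is the sample mean plus or minus a critical t value times the standard error. The sketch below rebuilds it by hand on a made-up vector (x is hypothetical, not the caffeine data) and checks that the two agree:

```r
# Hypothetical caffeine measurements (mg), for illustration only.
x <- c(150, 172, 188, 195, 201, 210, 165, 224)

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for a 95% interval

# Mean plus or minus the margin of error.
manual_ci  <- mean(x) + c(-1, 1) * t_crit * se
builtin_ci <- as.numeric(t.test(x)$conf.int)

manual_ci
all.equal(manual_ci, builtin_ci)
```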

    c. Plot the frequency distribution of caffeine levels for these data in a histogram. Is the amount of caffeine in a cup of coffee relatively consistent from one vendor to another? What is the standard deviation of caffeine level? What is the coefficient of variation?

    hist(caffeine$caffeine_mg_16oz, main = "Histogram of caffeine_mg_16oz", xlab = "Caffeine (mg per 16 oz)")

    Because the histogram is spread over a fairly wide range of values rather than concentrated in a narrow band, the amount of caffeine in a cup of coffee is not very consistent from one vendor to another.

    sd(caffeine$caffeine_mg_16oz)
    ## [1] 35.57535
    (sd(caffeine$caffeine_mg_16oz, na.rm = TRUE)/mean(caffeine$caffeine_mg_16oz, na.rm = TRUE))*100
    ## [1] 18.91659

    The standard deviation of caffeine level is 35.58 mg. The coefficient of variation is 18.92%, suggesting that variability is present but not extreme.
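
    The coefficient of variation is the standard deviation expressed as a percentage of the mean. As a small sketch, it can be wrapped in a helper function (cv_percent is a name chosen here for illustration, not a built-in) and checked against the values reported above:

```r
# Coefficient of variation as a percentage: 100 * sd / mean.
cv_percent <- function(x) {
  100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Simple check on a small vector: sd = 10, mean = 20.
cv_toy <- cv_percent(c(10, 20, 30))

# Same calculation from the sd and mean reported for the caffeine data.
cv_reported <- 100 * 35.57535 / 188.0643

cv_toy
cv_reported
```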

    d. The file “caffeineStarbucks.csv” has data on six 16 oz. cups of Breakfast Blend coffee sampled on six different days from a Starbucks location. Calculate the mean (and the 95% confidence interval for the mean) for these data. Compare these results to the data taken on the broader sample of vendors in the first file. Describe the difference.

    caffeinesb <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\caffeineStarbucks.csv", stringsAsFactors = TRUE)
    
    mean(caffeinesb$caffeine_mg_16oz)
    ## [1] 371.9667
    t.test(caffeinesb$caffeine_mg_16oz)$conf.int
    ## [1] 239.3527 504.5806
    ## attr(,"conf.level")
    ## [1] 0.95

    The mean is 371.97 mg with a 95% confidence interval of (239.35, 504.58) mg. Compared with the broader sample of vendors in the previous file, the mean is much higher and the confidence interval is much wider. The wider interval reflects a less precise estimate of the mean, which is expected when the sample contains only six cups.
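
    The width of the Starbucks interval also says something about day-to-day variability. Assuming the interval is the usual one-sample t interval and that n = 6 (the six cups stated in the question), the standard deviation can be backed out from the reported limits, as in this rough sketch:

```r
# Reported 95% confidence limits for the Starbucks sample.
ci_sb <- c(239.3527, 504.5806)
n_sb  <- 6                         # six cups, as stated in the question

# Half-width = t_crit * sd / sqrt(n), so sd = half-width * sqrt(n) / t_crit.
half_width <- diff(ci_sb) / 2
t_crit     <- qt(0.975, df = n_sb - 1)
sd_implied <- half_width * sqrt(n_sb) / t_crit

half_width
sd_implied
```

    The implied standard deviation is roughly 126 mg, much larger than the 35.58 mg seen across vendors, so both the small sample and the large spread between days contribute to the wide interval.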

  3. A confidence interval is a range of values that are likely to contain the true value of a parameter. Consider the “caffeine.csv” data again.

    caffeine <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\caffeine.csv", stringsAsFactors = TRUE)

    a. Calculate the 99% confidence interval for the mean caffeine level.

    t.test(caffeine$caffeine_mg_16oz, conf.level = 0.99)$conf.int
    ## [1] 159.4238 216.7047
    ## attr(,"conf.level")
    ## [1] 0.99

    b. Compare this 99% confidence interval to the 95% confidence interval you calculated in question 2b. Which confidence interval is wider (i.e., spans a broader range)? Why should this one be wider?

    t.test(caffeine$caffeine_mg_16oz)$conf.int
    ## [1] 167.5237 208.6049
    ## attr(,"conf.level")
    ## [1] 0.95

    The 99% CI (159.42 to 216.70) is wider than the 95% CI (167.52 to 208.60). To be more confident that the interval contains the true mean, we have to allow a broader range of plausible values. A 95% CI accepts a 5% chance of missing the true mean, while a 99% CI accepts only a 1% chance, so it must extend further in both directions.
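
    The comparison can be made concrete with the limits printed above. This short sketch computes the width and the midpoint of each interval:

```r
# Confidence limits reported above for the caffeine data.
ci95 <- c(167.5237, 208.6049)
ci99 <- c(159.4238, 216.7047)

# Width of each interval.
width95 <- diff(ci95)
width99 <- diff(ci99)

# Both intervals are centred on the same sample mean.
mid95 <- mean(ci95)
mid99 <- mean(ci99)

c(width95 = width95, width99 = width99)
c(mid95 = mid95, mid99 = mid99)
```

    Both midpoints equal the sample mean of about 188.06 mg, so only the margin of error changes with the confidence level.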

    c. Let’s compare the quantiles of the distribution of caffeine to this confidence interval. Approximately 95% of the data values should fall between the 2.5% and 97.5% quantiles of the distribution of caffeine. (Explain why this is true.) We can use R to calculate the 2.5% and 97.5% quantiles with a command like the following. (Replace “datavector” with the name of the vector of your caffeine data.)

    Example R Code: quantile(datavector, c(0.025, 0.975), na.rm = TRUE)

    quantile(caffeine$caffeine_mg_16oz, c(0.025, 0.975), na.rm = TRUE)
    ##    2.5%   97.5% 
    ## 144.765 254.685

    Are these the same as the boundaries of the 95% confidence interval? If not, why not? Which should bound a smaller region, the quantile or the confidence interval of the mean?

    No, they are not the same. The quantiles describe the spread of individual data values: by definition, 2.5% of the values lie below the 2.5% quantile and 2.5% lie above the 97.5% quantile, so about 95% fall between them. The confidence interval instead describes the uncertainty in the estimate of the population mean, and a sample mean varies less than individual data points do. The confidence interval of the mean should therefore bound the smaller region, as it does here: (167.5237, 208.6049) versus (144.765, 254.685).
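
    The same contrast shows up in simulated data. The sketch below draws a made-up sample (the sample size, mean, and standard deviation are assumptions chosen only to resemble the caffeine data) and compares the two ranges:

```r
set.seed(1)

# Hypothetical sample: 50 draws from a normal distribution.
x <- rnorm(50, mean = 188, sd = 35)

# Range containing about 95% of the individual values.
q <- quantile(x, c(0.025, 0.975))

# Range of plausible values for the population mean.
ci <- as.numeric(t.test(x)$conf.int)

quantile_width <- unname(diff(q))
ci_width       <- diff(ci)

c(quantile_width = quantile_width, ci_width = ci_width)
```

    The quantile range stays about the same as the sample grows, while the confidence interval keeps shrinking because the standard error is divided by the square root of the sample size.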

  4. Return to the class data set “studentSampleData.csv”. Find the mean value of “number of siblings.” Add one to this to find the mean number of children per family in the class.

    StudentData <- read.csv("C:\\BI412L\\Lab 5 Describing Data\\BI412L Student Data.csv", stringsAsFactors = TRUE)
    mean(StudentData$Siblings)
    ## [1] 2.354839
    1 + mean(StudentData$Siblings)
    ## [1] 3.354839

    a. The mean number of offspring per family twenty years ago was about 2. Is the value for this class similar, greater, or smaller? If different, think of reasons for the difference.

    The value for this class (about 3.35) is greater by about 1.35, which could possibly be attributed to a combination of social, economic, cultural, and biological factors. One possible contributor is improved healthcare (lower infant mortality and better maternal health).

    b. Are the families represented in this class systematically different from the population at large? Is there a potential sampling bias?

    Yes, the sample could well be biased because this data set represents only a single region and group (students enrolled in the BI412L class in the FA24 semester on Guam).

    c. Consider the way in which the data were collected. How many families with zero children are represented? Why? What effect does this have on the estimated mean family size of all couples?

    Because we collected data from students in the class, every family in the sample has at least one child (the student), so no families with zero children are represented.

    The estimated mean is therefore biased toward families with at least one child. If families with zero children were included, the mean number of children would decrease, moving closer to the mean of about 2 offspring per family from twenty years ago.
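
    A small made-up example shows the direction of this bias. The family sizes below are hypothetical and are used only to illustrate what happens when childless families cannot appear in the sample:

```r
# Hypothetical number of children per family, including childless families.
children_all <- c(0, 0, 1, 2, 2, 3)

# Families that can be reached by surveying students: at least one child.
children_sampled <- children_all[children_all > 0]

mean_all     <- mean(children_all)
mean_sampled <- mean(children_sampled)

c(mean_all = mean_all, mean_sampled = mean_sampled)
```

    Dropping the zeros raises the mean from about 1.33 to 2 in this toy case, so a sample of students overestimates the mean family size of all couples.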

  5. Return to the data on countries of the world, in “countries.csv”. Plot the distributions for ecological footprint 2000, cell phone subscriptions per 100 people 2012, and life expectancy at birth female.

    countries <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\countries.csv", stringsAsFactors = TRUE)

    a. For each variable, plot a histogram of the distribution. Is the variable skewed? If so, in which direction?

    hist(countries$ecological_footprint_2000)

    ggplot(countries, aes(x = ecological_footprint_2000)) + geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    ## Warning: Removed 58 rows containing non-finite outside the scale range
    ## (`stat_bin()`).

    hist(countries$cell_phone_subscriptions_per_100_people_2012)

    ggplot(countries, aes(x = cell_phone_subscriptions_per_100_people_2012)) + geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    ## Warning: Removed 10 rows containing non-finite outside the scale range
    ## (`stat_bin()`).

    hist(countries$life_expectancy_at_birth_female)

    ggplot(countries, aes(x = life_expectancy_at_birth_female)) + geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    ## Warning: Removed 13 rows containing non-finite outside the scale range
    ## (`stat_bin()`).

    Ecological footprint 2000 has a right-skewed distribution, cell phone subscriptions per 100 people 2012 has a roughly symmetric (approximately normal) distribution, and life expectancy at birth female has a generally left-skewed distribution.

    b. For each variable, calculate the mean and median. Are they similar? Match the difference in mean and median to the direction of skew on the histogram. Do you see a pattern?

    mean(countries$ecological_footprint_2000, na.rm = TRUE)
    ## [1] 3.147391
    median(countries$ecological_footprint_2000, na.rm = TRUE)
    ## [1] 2.14
    mean(countries$cell_phone_subscriptions_per_100_people_2012, na.rm = TRUE)
    ## [1] 99.90419
    median(countries$cell_phone_subscriptions_per_100_people_2012, na.rm = TRUE)
    ## [1] 103.25
    mean(countries$life_expectancy_at_birth_female, na.rm = TRUE)
    ## [1] 73.4153
    median(countries$life_expectancy_at_birth_female, na.rm = TRUE)
    ## [1] 75.9

    Ecological footprint 2000 is right-skewed, so it makes sense that the mean (3.15) is greater than the median (2.14). Cell phone subscriptions per 100 people 2012 is roughly symmetric, so the mean (99.90) and the median (103.25) are close to each other rather than far apart. Life expectancy at birth female is generally left-skewed, so it makes sense that the mean (73.42) is less than the median (75.90). The pattern is that the mean is pulled toward the long tail of the distribution, while the median is less affected by extreme values.
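
    The pattern can be summarized by taking the difference between mean and median for each variable. This sketch uses the values printed above (the short vector names are labels chosen here for readability):

```r
# Means and medians as reported above.
means <- c(footprint = 3.147391, cell_phones = 99.90419, life_exp_female = 73.4153)
medians <- c(footprint = 2.14, cell_phones = 103.25, life_exp_female = 75.9)

# Positive difference: mean pulled right. Negative: mean pulled left.
differences <- means - medians

direction <- ifelse(differences > 0,
                    "mean > median (long right tail)",
                    "mean < median (long left tail)")

differences
direction
```

    The sign of the difference matches the skew seen in the histograms for ecological footprint and life expectancy. For cell phone subscriptions the difference is negative but small relative to the scale of the variable, which fits a roughly symmetric distribution.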