ENVS 193DS - Homework 2

Author

Eliana Shandalov

Published

April 14, 2025

Set up

Problem 1. Burrowing owl abundance

a.

The data is discrete because all of the values are whole numbers, and each number represents the count of total individuals present.

b.

owl_counts <- c(0, 2, 0, 3, 1, 4, 1, 2) # store values of numbers of owls observed weekly as an object called 'owl_counts'

sd <- sd(owl_counts)
# calculate the standard deviation of 'owl_counts'

mean(owl_counts) # calculate the mean of the owl counts
[1] 1.625

The standard deviation is a better measure of variability for this data set because it describes the spread of the weekly counts themselves rather than the precision of the estimated mean. The standard deviation is 1.41 owls (the mean is shown above).

c.

n <- length(owl_counts)
# count up number of data points in 'owl_counts' and store it as an object called 'n'

se = sd/sqrt(n)
# find the standard error by using the formula, appears in environment

Standard error is a better measure of uncertainty for this sample because it measures how precisely the sample mean estimates the true mean, whereas standard deviation describes the spread of the individual observations. The standard error is 0.50 owls.
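As a quick check, the values reported in parts b and c can be recomputed in one place. This is only a sketch: the object names (owl_mean, owl_sd, owl_n, owl_se) are illustrative and are not the names used in the code above.

```r
# Recompute the Problem 1 summary statistics from the raw weekly counts
owl_counts <- c(0, 2, 0, 3, 1, 4, 1, 2)

owl_mean <- mean(owl_counts)       # sample mean
owl_sd <- sd(owl_counts)           # sample standard deviation (n - 1 denominator)
owl_n <- length(owl_counts)        # number of weekly observations
owl_se <- owl_sd / sqrt(owl_n)     # standard error of the mean

# Show all three values rounded to two decimal places
round(c(mean = owl_mean, sd = owl_sd, se = owl_se), 2)
```

Using a separate name such as owl_sd avoids reusing the name of the built-in sd() function for the stored value.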

Problem 2. Fire and particulate matter

a.

sbpm <- sbpm |> # store sbpm as new object
  mutate(row_number = row_number()) # number each row in the data set 'sbpm'

ggplot(data = sbpm, # use sbpm data frame
       aes(x = date, # use information from 'date' column on the x-axis
           y = pm2_5, # use information from 'pm2_5' column on the y-axis
           color = local_site_name)) + # color each line by sensor site name
  geom_line() + # create line graph
  labs(x = "Date", # label x-axis 'Date'
       y = "PM2.5", # label y-axis 'PM2.5'
       color = "Sensor Site Locations") + # change name of legend to 'Sensor Site Locations'
  theme(legend.background = element_rect(color = "black", linewidth = 1, fill = "pink")) # put a black border around the legend and fill the legend box with pink

b.

gol_sb <- sbpm %>% # filter sbpm and store the result as an object called 'gol_sb'
  filter(local_site_name %in% c("Goleta", "Santa Barbara")) # filter the data set column 'local_site_name' to only include Goleta and Santa Barbara data points

# new object appears in environment

c.

From the start of the fire until when it was contained, there was a difference in PM2.5 between Goleta and Santa Barbara.

d. 

ggplot(data = gol_sb, # make a plot using data from 'gol_sb'
       aes(x = date, # put information from 'date' column on x-axis
           y = pm2_5, # put information from 'pm2_5' column on y-axis
           color = local_site_name)) + # color by Goleta or Santa Barbara using information from 'local_site_name' column
  geom_boxplot() + # create a boxplot
  geom_jitter(position = position_jitter(width = 0.2, # jitter the points horizontally with width 0.2
                                         height = 0)) + # height of 0 keeps points from moving on the y-axis
  labs(x = "Date", # label the x-axis 'Date'
       y = "PM2.5", # label the y-axis 'PM2.5'
       color = "Site Location") + # name the color legend 'Site Location'
  scale_color_manual(values = c("Goleta" = "deeppink", "Santa Barbara" = "royalblue")) + # change the default colors, Goleta will be pink and Santa Barbara will be blue
  theme_minimal() + # change default theme to 'minimal'
  theme(legend.position = "none") # remove the legend

e.

ggplot(data = gol_sb, # use data from gol_sb
       aes(sample = pm2_5)) + # use pm2_5 as sample
  geom_qq() + # make a qq plot
  facet_wrap(~ local_site_name) # make two panels to show sensor sites separately

f.

The data do not look normally distributed in the QQ plots: the points follow a strong curve instead of the straight line expected under a normal distribution, which suggests that the data are skewed. Each plot also shows signs of heavy tails (kurtosis), because one point on the Goleta plot and two points on the Santa Barbara plot sit far away from the rest of the points.

g.

# doing F test to compare if variances are equal
var.test(pm2_5 ~ local_site_name, # compare variances, response variable ~ grouping variable
         data = gol_sb) # use data from gol_sb

    F test to compare two variances

data:  pm2_5 by local_site_name
F = 0.42044, num df = 24, denom df = 27, p-value = 0.03524
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1915806 0.9401427
sample estimates:
ratio of variances 
         0.4204419 

The group variances were not equal (ratio of variances = 0.42, F(24, 27) = 0.42, p = 0.04).
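The reported p-value can be cross-checked from the F statistic and its degrees of freedom alone. The sketch below copies the numbers from the output above and assumes the usual two-sided rule for this test (twice the smaller tail area of the F distribution); the object names are illustrative.

```r
# Values copied from the var.test() output above
f_stat <- 0.42044   # ratio of variances (Goleta / Santa Barbara)
df_num <- 24        # numerator degrees of freedom (25 Goleta observations - 1)
df_den <- 27        # denominator degrees of freedom (28 Santa Barbara observations - 1)

# Two-sided p-value: twice the smaller tail probability of the F distribution
lower_tail <- pf(f_stat, df_num, df_den)
p_f <- 2 * min(lower_tail, 1 - lower_tail)

round(p_f, 4) # should land close to the p-value printed by var.test()
```

Because the p-value is below 0.05, the equal-variance assumption is rejected, which is why the unequal-variance (Welch) version of the t-test is the one to use in part h.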

h.

# doing a two sample t-test, Welch's two sample t-test for unequal variances
t.test(pm2_5 ~ local_site_name, # response variable ~ grouping variable
       data = gol_sb) # data will be used from 'gol_sb' data frame

    Welch Two Sample t-test

data:  pm2_5 by local_site_name
t = -1.4641, df = 46.752, p-value = 0.1499
alternative hypothesis: true difference in means between group Goleta and group Santa Barbara is not equal to 0
95 percent confidence interval:
 -42.158763   6.645906
sample estimates:
       mean in group Goleta mean in group Santa Barbara 
                   31.94000                    49.69643 

i.

Using a t-test is appropriate because we are comparing a continuous variable between two groups, and the groups don’t depend on each other. I evaluated normality by looking at the QQ plots I created, and homogeneity of variance from the box plot. The variable was not normally distributed, but I justified a t-test using the Central Limit Theorem because the samples are reasonably large (25 and 28 observations, 53 in total). Because the F test indicated unequal variances, I used Welch’s version of the test.

j.

I ran a Welch two-sample t-test, which included 28 observations from Santa Barbara and 25 observations from Goleta. The significance level was 0.05 (the test was run with a 95% confidence interval), the degrees of freedom were 46.8, the test statistic was -1.46, and the p-value was 0.15.
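The p-value and confidence interval in the t-test output can be reconstructed from the test statistic, the degrees of freedom, and the two group means. This is a rough sketch that copies the rounded numbers from the output in part h; the object names are illustrative, and the standard error is back-calculated from the test statistic rather than computed from the raw data.

```r
# Values copied from the Welch t-test output in part h
t_stat <- -1.4641        # test statistic
df_welch <- 46.752       # Welch-adjusted degrees of freedom
mean_goleta <- 31.94000  # mean PM2.5 in Goleta
mean_sb <- 49.69643      # mean PM2.5 in Santa Barbara

# Two-sided p-value from the t distribution
p_t <- 2 * pt(-abs(t_stat), df_welch)

# Difference in means, and the standard error implied by t = difference / SE
diff_means <- mean_goleta - mean_sb
se_diff <- diff_means / t_stat

# 95% confidence interval for the difference in means
t_crit <- qt(0.975, df_welch)
ci_diff <- diff_means + c(-1, 1) * t_crit * se_diff

round(p_t, 4)
round(ci_diff, 2)
```

The interval includes 0, which agrees with a p-value above the 0.05 significance level.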

Problem 3. Personal data

my_data_clean <- my_data |> # create new object for clean data
  clean_names() # change all names into lowercase and replace spaces with underscores

a.

ggplot(data = my_data_clean, # use 'my_data_clean' data frame
       aes(x = on_the_phone, # x-axis will be information from 'on_the_phone' column
           y = time_it_takes_me_to_walk_to_tunnel_min, # y-axis will be information from 'time_it_takes_me_to_walk_to_tunnel_min' column
           color = on_the_phone)) + # color will be based on 'on_the_phone' column (yes/no)
  geom_boxplot() + # create a box plot
  geom_jitter(position = position_jitter(width = 0.2, # add jittered points, with horizontal jitter width 0.2
                                         height = 0)) + # height of 0 keeps points from moving on the y-axis
  labs(x = "On the Phone (yes/no)", # name the x-axis
       y = "Time To Walk To Tunnel (min)", # name the y-axis
       color = "On the Phone") # name the legend

b.

ggplot(data = my_data_clean, # use 'my_data_clean' data frame
       aes(x = time_before_class_i_leave_min, # x-axis will be information from 'time_before_class_i_leave_min' column
           y = time_it_takes_me_to_walk_to_tunnel_min, # y-axis will be information from 'time_it_takes_me_to_walk_to_tunnel_min' column
           color = time_before_class_i_leave_min)) + # color based on 'time_before_class_i_leave_min' column: points with the same number of minutes before class share a color, lighter colors represent more time and darker colors represent less time
  geom_point(size = 2) + # create scatter plot with point size 2
  labs(x = "Time Before Class I Leave (min)", # name x-axis
       y = "Time To Walk To Tunnel (min)", # name the y-axis
       color = "Time Before Class I Leave (min)") # name the legend corresponding to color

c. 

From the box plot of On The Phone vs. Time To Walk To Tunnel, I can see that the median time it takes me to walk to the tunnel is higher when I am on the phone. There is a smaller range of values for times when I am on the phone, but that may be because I have repeating numbers. The Time Before Class I Leave vs. Time To Walk To Tunnel plot shows that when I leave with more time before class, I usually take longer to walk to class. I think the figures will change as I collect more data, because I will either have more variation in the data points I enter or more consistent patterns will appear. A larger sample size gives a more accurate indication of whether the factors are actually related.

d. 

I did not encounter any challenges when transferring my data from Excel into RStudio. I double-checked all of my data points, and all of the factors I need are in my table. My only issue, if I had to identify one, is that the column headings in my table are fairly long, so they were annoying to deal with once my data was in R.

Problem 4. Statistical critique

a.

I am interested in this paper because throughout the environmental studies program, I have learned about disease spillover from animals to humans, and how bats are disease vectors; as humans come in contact more and more with the natural world due to forest fragmentation and other areas of crossover, we are more susceptible to contamination. I hope to go into public health and/or epidemiology in the future, and I take any chance I can to research anything bat and disease related.

b.

The authors are looking at how virus coinfection (where a bat is infected with more than one pathogen at once) impacts the shedding of the Marburg virus in terms of spillover potential by comparing different modes of transmission from bats. The hypothesis was that coinfection impacts virus shedding and can increase/decrease spillover risk.

c. 

The statistical tests used were a t-test and a Mann-Whitney U test. The response variable is the shedding of the Marburg virus, and the predictor variable is coinfection status: whether the bats are infected with only the Marburg virus (MARV), both Sosuga virus (SOSV) and MARV, or both Kasokero virus (KASV) and MARV.

d. 

The t-test (for disease shedding through saliva) was used to compare the group infected with only MARV and the group coinfected with SOSV and MARV. The results showed that the time during which the bats could shed MARV was much shorter when they were also infected with SOSV. The opposite was true for the group coinfected with KASV and MARV, which was also compared against the MARV-only group (meaning KASV makes MARV a higher risk and can increase spillover potential). The p-values were both less than 0.05, meaning that the results were statistically significant and supported the hypothesis. The Mann-Whitney U tests (for disease shedding through feces) did not have significant results. Because this test does not assume that the distribution is normal, it was a good test to use in addition to the t-test.

e.

Figure 3B and 3D

f. 

The x-axes of Figures 3B and 3D have a category for bats that were coinfected (SOSV + MARV in 3B or KASV + MARV in 3D), bats infected with only MARV, and the negative control. The y-axis shows the prevalence of the anti-MARV nucleoprotein IgG. Each figure on its own shows the difference in response between coinfection and MARV-only infection (with the negative control group as a baseline), and comparing the two figures shows how the two coinfecting viruses differ in their effect relative to MARV alone.