ENV221 23-24 Final Exam

Author

Ziqi Zhao

Task 1 Environmental Data Analysis with “airquality” Dataset. (60 points)

You are an environmental scientist analyzing air quality data for New York City during the summer of 1973 using the built-in R dataset “airquality”. Your goal is to perform various data analysis tasks and visualizations on this dataset to gain insights and draw meaningful conclusions.

Subtask 1

Load the airquality dataset. Calculate the average ozone concentration (Ozone) for the entire dataset. Store the result in the variable named mean_ozone. Print out mean_ozone. Note: missing values (NA) must be removed before averaging. (4 points)

# Please insert your code here
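One possible approach, as a minimal sketch: `mean()` returns `NA` when any value is missing, so `na.rm = TRUE` is passed to drop the missing ozone readings first.

```r
# Load the built-in dataset (it ships with base R's datasets package)
data(airquality)

# Ozone contains missing values, so drop them before averaging
mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)
print(mean_ozone)
```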

Subtask 2

Find the row with the highest ozone concentration and store the result in the variable named max_ozone_day. Identify the corresponding weather conditions (Solar.R, Temp, Wind) of that day and store the results in the variable named max_ozone_conditions. Print out max_ozone_conditions. (8 points)

# Please insert your code here
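A sketch of one way to do this, assuming that "the row with the highest ozone concentration" means the first row attaining the maximum. `which.max()` skips missing values and returns that row's index.

```r
data(airquality)

# which.max() discards NA and returns the index of the first maximum
max_ozone_day <- airquality[which.max(airquality$Ozone), ]

# Keep only the weather variables recorded on that day
max_ozone_conditions <- max_ozone_day[, c("Solar.R", "Temp", "Wind")]
print(max_ozone_conditions)
```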

Subtask 3

Calculate the correlation matrix between Ozone, Solar.R, Temp, and Wind. Store the result in the variable named cor_matrix. Print out cor_matrix. (6 points)

# Please insert your code here
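The task does not say how missing values should be handled, so the sketch below makes an assumption: it keeps only the rows that are complete in all four variables (`use = "complete.obs"`). Using `use = "pairwise.complete.obs"` instead would be an equally defensible choice and gives slightly different values.

```r
data(airquality)

# Select the four variables of interest
vars <- airquality[, c("Ozone", "Solar.R", "Temp", "Wind")]

# Assumption: drop every row that has a missing value in any of the four columns
cor_matrix <- cor(vars, use = "complete.obs")
print(cor_matrix)
```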

Subtask 4

Create a scatter plot matrix to visualize the relationships between Ozone and the weather variables (Solar.R, Temp, Wind). Give the R code to create the scatter plot matrix. (8 points)

# Please insert your code here
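A minimal sketch using base R's `pairs()`, which draws one scatter plot for every pair of columns. The title, point symbol, and colour are illustrative choices, not requirements of the task.

```r
data(airquality)

# Ozone plus the three weather variables
weather_vars <- airquality[, c("Ozone", "Solar.R", "Temp", "Wind")]

# Points with a missing value in either variable of a panel are simply not drawn
pairs(weather_vars,
      main = "Scatter plot matrix of ozone and weather variables",
      pch = 19,
      col = "steelblue")
```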

Subtask 5

Calculate the 95% confidence interval for the mean Ozone concentration. (6 points)

# Please insert your code here
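One way to obtain the interval, sketched under the assumption that a t-based interval for the mean is what is intended: mean ± t(0.975, n − 1) × sd / √n, computed on the non-missing readings. The last line cross-checks the manual result against `t.test()`.

```r
data(airquality)

# Keep only the non-missing ozone readings
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

n      <- length(ozone)
se     <- sd(ozone) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # two-sided 95% critical value

ozone_ci <- mean(ozone) + c(-1, 1) * t_crit * se
print(ozone_ci)

# Cross-check: t.test() reports the same interval
print(t.test(ozone, conf.level = 0.95)$conf.int)
```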

Subtask 6

Perform a hypothesis test to investigate whether there is a significant difference in ozone concentration between the months of June and July at the significance level of 0.05. (24 points)

  1. What hypothesis test should you choose from those introduced in this module? (4 points)

    Provide your answer here:

    Note
  2. What are your null hypothesis and alternative hypothesis? (4 points)

    Provide your answer here:

    Note
  3. Give the R code to apply the hypothesis test. (4 points)

    # Please insert your code here
  4. Give the value of the test statistic. (4 points)

    # Please insert your code here
  5. Give the p-value. (4 points)

    # Please insert your code here
  6. Give the decision about the hypothesis. (4 points)

    Provide your answer here:

    Note
  7. Give the final conclusion. (4 points)

    Provide your answer here:

    Note
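The list of tests introduced in the module is not reproduced here, so the following is only a sketch under an assumption: that a two-sample t-test for independent groups is among them and is judged appropriate. `t.test()` runs Welch's version by default, which does not assume equal variances. The variable names are illustrative.

```r
data(airquality)

# Non-missing ozone readings for June (Month == 6) and July (Month == 7)
ozone_june <- airquality$Ozone[airquality$Month == 6 & !is.na(airquality$Ozone)]
ozone_july <- airquality$Ozone[airquality$Month == 7 & !is.na(airquality$Ozone)]

# Assumption: Welch two-sample t-test, two-sided
test_result <- t.test(ozone_june, ozone_july, alternative = "two.sided")

t_value <- unname(test_result$statistic)  # test statistic
p_value <- test_result$p.value            # p-value
print(t_value)
print(p_value)

# Decision at the 0.05 significance level
alpha <- 0.05
if (p_value < alpha) {
  print("Reject the null hypothesis of equal mean ozone in June and July")
} else {
  print("Fail to reject the null hypothesis of equal mean ozone in June and July")
}
```

Whether the t-test's assumptions hold for these two samples (for example, approximate normality within each month) would need to be checked separately before relying on the result.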

Subtask 7

Create a histogram of Wind speed (Wind) with appropriate labels and titles. (4 points)

# Please insert your code here
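A minimal sketch with base R's `hist()`. The dataset documentation gives Wind in miles per hour; the title and colours are illustrative choices.

```r
data(airquality)

# hist() draws the plot and also returns the bin counts invisibly
wind_hist <- hist(airquality$Wind,
                  main = "Distribution of wind speed, New York, 1973",
                  xlab = "Wind speed (mph)",
                  ylab = "Frequency",
                  col = "lightblue",
                  border = "white")
```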

Task 2 Step by step hypothesis test (40 points)

Effluents from wastewater treatment plants contain nutrients and organic and inorganic pollutants, and are an important source of urban river pollution. This pollution affects both the organisms living in the rivers and human health. To determine whether effluents from wastewater treatment plants influence microorganisms in urban rivers, a research group collected water samples from a river in Suzhou, both upstream (before receiving effluent) and downstream (after receiving effluent) of a municipal wastewater treatment plant. The numbers of bacterial species measured in the water samples collected in different seasons are as follows.

Subtask 1

What hypothesis test should you choose to estimate how wastewater treatment plant effluent, season, and the interaction between these two factors affect the number of bacterial species?

Provide your answer here:

Note

Subtask 2 (Continued)

What are your three null hypotheses?

Provide your answer here:

Note

H01:

H02:

H03:

Subtask 3 (Continued)

Apply the hypothesis test STEP BY STEP by answering the following questions. Give the reproducible R code to calculate the between-group degrees of freedom for effluent (1 point), season (1 point), and the interaction between effluent and season (1 point), and the within-group degrees of freedom (1 point).

# Insert your code here
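As a sketch only: for a two-factor design in which every effluent and season combination was sampled, the degrees of freedom follow from the number of levels of each factor and the total sample count. The helper `two_way_df()` and its argument names are hypothetical; `effluent` and `season` stand for vectors holding one label per water sample.

```r
# Hypothetical helper: degrees of freedom for a two-factor design.
# Assumption: no effluent x season combination is empty.
two_way_df <- function(effluent, season) {
  a       <- length(unique(effluent))  # number of effluent levels
  b       <- length(unique(season))    # number of seasons
  n_total <- length(effluent)          # total number of samples

  c(effluent    = a - 1,
    season      = b - 1,
    interaction = (a - 1) * (b - 1),
    within      = n_total - a * b)
}
```

Calling `two_way_df(effluent, season)` on the two factor columns of the data returns the four values as a named vector.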

Subtask 4 (Continued)

Give the reproducible R code to calculate the mean square (the mean of the squared deviations from the mean) for effluent (2 points), season (2 points), and the interaction between effluent and season (2 points).

# Insert your code here
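A sketch of how the mean squares could be computed by hand, assuming a balanced design (the same number of samples in every effluent and season combination). Each sum of squares is built from group means, and each mean square is the sum of squares divided by its degrees of freedom. The helper `two_way_ms()` and its argument names are hypothetical; `y` stands for the vector of bacterial species counts.

```r
# Hypothetical helper: mean squares for a balanced two-factor design.
two_way_ms <- function(y, effluent, season) {
  effluent <- factor(effluent)
  season   <- factor(season)
  a <- nlevels(effluent)
  b <- nlevels(season)
  n_total    <- length(y)
  grand_mean <- mean(y)

  # ave() gives, for every observation, the mean of the group it belongs to
  mean_eff  <- ave(y, effluent)
  mean_sea  <- ave(y, season)
  mean_cell <- ave(y, effluent, season)

  ss_eff    <- sum((mean_eff - grand_mean)^2)
  ss_sea    <- sum((mean_sea - grand_mean)^2)
  ss_cell   <- sum((mean_cell - grand_mean)^2)
  ss_int    <- ss_cell - ss_eff - ss_sea   # valid for balanced designs only
  ss_within <- sum((y - mean_cell)^2)

  c(effluent    = ss_eff / (a - 1),
    season      = ss_sea / (b - 1),
    interaction = ss_int / ((a - 1) * (b - 1)),
    within      = ss_within / (n_total - a * b))
}
```

The within-group mean square is returned as well because it is the denominator of every test statistic in the next step.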

Subtask 5 (Continued)

Give the reproducible R code to calculate the value of the test statistic for effluent (1 point), season (1 point), and interaction between effluent and season (1 point).

# Insert your code here
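Each test statistic in a two-factor analysis of variance is the effect's mean square divided by the within-group mean square. A minimal sketch, with a hypothetical helper whose arguments are the four mean squares from the previous step:

```r
# Hypothetical helper: F statistics from mean squares
f_statistics <- function(ms_effluent, ms_season, ms_interaction, ms_within) {
  c(effluent    = ms_effluent / ms_within,
    season      = ms_season / ms_within,
    interaction = ms_interaction / ms_within)
}
```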

Subtask 6 (Continued)

At the significance level of 0.05, give the critical value of the test statistic for effluent (1 point), season (1 point), and interaction between effluent and season (1 point).

# Insert your code here
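The critical value is the upper-tail quantile of the F distribution with the effect's degrees of freedom in the numerator and the within-group degrees of freedom in the denominator. A sketch with a hypothetical helper; the degrees of freedom are passed in as arguments because they depend on the design:

```r
# Hypothetical helper: critical F value at significance level alpha
f_critical <- function(df_effect, df_within, alpha = 0.05) {
  qf(1 - alpha, df1 = df_effect, df2 = df_within)
}
```

An effect is declared significant when its F statistic exceeds the value returned for its own degrees of freedom.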

Subtask 7 (Continued)

Give the reproducible R code to calculate the p-value for effluent (2 points), season (2 points), and the interaction between effluent and season (2 points).

# Insert your code here
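The p-value of an F test is the probability, under the null hypothesis, of an F value at least as large as the one observed, that is, the upper tail of the F distribution. A sketch with a hypothetical helper that takes the statistic and its two degrees of freedom:

```r
# Hypothetical helper: upper-tail p-value for an observed F statistic
f_p_value <- function(f_stat, df_effect, df_within) {
  pf(f_stat, df1 = df_effect, df2 = df_within, lower.tail = FALSE)
}
```

The helper would be called once per effect, each time with that effect's F statistic and degrees of freedom.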

Subtask 8 (Continued)

Give the decisions about your hypotheses.

Provide your answer here:

Note

Subtask 9 (Continued)

What are your conclusions?

Provide your answer here:

Note