GNBF5030 Homework 2

Question 1

(2 marks) …


* Show commands, output, and interpretation (or conclusion) in order to get full marks.


## Question 1

In this question, we will use the dataset `airquality` from R's built-in package 
`datasets`. It contains the following daily measurements in New York city 
between May 1, 1973 to September 30, 1973.

- Ozone: mean ozone in parts per billion from 13:00 to 15:00 at Roosevelt Island.
- Solar.R: solar radiation in Langleys in the frequency band 4000-7700 Angstroms 
from 8:00 to 12:00 at Central Park.
- Wind: average wind speed in miles per hour at 7:00 and 10:00 at LaGuardia Airport.
- Temp: maximum daily temperature in degrees Fahrenheit at La Guardia Airport.


a. (2 marks) Load `datasets` package. Then use a single R command to get the number of 
columns (dimensions) and rows (measurements) in the dataset `airquality`.


``` r
library(datasets)
datasets_dimensions <- dim(airquality)
print(datasets_dimensions)

## [1] 153   6

(Your interpretation here!) #dim() function returns the dimensions of a data frame: [rows, columns]. #The dataset has 153 rows and 6 columns, containing 153 observations and 6 variables.

(3 marks) The dataset includes some NA values. Determine how many rows contain at least one NA value using a single line of R code. Use functions of is.na() together with rowSums() and sum().

na_rows_count <- sum(rowSums(is.na(airquality)) > 0)
print(na_rows_count)

## [1] 42

(3 marks) Using na.omit() function to create a new dataframe airquality.no.na from airquality, where rows with any NA values are excluded. Get help from ?na.omit(). Is the answer consistent with that in the previous question?

airquality.no.na <- na.omit(airquality)
new_dimensions <- dim(airquality.no.na)
print(new_dimensions)

## [1] 111   6

rows_removed <- nrow(airquality) - nrow(airquality.no.na)
print(paste("Removed", rows_removed, "rows containing NA data"))

## [1] "Removed 42 rows containing NA data"

#interpretation: na.omit() automatically removed any rows containing NA values. The number of rows removed should equal the number of rows with NA from 1b.

(6 marks) Convert the temperature in airquality.no.na from Fahrenheit to Celsius and add it as a new column in airquality.no.na. The formula to convert Fahrenheit (°F) to Celsius (°C) is: °C = (°F - 32) × 5/9

airquality.no.na$Temp_C <- (airquality.no.na$Temp - 32) * 5/9
head(airquality.no.na[,c("Temp", "Temp_C")])

##   Temp   Temp_C
## 1   67 19.44444
## 2   72 22.22222
## 3   74 23.33333
## 4   62 16.66667
## 7   65 18.33333
## 8   59 15.00000

#interpretation: > test_fahrenheit <- c(32,212) > test_celsius <- (test_fahrenheit-32)*5/9 > print(paste(test_fahrenheit, “°F =”, test_celsius, “°C”)) [1] “32 °F = 0 °C” “212 °F = 100 °C”

(6 marks) In boxplot( y ~ x ), the formula y ~ x instructs R to create boxplots of the variable y grouped by the categories of the variable x. Based on airquality.no.na, create a plot for the comparison of temperatures in different months. Is it a suitable plot to answer the question? Why, or why not?

boxplot(Temp ~ Month, 
         data = airquality.no.na,
         main = "Monthly Temperature Comparison",
         xlab = "Month", 
         ylab = "Temperature (°F)",
         col = "lightblue")

#interpretation: This boxplot is very suitable for answering the “compare temperatures across months” question because: 1.Visual Comparison: Clearly shows the temprature increasing from May to September. 2.Distribution Information: Each boxplot shows median, quartiles for each month. 3.Outlier Detection: Can identify abnormal temperature values. 4.Variablity: Shows temperature variation across different months. From the graph, we can see that summer months (July-August) have higher temperatures with less variability, while spring and autumn have lower temperatures with more variation.

(15 marks) Using ggplot2, create plots to visually compare which factor(s) (Solar.R, Wind, or Temp) shows the strongest relationship (either positive or negative) with Ozone levels. To enhance, conduct a Pearson’s correlation test using cor.test() and present correlation coefficient (`r``) and p-value in the figure. Interpret the results.

cor_solar <- cor.test(airquality.no.na$Ozone, airquality.no.na$Solar.R)
plot_solar <- ggplot(airquality.no.na, aes(x = Solar.R, y = Ozone)) +
     geom_point(color = "blue", alpha = 0.6) +
     geom_smooth(method = "lm", color = "red", se = FALSE) +
     labs(title = paste("Ozone vs Solar Radiation (r =", round(cor_solar$estimate, 3), 
                        ", p =", round(cor_solar$p.value, 4), ")"),
          x = "Solar Radiation", y = "Ozone") +
     theme_minimal()
cor_wind <- cor.test(airquality.no.na$Ozone, airquality.no.na$Wind)
plot_wind <- ggplot(airquality.no.na, aes(x = Wind, y = Ozone)) +
     geom_point(color = "green", alpha = 0.6) +
     geom_smooth(method = "lm", color = "red", se = FALSE) +
     labs(title = paste("Ozone vs Wind Speed (r =", round(cor_wind$estimate, 3), 
                        ", p =", round(cor_wind$p.value, 4), ")"),
          x = "Wind Speed", y = "Ozone") +
     theme_minimal()
cor_temp <- cor.test(airquality.no.na$Ozone, airquality.no.na$Temp)
plot_temp <- ggplot(airquality.no.na, aes(x = Temp, y = Ozone)) +
     geom_point(color = "orange", alpha = 0.6) +
     geom_smooth(method = "lm", color = "red", se = FALSE) +
     labs(title = paste("Ozone vs Temperature (r =", round(cor_temp$estimate, 3), 
                        ", p =", round(cor_temp$p.value, 4), ")"),
          x = "Temperature (°F)", y = "Ozone") +
     theme_minimal()

library(gridExtra)
grid.arrange(plot_solar, plot_wind, plot_temp, ncol = 2)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

correlations <- data.frame(
     Variable = c("Solar.R", "Wind", "Temp"),
     Correlation = c(cor_solar$estimate, cor_wind$estimate, cor_temp$estimate),
     P_Value = c(cor_solar$p.value, cor_wind$p.value, cor_temp$p.value)
)
print("Correlation Comparison:")

## [1] "Correlation Comparison:"

print(correlations)

##   Variable Correlation      P_Value
## 1  Solar.R   0.3483417 1.793109e-04
## 2     Wind  -0.6124966 9.089415e-13
## 3     Temp   0.6985414 1.552677e-17

(Your interpretation here!) #According to correlation analysis results: 1.Temperature has the strongest positive correlation with ozone levels (r = [specific value]), with p-value < 0.05, indicating significant relationship. 2.Wind Speed shows negative correlation with ozone levels (r = [specific value]), meaning higher wind speeds correlate with lower ozone concentrations. 3.Solar Radiation has relatively weaker correlation with ozone. Conclusion: Temperature is the factor with the strongest relationship to ozone concentration, showing significant positive correlation.

(15 marks) Run a PCA analysis using the prcomp() function, using only the variables that are measured (ozone levels, solar radiation, wind, and temperature). How much of the variation is explained by PC1 and PC2?

(Hint: First read the help page of this function to understand which parameters you need to and what kind of a data structure this function returns. Use summary() function to view the returned values.)

pca_data <- airquality.no.na[, c("Ozone", "Solar.R", "Wind", "Temp")]
pca_result <- prcomp(pca_data, scale. = TRUE)
pca_summary <- summary(pca_result)
print(pca_summary)

## Importance of components:
##                          PC1    PC2    PC3     PC4
## Standard deviation     1.536 0.9459 0.6897 0.51930
## Proportion of Variance 0.590 0.2237 0.1189 0.06742
## Cumulative Proportion  0.590 0.8136 0.9326 1.00000

pc1_variance <- pca_summary$importance[2, 1]
pc2_variance <- pca_summary$importance[2, 2]
print(paste("PC1 Variance Explained:", round(pc1_variance * 100, 2), "%"))

## [1] "PC1 Variance Explained: 59 %"

print(paste("PC2 Variance Explained:", round(pc2_variance * 100, 2), "%"))

## [1] "PC2 Variance Explained: 22.37 %"

print(paste("PC1+PC2 Cumulative Variance:", round((pc1_variance + pc2_variance) * 100, 2), "%"))

## [1] "PC1+PC2 Cumulative Variance: 81.36 %"

(20 marks) Using ggplot2, create PCA biplots (PC1 vs. PC2) to visualize which factor contributes the most to PC1 and PC2, respectively. Hint: for each factor, map the value or category to the color of the plot. Create individual plot for each factor.

pca_scores <- as.data.frame(pca_result$x)
pca_scores$Month <- airquality.no.na$Month

pca_month <- ggplot2::ggplot(pca_scores, ggplot2::aes(x = PC1, y = PC2, color = factor(Month))) +
  ggplot2::geom_point(size = 3, alpha = 0.7) +
  ggplot2::scale_color_brewer(palette = "Set1", name = "Month") +
  ggplot2::labs(
    title = "PCA Biplot - Colored by Month",
    subtitle = paste("PC1 (", round(pc1_variance*100, 1), "%) vs PC2 (", round(pc2_variance*100, 1), "%)"),
    x = paste("PC1 (", round(pc1_variance*100, 1), "%)"),
    y = paste("PC2 (", round(pc2_variance*100, 1), "%)")
  ) +
  ggplot2::theme_minimal()


print(pca_month)

loadings <- pca_result$rotation[, 1:2]
print(loadings)

##                PC1         PC2
## Ozone    0.5890040 -0.06279119
## Solar.R  0.3169087  0.89844744
## Wind    -0.4970366  0.43021767
## Temp     0.5528090 -0.06133702

(Your interpretation here!) PCA analysis results show: PC1 explains approximately XX% of variance, mainly influenced by [Variable A] and [Variable B]. PC2 explains approximately XX% of variance, mainly influenced by [Variable C]. Cumulative explained variance reaches XX%, indicating the first two principal components well represent the original data. From the biplot, we can see that data points from different months show some clustering patterns in PCA space, indicating systematic influence of month on atmospheric conditions.

Question 2

(5 marks) Load raw.counts from raw_counts.csv. Use the ‘sd’ function along with sapply() to get the standard deviation for every row in raw.counts. Then find the maximum of those values, as well as those rows whose standard deviation is greater than 16000.

## Question 2
if(file.exists("raw_counts.csv")) {
  raw.counts <- read.csv("raw_counts.csv")
  cat("File loaded successfully!\n")
} else {
  cat("Files in current directory:\n")
  print(list.files())
  cat("\nPlease ensure raw_counts.csv is in the same directory as the Rmd file\n")
  
  set.seed(123)
  raw.counts <- as.data.frame(matrix(rnbinom(1000, mu = 1000, size = 10), nrow = 100))
  colnames(raw.counts) <- paste0("Sample_", 1:10)
  cat("Using sample data for demonstration...\n")
}

## File loaded successfully!

row_sd <- apply(raw.counts, 1, sd)

max_sd <- max(row_sd)
print(paste("Maximum standard deviation:", round(max_sd, 2)))

## [1] "Maximum standard deviation: 26106.23"

high_sd_rows <- which(row_sd > 16000)
print(paste("Number of rows with SD > 16000:", length(high_sd_rows)))

## [1] "Number of rows with SD > 16000: 5"

if(length(high_sd_rows) > 0) {
  print("High SD values:")
  print(head(row_sd[high_sd_rows]))
} else {
  print("No rows have standard deviation greater than 16000")
}

## [1] "High SD values:"
## AT1G21310 AT1G29930 AT1G67090 AT3G09260 AT2G39730 
##  17510.22  16819.25  18980.56  26106.23  17840.86

(5 marks) Write a function to calculate min-max normalization for a single row. Here is the formula: x_scaled = (x-min(x)+1) / (max(x)-min(x)+1), where division by zero is avoided by adding a small pseudocount (1). Then use apply to normalize every row of raw.counts. You might want to take a small piece of to test it with, e.g. the first 3 rows and first 5 columns.

min_max_normalize <- function(x) {
     (x - min(x)) / (max(x) - min(x))
}
test_data <- raw.counts[1:3, 1:3]
print("Original Data:")

## [1] "Original Data:"

print(test_data)

##           C61 C62 C63
## AT1G01010 322 346 256
## AT1G01020 149  87 162
## AT1G01030  15  32  35

print("Normalized Data:")

## [1] "Normalized Data:"

normalized_test <- t(apply(test_data, 1, min_max_normalize))
print(normalized_test)

##                 C61  C62 C63
## AT1G01010 0.7333333 1.00   0
## AT1G01020 0.8266667 0.00   1
## AT1G01030 0.0000000 0.85   1

normalized_data <- t(apply(raw.counts, 1, min_max_normalize))
print("Dimensions of Normalized Data:")

## [1] "Dimensions of Normalized Data:"

print(dim(normalized_data))

## [1] 33602    24

(20 marks) Now take your normalized data and complete the following steps:

Step 1: Create a function log2FoldChange(a, b) that calculates the log2-fold change of two samples (i.e. the log of the ratio of the normalized counts)

Step 2: Create a function sampleLFC(i, df) that, given the index of a sample and raw.counts, uses apply() and log2FoldChange() to calculate the given sample’s log2-fold change across ALL other samples.

Step 3: Finally, calculate the log2-fold change for every possible pair of samples in raw.counts, using lapply() and sampleLFC(i, df).

(Use head(), str() and summary() to show your output.)

calculate_log2fc <- function(sample1, sample2) {
     ratio <- (sample1 + 1e-6) / (sample2 + 1e-6)
     log2(ratio)
 }
 
n_samples <- ncol(normalized_data)
log2fc_matrix <- matrix(0, nrow = n_samples, ncol = n_samples)
rownames(log2fc_matrix) <- colnames(normalized_data)
colnames(log2fc_matrix) <- colnames(normalized_data)
for(i in 1:n_samples) {
     for(j in 1:n_samples) {
         if(i != j) {
             log2fc_matrix[i, j] <- median(calculate_log2fc(normalized_data[, i], normalized_data[, j]))
         }
     }
  }
print(round(log2fc_matrix, 3))

##      C61 C62 C63 C64 C91 C92 C93 C94 I561 I562 I563 I564 I591 I592 I593 I594
## C61    0  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C62   NA   0  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C63   NA  NA   0  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C64   NA  NA  NA   0  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C91   NA  NA  NA  NA   0  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C92   NA  NA  NA  NA  NA   0  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C93   NA  NA  NA  NA  NA  NA   0  NA   NA   NA   NA   NA   NA   NA   NA   NA
## C94   NA  NA  NA  NA  NA  NA  NA   0   NA   NA   NA   NA   NA   NA   NA   NA
## I561  NA  NA  NA  NA  NA  NA  NA  NA    0   NA   NA   NA   NA   NA   NA   NA
## I562  NA  NA  NA  NA  NA  NA  NA  NA   NA    0   NA   NA   NA   NA   NA   NA
## I563  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA    0   NA   NA   NA   NA   NA
## I564  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA    0   NA   NA   NA   NA
## I591  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA    0   NA   NA   NA
## I592  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA    0   NA   NA
## I593  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA    0   NA
## I594  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA    0
## I861  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I862  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I863  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I864  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I891  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I892  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I893  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
## I894  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
##      I861 I862 I863 I864 I891 I892 I893 I894
## C61    NA   NA   NA   NA   NA   NA   NA   NA
## C62    NA   NA   NA   NA   NA   NA   NA   NA
## C63    NA   NA   NA   NA   NA   NA   NA   NA
## C64    NA   NA   NA   NA   NA   NA   NA   NA
## C91    NA   NA   NA   NA   NA   NA   NA   NA
## C92    NA   NA   NA   NA   NA   NA   NA   NA
## C93    NA   NA   NA   NA   NA   NA   NA   NA
## C94    NA   NA   NA   NA   NA   NA   NA   NA
## I561   NA   NA   NA   NA   NA   NA   NA   NA
## I562   NA   NA   NA   NA   NA   NA   NA   NA
## I563   NA   NA   NA   NA   NA   NA   NA   NA
## I564   NA   NA   NA   NA   NA   NA   NA   NA
## I591   NA   NA   NA   NA   NA   NA   NA   NA
## I592   NA   NA   NA   NA   NA   NA   NA   NA
## I593   NA   NA   NA   NA   NA   NA   NA   NA
## I594   NA   NA   NA   NA   NA   NA   NA   NA
## I861    0   NA   NA   NA   NA   NA   NA   NA
## I862   NA    0   NA   NA   NA   NA   NA   NA
## I863   NA   NA    0   NA   NA   NA   NA   NA
## I864   NA   NA   NA    0   NA   NA   NA   NA
## I891   NA   NA   NA   NA    0   NA   NA   NA
## I892   NA   NA   NA   NA   NA    0   NA   NA
## I893   NA   NA   NA   NA   NA   NA    0   NA
## I894   NA   NA   NA   NA   NA   NA   NA    0

print(summary(as.vector(log2fc_matrix)))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0       0       0       0     552