HW3

Group:

Ayelet Shani, 207207218, ayelet.shani@mail.huji.ac.il
Noam Salomon, 206481152, noam.salomon1@mail.huji.ac.il
Aviv Gelfand, 206958068, Aviv.gelfand@mail.huji.ac.il

Libraries used

if (!require('data.table')){
  install.packages('data.table')
  library('data.table')
}

## Loading required package: data.table

if (!require('manipulate')){
  install.packages('manipulate')
  library('manipulate')
}

## Loading required package: manipulate

# install.packages("Rcpp")
library('Rcpp')
library('ggplot2')

library(data.table)

Paths and Data

#reads_file_noam = "G:/My Drive/שנה ג/סמסטר ב/מעבדה/מטלה 2/TCGA-13-0723-01A_lib2_all_chr1.forward"
reads_file_noam = "TCGA-13-0723-01A_lib2_all_chr1.forward"

#reads_file_Ayelet = "C:/Users/User/Documents/לימודים/תואר1/שנה3/סטטיסטיקה/מעבדה/lab2/TCGA-13-0723-01A_lib2_all_chr1.forward"

chr1_reads = fread(reads_file_noam)
colnames(chr1_reads) = c("Chrom", "Loc", "FragLen")
head(chr1_reads)

##    Chrom   Loc FragLen
##    <int> <int>   <int>
## 1:     1    23     198
## 2:     1    25     262
## 3:     1    27     200
## 4:     1    31     326
## 5:     1    32     197
## 6:     1    32     199

“Fast” function from Lab 2

Rcpp::cppFunction('
NumericVector getReadLineCpp(NumericVector locations, int beg_region, int end_region) {
  int N = end_region - beg_region + 1;
  NumericVector read_line(N);
  
  int r = 0;
  while (r < locations.size() && locations[r] < beg_region) {
    r++;
  }
  
  while (r < locations.size() && locations[r] <= end_region) {
    read_line[locations[r] - beg_region] += 1;
    r++;
  }
  
  return read_line;
}')

# define the regions
beg_region = 1
end_region = 20000000
read_line = getReadLineCpp(chr1_reads$Loc, beg_region, end_region)

Part a Qa

We calculate the expected distribution of read counts based on the Poisson model using the calculated average coverage as \(λ\). We then compare this expected distribution with the observed distribution.

# calculate the average coverage (mean of read counts)
lambda <- mean(read_line)

# calculate the observed distribution
observed_distribution <- table(factor(read_line, levels = 0:max(read_line)))

# create a table for values 0-5 and >5, calculate the corresponding expected poisson values

observed_table <- c(observed_distribution[1:6], sum(observed_distribution[7:length(observed_distribution)]))
names(observed_table) <- c(0:5, ">5")
expected_probs <- dpois(0:5, lambda)
expected_probs <- c(expected_probs, 1 - sum(expected_probs)) # Probability for >5

total_bases <- length(read_line)
expected_counts <- expected_probs * total_bases

# create a table for expected counts, combine observed and expected tables
expected_table <- data.frame(Values = c(0:5, ">5"), Expected = round(expected_counts, 2))
comparison_table <- data.frame(Values = c(0:5, ">5"), Observed = round(observed_table, 2), Expected = round(expected_counts, 2))

# print the table
print(comparison_table)

##    Values Observed    Expected
## 0       0 18279693 18241434.43
## 1       1  1610109  1678880.52
## 2       2   103067    77259.27
## 3       3     6187     2370.23
## 4       4      581       54.54
## 5       5      135        1.00
## >5     >5      228        0.02

When looking at the comparison for single locations, there seems to be somewhat of a fit with differences only coming in at the edge values. We continue to plot the observed and expected distributions to visually compare them. This helps in understanding how well the Poisson model fits the observed data, we also used log scale so that the 0 isn’t so dominant and makes all other values seem very low.

# Create the barplot
barplot(rbind(observed_table, expected_table$Expected), beside = TRUE, 
        col = c("skyblue", "orange"), legend = c("Observed", "Expected"),
        args.legend = list(x = "topright"),
        main = "Log Scaled Observed vs Expected Poisson Distribution", 
        xlab = "Counts", ylab = "Frequency", log = "y")

## Warning in yinch(0.1): y log scale: yinch() is nonsense

# Add horizontal gridlines
grid(nx = NA, ny = NULL) # nx=NA to skip vertical grid lines, ny=NULL for default horizontal grid lines

The plot further supports the notion that that the poisson model is a fairly good fit when looking at each base individually, apart from the >5 base that is “noised” by a

Part a, Q.b

Fragments sum function from Lab 2.

# Function to sum fragments in bins of 10k size
sumFragmentsInBins = function(locations, beg_region, end_region, bin_size = 10000) {
  
  # Get read_line vector
  read_line = getReadLineCpp(locations, beg_region, end_region)
  
  # Number of bins
  num_bins = ceiling((end_region - beg_region + 1) / bin_size)
  print(paste("num of bins: ",num_bins))
  
  # Initialize bin_counts vector
  bin_counts = numeric(num_bins)

  # Sum fragments in each bin
  for (i in 1:num_bins) {
    bin_start = (i - 1) * bin_size + 1
    bin_end = min(i * bin_size, length(read_line))
    bin_counts[i] = sum(read_line[bin_start:bin_end])
  }
  
  return(bin_counts)
}

Using the guidance provided in the instructions to do some filtering first, using IQR method.

beg_region <- 1
end_region <- 20000000
bin_size <- 10000
bin_counts <- sumFragmentsInBins(chr1_reads$Loc, beg_region, end_region, bin_size)

## [1] "num of bins:  2000"

# calculate outlier cutoffs using the IQR method as Yuval showed in class.
q1 <- quantile(bin_counts, 0.25)
q3 <- quantile(bin_counts, 0.75)
iqr <- q3 - q1
bottom_cut <- q1 - 1.5 * iqr
top_cut <- q3 + 1.5 * iqr

# exclude outliers 
filtered_counts <- bin_counts[bin_counts >= bottom_cut & bin_counts <= top_cut]

Part a. Q.b section a

# calculate average coverage rate (λ)
lambda <- mean(filtered_counts)

# robust measures of center and spread
median_count <- median(filtered_counts)
iqr_counts <- IQR(filtered_counts)

# spread comparison
observed_variance <- var(filtered_counts)
expected_variance <- lambda  # For a Poisson distribution, variance is also lambda

# output robust measures and dispersion comparisons
cat("Robust measures of center and spread:\n")

## Robust measures of center and spread:

cat("Median count per bin:", median_count, "\n")

## Median count per bin: 1020

cat("IQR of counts per bin:", iqr_counts, "\n\n")

## IQR of counts per bin: 186

cat("Spread comparison:\n")

## Spread comparison:

cat("Observed variance of counts per bin:", observed_variance, "\n")

## Observed variance of counts per bin: 21900.06

cat("Expected variance under uniform Poisson model:", expected_variance, "\n")

## Expected variance under uniform Poisson model: 994.3181

The observed variance is much higher than the expected variance under the Poisson model. this may be because reads might be clustered in certain regions, leading to higher variability. Also, there might be underlying heterogeneity in the process generating the reads.

Part a. Q.b section b

rounded_counts <- round(filtered_counts)
# convert rounded counts to a data frame to plot with ggplot
df <- data.frame(counts = rounded_counts)

# calculate Poisson probabilities
x_vals <- 0:max(rounded_counts)
expected_probs <- dpois(x_vals, lambda)
expected_counts <- data.frame(counts = x_vals, probability = expected_probs)


# density plot with Poisson overlay 
ggplot(df, aes(x = counts)) +
  geom_density(aes(y = ..density.., fill = "Observed"), alpha = 0.5) +
  stat_function(fun = function(x) dpois(round(x), lambda), aes(color = "Poisson"), size = 1) +
  labs(title = "Density Plot of Bin Counts", x = "Number of Fragments per Bin", y = "Density") +
  scale_fill_manual(name = "Distribution", values = c("Observed" = "skyblue")) +
  scale_color_manual(name = "Distribution", values = c("Poisson" = "red")) +
  theme_minimal() +
  theme(legend.position = "right")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

When looking at bin analysis for bins of size 10000, it doesn’t seem like the poisson model is a very good fit, whilst there may be areas in which they do align, the overall distributions do not fit each other, even though it seems their respective centers do align, somewhere around 1000. in the next section we perform chi-squared test and K-S test to try to test the fit in numeric methods.

Part a. Q.b section c

# adjust bins for chi-square test, otherwise the test wont work since we'll have bins with values of 0, due to the nature of the coverage.
threshold <- 5
observed_counts <- table(factor(rounded_counts, levels = x_vals))
expected_counts <- dpois(x_vals, lambda) * length(rounded_counts)

# combine bins with expected counts less than the threshold
combined_observed <- observed_counts
combined_expected <- expected_counts

low_freq_bins <- which(expected_counts < threshold)
combined_observed[low_freq_bins[1]] <- sum(observed_counts[low_freq_bins])
combined_expected[low_freq_bins[1]] <- sum(expected_counts[low_freq_bins])
combined_observed <- combined_observed[-low_freq_bins[-1]]
combined_expected <- combined_expected[-low_freq_bins[-1]]

# perform chi-square test
chisq_test <- chisq.test(combined_observed, p = combined_expected / sum(combined_expected), rescale.p = TRUE)
print(chisq_test)

## 
##  Chi-squared test for given probabilities
## 
## data:  combined_observed
## X-squared = 9457.3, df = 109, p-value < 2.2e-16

The large value of the chi-squared statistic (9457.3) indicates that the observed frequencies deviate substantially from the expected frequencies under the Poisson model. Furthermore, the small p-value (< 2.2e-16) indicates that this deviation is significant. So, the null hypothesis (that the observed data follows a Poisson distribution) is rejected.

# Perform the Kolmogorov-Smirnov test
ks_test <- ks.test(rounded_counts, "ppois", lambda)

## Warning in ks.test.default(rounded_counts, "ppois", lambda): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

# Output the test results
print(ks_test)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  rounded_counts
## D = 0.37188, p-value < 2.2e-16
## alternative hypothesis: two-sided

similar to the chi-squared test, the p-value of the K-S test indicates that the observed distribution is not well-described by a Poisson model.

Part b. Q.a section a

Counting C and G in the cells as in Lab 1 with a function.

# Load the RDA file
#load("C:/Users/User/Documents/לימודים/תואר1/שנה3/סטטיסטיקה/מעבדה/lab3/chr1_line.rda")
load("chr1_line.rda")

# The indexes of our chosen area
start_pos <- 1           
end_pos <- 50000000  
cell_size <- 10000

Making a function for counting C + G bases in each cell at our chosen region.

chosen_region <- chr1_line[start_pos:end_pos]

# Function to count C and G bases in each cell
count_CG_in_cells <- function(sequence, cell_length) {
  num_cells <- ceiling(length(sequence) / cell_length)
  CG_counts <- numeric(length = num_cells)
  
  for (cell_index in seq_len(num_cells)) {
    start_index <- (cell_index - 1) * cell_length + 1
    end_index <- min(cell_index * cell_length, length(sequence))
    cell_sequence <- sequence[start_index:end_index]
    
    counts <- sum(cell_sequence == "C") + sum(cell_sequence == "G")
    CG_counts[cell_index] <- counts  # Storing the sum directly into the vector

  }
  
  return(CG_counts)
}

# Get the counts of C and G bases in each cell
CG_counts_per_cell <- count_CG_in_cells(chosen_region, cell_size)
print(length(CG_counts_per_cell))

## [1] 5000

Showing the counts for each location.

# Create a data frame for plotting
CG_counts_df <- data.frame(
  Cell = seq_along(CG_counts_per_cell),
  CG_Counts = CG_counts_per_cell
)

ggplot(CG_counts_df, aes(x = Cell, y = CG_Counts)) +
  geom_point(color = "blue") +
  labs(title = "Sum of C + G Bases Along the Sequence",
       x = "Cell Index",
       y = "Sum of C + G Bases") +
  theme_minimal()

## Warning: Removed 67 rows containing missing values or values outside the scale range
## (`geom_point()`).

Part b. Q.a section b

Summing up the reads:

getReadLine <- function(locations, beg_region, end_region, cell_size) {
  # Filter locations within the specified region
  good_loc <- locations[locations >= beg_region & locations <= end_region]
  
  # Create a data.table for efficient aggregation
  dt <- data.table(location = good_loc)
  
  # Count the number of fragments starting at each position
  cnt_frag <- dt[, .N, by = location]
  
  # Initialize the line vector with zeros for each cell
  num_cells <- ceiling((end_region - beg_region + 1) / cell_size)
  line <- integer(num_cells)
  
  # Iterate over each cell to sum the counts
  for (i in 1:num_cells) {
    cell_start <- beg_region + (i-1) * cell_size
    cell_end <- min(beg_region + i * cell_size - 1, end_region)
    
    # Sum the counts in the current cell
    line[i] <- sum(cnt_frag$N[cnt_frag$location >= cell_start & cnt_frag$location <= cell_end])
  }
  
  return(line)
}



read_line = getReadLine(chr1_reads$Loc, start_pos, end_pos, cell_size)

# Create a data frame for coverage and total GC counts
data <- data.frame(count = read_line, GC_Countent = (CG_counts_per_cell/cell_size))

data <- na.omit(data)

# Check for correlation
correlation <- cor(data$count, data$GC_Countent)

# Print the correlation coefficient
print(paste("The Pearson correlation is:", correlation))

## [1] "The Pearson correlation is: 0.164595719990672"

There is a minor positive Pearson Correlation between the number of reads in the cell and the percentage of C and G in the cell. But perhaps there is a non-linear connection.

Part b. Q.a section c

# Plot the data with a smoothed regression line
ggplot(data, aes(x = GC_Countent, y = count)) +
  geom_point() +  # Add points
  geom_smooth(method = "loess", se = FALSE) +  # Add trend line without confidence interval
  labs(title = "G + C Bases Rate vs. Number of Reads", x = "G + C Bases", y = "Number of Reads") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Peak and Decline: The peak in the number of reads suggests that there is an optimal range of G + C bases (around 4500) that results in the maximum number of reads. Beyond this optimal range, the number of reads decreases, indicating a possible negative impact of higher G + C bases on the number of reads.

The relationship between the G + C bases rate and the number of reads is non-linear. There is an optimal G + C bases range around 4500 where the number of reads is maximized. Both lower and higher G + C bases rates are associated with fewer reads, suggesting a quadratic correlation.

The plot in Dohm’s article shows a linear relationship, where an increase in GC-content consistently leads to an increase in the number of reads per kbp. But they range only till 50% in X acis. In fact, both of the relationships are similar in the range of 0-50% ratio of G+C bases.

Part b. Q.a section d

Deriving a conclusion in relation to our founding in Section A, there is a a potential bias towards G+C Bases in the count of readings that potentially affects the distribution of number of fragment readings in the methodology of reading the genes.

HW3

Aviv, Ayelet & Noam

01 June 2024