MATH50010 Coursework 2023-24

Part I. Loading and exploration

Question 1.

# The name of the data file is called tower_bridge.csv
df <- read.csv("tower_bridge.csv", header = TRUE)

# Display the initial several rows of the data-frame
head(df)

Question 2.

# Split the data into low and high pollution levels
# The name of the level of NO2 in the data-frame is called "Value"
df$state <- ifelse(df$Value > 40, 1, 0)

# Display the new variable `state`
print(df$state)

##   [1] 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 1
##  [75] 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [260] 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## [334] 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Question 3.

# Create a new table with state
table_with_state <- table(df$state)

# Calculate the proportion of days in each state using `prop` functions
print(prop.table(table_with_state))

## 
##         0         1 
## 0.8054795 0.1945205

Therefore, by the new table given by the code,

The proportion of days with high pollution is: \(0.1945205 \approx \%19.45\)
The proportion of days with low pollution is: \(0.8054795 \approx \%80.55\)

Part II. A Markov Chain Model

We will now model the data as a Markov Chain.

Question 4.

# Count the number of transitions for each possible pairs
counts_transitions <- table(df$state[-length(df$state)], df$state[-1])

# Display the transition counts table
print(counts_transitions)

##    
##       0   1
##   0 258  35
##   1  35  36

Question 5.

# We could use the `prop` function to calculate the probabilities

# Create and estimate the transitions matrix of chain
transitions_matrix <- prop.table(counts_transitions, margin = 1)

# Display the transition matrix
print(transitions_matrix)

##    
##             0         1
##   0 0.8805461 0.1194539
##   1 0.4929577 0.5070423

Question 6.

We define a new function that simulates draws of length m from a two state Markov chain with states \(0\) and \(1\).

# The input of the function should be the `transitions_matrix` and the length `m`
simulating_markov_chain <- function (transitions_matrix, m) {
  states <- rep(0, times = m) # This is to set the total length of the Markov chain
  states[1] <- sample(0:1, 1, prob = c(0.5, 0.5)) # This is the initial states of the chain
  
# Simulate a Markov chain sequence 
  for (i in 2:m) {
      states[i] <- sample(0:1, 1, prob = transitions_matrix[states[i-1] + 1, ])
}
return(states)
}

Question 7.

1. Compute the estimates of the the transition probabilities

n = 1000 # Assume we simulate the process for a thousand independent `years`
m = 365 # There are 356 days in a year, usually

# Initialize a matrix
estimates <- matrix(0, nrow = n, ncol = 4)

# Simulate n years of daily high/low classifications

for (i in 1:n) {
  # Simulate a Markov Chain sequence using the function that defined before
  simulated_sequence <- simulating_markov_chain(transitions_matrix, m)
  # Count the number of transitions in the simulated sequence
  transitions_counts_estimated <- table(simulated_sequence[-length(simulated_sequence)], 
                                      simulated_sequence[-1])
  # Compute the estimates of the transition probabilities for each year
  
  # Estimate the transition probabilities of the counts
  transitions_matrix_estimated <- prop.table(transitions_counts_estimated, margin = 1)
  # Transfer the estimated transition probabilities into a matrix
  estimates[i, ] <- as.vector(transitions_matrix_estimated)
}

colnames(estimates) <- c('(0, 0)', '(1, 0)', '(0, 1)', '(1, 1)')

2. Check if the estimators are approximately unbiased

# Compute the bias
means <- colMeans(estimates)
bias <-  means - transitions_matrix

# Print the bias
print(bias)

##    
##                0            1
##   0 -0.001516454  0.001516454
##   1  0.008347315 -0.008347315

# Check if the estimators are approximately unbiased
if (all(abs(bias) < 0.05)) {
  cat("The estimators are approximately unbiased.")
} else {
  cat("The estimators are not approximately unbiased.")
}

## The estimators are approximately unbiased.

3. Check if the estimates of different transition probabilities are correlated

# Create the correlations matrix
correlations_matrix <- cor(estimates)

# Display the correlations matrix
print(correlations_matrix)

##              (0, 0)       (1, 0)       (0, 1)       (1, 1)
## (0, 0)  1.000000000  0.007790145 -1.000000000 -0.007790145
## (1, 0)  0.007790145  1.000000000 -0.007790145 -1.000000000
## (0, 1) -1.000000000 -0.007790145  1.000000000  0.007790145
## (1, 1) -0.007790145 -1.000000000  0.007790145  1.000000000

# Check if the estimates of different transitions probabilities correlated
if (any(abs(correlations_matrix) > 0.7)) {
  cat("There are some highly correlated transition probabilities.")
} else {
  cat("There are no highly correlated transition probabilities.")
}

## There are some highly correlated transition probabilities.

Part III. Testing the Markov Model

Question 8.

We introduce the following notations to denote the the probability of the all \(4\) kinds of situation.

\(P_{(L,L)}\): The probability that the transitioning from low pollution to low pollution.
\(P_{(L,H)}\): The probability that the transitioning from low pollution to high pollution.
\(P_{(H,L)}\): The probability that the transitioning from high pollution to low pollution.
\(P_{(H,H)}\): The probability that the transitioning from high pollution to high pollution.

The following is a formal hypothesis test in terms of the transition probabilities to test whether the probability of the pollution level being high is independent of the pollution level of the previous day.

Null Hypothesis (\(H_0\)): The probability of the pollution level being high is independent of the pollution level of the previous day. i.e.

\[ P_{(L, H)} = P_{(H, H)} \]

Alternative Hypothesis (\(H_1\)): The probability of the pollution level being high is dependent on the pollution level of the previous day.

\[ P_{(L, H)} \neq P_{(H, H)} \]

Question 9.

s_A_high <- counts_transitions[2, 2]
s_B_high <- counts_transitions[1, 2]
n_A_high <- counts_transitions[1, 2] + counts_transitions[2, 2]
n_B_high <- counts_transitions[1, 1] + counts_transitions[2, 1]

# Calculate the probability P(X <= s_A) 
P_l_high <- phyper(s_A_high, n_A_high, n_B_high, s_A_high + s_B_high)
# Calculate the probability P(X => s_A) 
P_g_high <- phyper(s_A_high - 1, n_A_high, n_B_high, s_A_high + s_B_high, lower.tail = FALSE)

# Calculate the p-value for being the high level pollution
p_value_high <- 2 * min(P_l_high, P_g_high)

# Display the p-value for being the high level pollution
cat("The p-value for the high level pollution is:", p_value_high, "\n")

## The p-value for the high level pollution is: 2.402968e-11

Question 10.

if (p_value_high < 0.05) {
  cat("Reject null hypothesis (H_0). There is some evidence that the probability of the po
      llution level being high is dependent on the pollution level of the previous day. \n")
} else {
  cat("Fail to reject null hypothesis (H_1). It is not evident enough the probability of 
      the pollution level being high is independent of the pollution level of the previous day. \n")
}

## Reject null hypothesis (H_0). There is some evidence that the probability of the po
##       llution level being high is dependent on the pollution level of the previous day.

Question 11.

We use the same notations as in Question 8

The following is a similar hypothesis test to determine whether the probability of the pollution level being low is independent of the previous days level.

Null Hypothesis (\(H_0\)): The probability of the pollution level being low is independent of the pollution level of the previous day. i.e.

\[ P_{(L, L)} = P_{(H, L)} \]

Alternative Hypothesis (\(H_1\)): The probability of the pollution level being low is dependent on the pollution level of the previous day.

\[ P_{(L, L)} \neq P_{(H, L)} \]

s_A_low <- counts_transitions[1, 1]
s_B_low <- counts_transitions[2, 1]
n_A_low <- counts_transitions[1, 1] + counts_transitions[2, 1]
n_B_low <- counts_transitions[2, 2] + counts_transitions[1, 2]


# Calculate P(X <= s_A_low)
P_1_low <- phyper(s_A_low, n_A_low, n_B_low, s_A_low + s_B_low)
# Calculate P(X >= s_A_low)
P_2_low <- phyper(s_A_low-1, n_A_low, n_B_low, s_A_low + s_B_low, lower.tail = FALSE)

# Calculate the p-value for being the low level pollution
p_value_low = 2*min(P_1_low, P_2_low)

# Display the p-value for being the low level pollution
cat("The p-value for being the low level pollution is:", p_value_low, "\n")

## The p-value for being the low level pollution is: 2.402968e-11

if (p_value_low < 0.05) {
  cat("Reject null hypothesis (H_0). There is some evidence that the probability of the
      pollution level being low is dependent on the pollution level of the previous day. \n")
} else {
  cat("Fail to reject null hypothesis (H_1). It is not evident enough the probability of 
      the pollution level being low is independent of the pollution level of the previous day. \n")
}

## Reject null hypothesis (H_0). There is some evidence that the probability of the
##       pollution level being low is dependent on the pollution level of the previous day.

Part IV. Investigating the Number of Consecutive Low Pollution Days

We now want to investigate the likelihood of consecutive low pollution days.

Question 12.

#Calculate the probability
P_HL <- transitions_matrix[2, 1]
P_LL <- transitions_matrix[1, 1]

# Calculate P(X_1=0,..., X_7=0 | X_0=1) = P(X_7 = 0 | X_6 = 0)...P(X_1 = 0 | X_0 = 1)
P_L_a_week <- P_HL * prod(rep(P_LL, 6))


# Print the probability

cat("The probability that after any high pollution day, it is over a week until the next
    high pollution day is", P_L_a_week, "\n")

## The probability that after any high pollution day, it is over a week until the next
##     high pollution day is 0.2297853

Question 13.

# Define a new function 

Prob_m <- function(m ,transitions_matrix) {
  P_HL <- transitions_matrix[2, 1]
  P_LL <- transitions_matrix[1, 1]
  
# Calculate P(M = m | X_0 = 1) = (P(X_n = 0 | X_{n-1} = 0))^(m - 1) P(X_1 = 0 | X_0 = 1)
  prob <- P_HL * prod(rep(P_LL, m-1))
  return(prob)
}

# Assume there are 365 consecutive low pollution days starting from the first day
x_values = 1 : 365

# Compute the probability in the following 364 days
y_values <- sapply(x_values, function(m) Prob_m(m, transitions_matrix))

# Plot the probability mass function
plot(x_values, y_values, type = "b", pch = 16, col = "green", xlab = "Number of consecut
     ive low pollution days", ylab = "Probability", main = "Probability mass function of 
     the number of consecutive low pollution days")

Question 14.

There is uncertainty in the question due to observations made from the tower_bridge.csv file. The initial days and the final days of the observation period exhibit low pollution levels. However, we cannot be certain whether the day before January 1st, 2022, had high pollution levels or not, and similarly, we cannot be certain whether the day after December 31st, 2022, had high pollution levels or not. Therefore, we consider the following two situations:

Situation 1: Count the number of days with low pollution levels at the beginning and end of the observation period.

The question could be transferred as following: There is an array composed of \(1\) and \(0\).Find the average length of consecutive \(0\)(s) between \(1\)(s).

We could define a new function to solve the transferred problem, whose input is a data-frame and the column name of this data-frame.

average_length_of_zeros <- function(df, column_name) {
  # Extract the column from the data-frame
  arr <- df[[column_name]]

  # Initialize variables
  count <- 0
  total_length <- 0
  num_sequences <- 0
  start_sequence <- 0

  # Check if the array starts with 0s
  if (arr[1] == 0) {
    start_sequence <- 1
  }

  # Iterate over the array
  for (i in arr) {
    if (i == 0) {
      count <- count + 1  # Count the zeros
    } else if (i == 1) {
      if (count > 0 | start_sequence == 1) {
        total_length <- total_length + count  # Add the length of the zero sequence
        num_sequences <- num_sequences + 1  # Count the number of zero sequences
        count <- 0  # Reset the zero counter
      }
      start_sequence <- 0
    }
  }

  # Check if the array ends with 0s
  if (arr[length(arr)] == 0 & count > 0) {
    total_length <- total_length + count
    num_sequences <- num_sequences + 1
  }

  # Calculate the average length of zero sequences
  if (num_sequences > 0) {
    average_length <- total_length / num_sequences
  } else {
    average_length <- 0
  }

  return(average_length)
}

average <- average_length_of_zeros(df, "state")

# Print the average number of consecutive low pollution days
cat("The average number of consecutive low pollution days between high pollution days is:
    ", average, "\n")

## The average number of consecutive low pollution days between high pollution days is:
##      8.166667

Situation 2: Do not count the number of days with low pollution levels at the beginning and end of the observation period.

We define a new function to solve this question, whose input is a data-frame and a column of this data-frame. But it will distract the number of 0(s) if it starts at beginning. It will also distract the number of 0(s) if it ends at the last.

average_length_of_zeros <- function(df, column_name) {
  # Extract the column from the data-frame
  arr <- df[[column_name]]

  # Initialize variables
  count <- 0
  total_length <- 0
  num_sequences <- 0
  start_sequence <- 0

  # Iterate over the array
  for (i in arr) {
    if (i == 0 && start_sequence == 1) {
      count <- count + 1  # Count the zeros
    } else if (i == 1) {
      if (count > 0) {
        total_length <- total_length + count  # Add the length of the zero sequence
        num_sequences <- num_sequences + 1  # Count the number of zero sequences
      }
      count <- 0  # Reset the zero counter
      start_sequence <- 1
    }
  }

  # Calculate the average length of zero sequences
  if (num_sequences > 0) {
    average_length <- total_length / num_sequences
  } else {
    average_length <- 0
  }

  return(average_length)
}


average <- average_length_of_zeros(df, "state")

# Print the average number of consecutive low pollution days
cat("The average number of consecutive low pollution days between high pollution days is:
    ", average, "\n")

## The average number of consecutive low pollution days between high pollution days is:
##      8.088235

There is also a one line solution using the rle function in R for Situation 1.

# Calculate the average number
consecutive_low_days <- rle(df$state)$lengths[which(rle(df$state)$values == 0)]
average_consecutive_low_days <- mean(consecutive_low_days)

# Print the average number of consecutive low pollution days
cat("The average number of consecutive low pollution days between high pollution days is:
    ", average_consecutive_low_days, "\n")

## The average number of consecutive low pollution days between high pollution days is:
##      8.166667

If the answer of this quesiton should be an integer (since it is the number of days), then the approximated answer of the average number of consecutive low pollution days between high pollution days should be \(8\).

Part V. Conclusion

Problem 15.

The following are some potential limitations of the present study:

Data limitations: The data utilized in this research is confined to a single location (Tower Bridge Road in London) and a single year (2022). Given that pollution levels can fluctuate significantly based on location and time frame, the findings may not be applicable to other locations or time frames.
Markov Assumption: This research presumes that the pollution level on a given day is solely dependent on the pollution level of the preceding day, a property known as the Markov property. However, pollution levels could be influenced by a multitude of factors such as weather conditions, traffic volume, time of year, etc. Neglecting these factors might oversimplify the model and affect the accuracy of the predictions.
Binary State: Pollution levels are categorized into two states in this study: high and low. This binary categorization might oversimplify the data and ignore important variations within each state. For instance, a pollution level of 41 µg/m³ is considered the same as a level of 100 µg/m³, even though the latter is much more harmful.
Independence of Days: The study assumes that the pollution levels of different days are independent given the pollution level of the previous day. However, this might not be the case in reality. For example, a series of low wind speed days could lead to a buildup of pollution over several days.
Time Homogeneity: The study assumes the time homogeneous across time periods. The formula \(P(X_n = j\mid X_{n - 1} = i) = P(X_1 = j \mid X_0 = j)\) may not always be true due to the time change with the climate and season change.
Model Validation: While the study uses the data to estimate the transition probabilities and simulate the Markov chain, it does not validate the model against actual data. Therefore, it’s unclear how well the model predicts real-world pollution levels.
Ignoring Overlaps: In counting the number of pairs of successive states, overlaps are considered. For example, the sequence 0100 corresponds to one (0,1) transition, one (1,0) transition, and one (0,0) transition. This might lead to an overestimation of transitions.

These limitations suggest that while the Markov model provides a useful tool for analyzing pollution data, it should be used with caution and its results should be interpreted in the context of its assumptions and limitations. Further studies could consider incorporating more factors into the model, using a more nuanced categorization of pollution levels, validating the model against actual data, and exploring other statistical models.