Lab 3: Simple Comparative Experiments

Author

Collin Lowery

Published

September 8, 2024

Part I: Introduction

This Investigation examines experimental bridge data from 1993 and 2023. Using these two data sets, the condition of the deck, superstructure, and the substructure of bridges are compared. First, normal probability plots are plotted using a transformed scale followed by Q-Q plots to gain an understanding of the normality of the data and the variance differences. Next, an F-test is performed to check the variances between the data sets. Finally, thee hypothesis tests are completed using pooled t-test, Welch t-test, and a paired t-test. Following these hypothesis tests, the results are discussed.

Part II: Understand Data

The data set for this investigation is from the FHA’s National Bridge Inventory Administration (2024a) (NBI) data set. The NBI data includes information about bridges across the United States (US) including state and county information, global coordinates, the year of construction, amount of traffic, type of structure, bridge span, and more. This work investigates information on the bridge deck, superstructure, and substructure condition for the years 1993 and 2023. The data set for 1993 consists of 25 columns and 668,434 rows whereas the data set for 2023 contains 33 columns and 621,547 rows.

The columns of the data set include different metrics by which a given bridge is classified and the rows represent an individual bridge. The columns contain the information listed below. Administration (2024b)

STATE_CODE_001 (1993, 2023): Defines the state or US territory the bridge resides in.
COUNTY_CODE_003 (1993, 2023): Defines the US County or parish the bridge resides in.
STRUCTURE_NUMBER_008 (1993, 2023): Gives an identification number to a given structure.
LAT_016 (1993, 2023): Global Latitude of the structure.
LONG_017 (1993, 2023): Global longitude of a given structure
FUNCTIONAL_CLASS_026 (1993): Functional class of Inventory Rate. (1993)
YEAR_BUILT_027 (1993, 2023): The year of structure construction.
ADT_029 (1993, 2023): Average Daily Traffic, a measure of how many vehicles use the bridge daily.
YEAR_ADT_030 (1993, 2023): Year that average daily traffic is recorded.
STRUCTURE_KIND_043A (1993, 2023): Type of material used and its design.
STRUCTURE_TYPE_043B (1993, 2023): The type of structure.
APPR_TYPE_044B (1993, 2023): The type of design and construction.
MAX_SPAN_LEN_MT_048 (1993, 2023): Maximum length of bridge span.
DECK_COND_058 (1993, 2023): Deck condition rating (0-9).
SUPERSTRUCTURE_COND_059 (1993, 2023): Superstructure condition rating (0-9).
SUBSTRUCTURE_COND_060: Substructure condition rating (0-9).
CULVERT_COND_062 (1993, 2023): The condition of the culverts.
OPR_RATING_METH_063: Method used to determine operating Rating.
INV_RATING_METH_065 (1993, 2023): Method used to determine inventory rating.
INVENTORY_RATING_066 (1993, 2023): Inventory rating.
STRUCTURAL_EVAL_067 (1993, 2023): Overall structural evaluation.
YEAR_RECONSTRUCTED_106 (1993, 2023): Year the structured was constructed.
DECK_STRUCTURE_TYPE_107 (1993, 2023): Type of materials used for the structure’s deck.
PERCENT_ADT_TRUCK_109 (1993, 2023): Average daily truck traffic.
SCOUR_CRITICAL_113 (1993, 2023): The amount of sediment on or flowing on the structure.
BRIDGE_CONDITION (2023): Condition of the bridge as defined by the Code of Federal Regulations (CFR)
LOWEST_RATING (2023): The lowest rating among the deck, superstructure, and substructure.
DECK_AREA (2023): Area of the deck of the structure as defined by the Code of Federal Regulations (CFR)

The columns of interest (vectors of interest) for this investigation are “DECK_COND_058,” “SUPERSTRUCTURE_COND_059,” and SUBSTRUCTURE_COND_060.” Evaluation of the bridge condition is quantified on a scale form 0 to 9 as per the NHA Administration (2024c). The rating criteria is described in the list below.

The condition ratings for bridges are on a scale from 0 to 9:

- 9: Excellent condition.

- 7-8: Good condition.

- 5-6: Fair condition.

- 3-4: Poor condition.

- 0-2: Critical condition requiring immediate attention.

Part III: Import Data into R

The code block below handles the initial set up for the rest of the R script. First, the necessary libraries are imported into R within the function “suppressPackageStartupMessages” to keep the output of the script clean. For this script, the required libraries are readxl, dyplr, stringer, and ggplot2. Next, the working directory is set as the directory that the .qmd file is located so that this location will act as the root directory. Then, the Excel files are read into the script where the NBI data from 1993 is declared as “data93,” and the data from 2023 is declared as “data23.” Finally, the first 5 rows and columns of each data set are displayed to verify they are imported correctly.

# Load necessary libraries
suppressPackageStartupMessages({
  library(readxl)
  library(dplyr)
  library(stringr)
  library(ggplot2)
})
# Set your working directory automatically.
WD <- getwd() 

if (!is.null(WD))
setwd(WD)

# Read in the Excel file
data93 <- read_excel("Source_Files/NBI1993.xlsx")
data23 <- read_excel("Source_Files/NBI2023.xlsx")

# Display the first few rows to understand the structure of the data
head(data93, n = c(5,5))

# A tibble: 5 × 5
  STATE_CODE_001 COUNTY_CODE_003 STRUCTURE_NUMBER_008 LAT_016  LONG_017 
  <chr>          <chr>           <chr>                <chr>    <chr>    
1 01             127             000002               33530000 087170000
2 01             127             000003               33550000 087200000
3 01             127             000004               33440000 087220000
4 01             075             000005               33340000 088041200
5 01             081             000006               32080000 085270000

head(data23, n = c(5,5))

# A tibble: 5 × 5
  STATE_CODE_001 COUNTY_CODE_003 STRUCTURE_NUMBER_008 LAT_016  LONG_017 
  <chr>          <chr>           <chr>                <chr>    <chr>    
1 01             053             00000000000S702      31061094 087341348
2 01             053             00000000000S703      31062020 087340890
3 01             113             0000000000M0022      32174330 084583799
4 01             059             000000883039900      34270600 087583100
5 01             079             000001014002450      34485200 087225400

Part IV: Check normality

With the data correctly imported, it is time to process the data. First, the code block below defines a function to compute the cumulative percentage where. This function brings in a given set of data and then finds the length of that vector. Once the cumulative percentage is computed for each row in the vector, the newly created vector is returned.

##### Cumulative Percentage Calculation Function #####
calc_cum_perc <- function(data_in) {
  n <- length(data_in)
  cum_perc <- ((1:n) - 0.5) / n * 100
  return(cum_perc)
}

The next code block defines a function that is used to prepare a given data set for plotting the Normal probability plot using a transformed scale and the quantile-quantile (Q-Q) plot. Both of these plots check for normality which helps define which hypothesis tests are usable. First, a given data set is passed to the function along with the column name which is the vector in the data frame which needs manipulation, and the group name so that the 1993 and 2023 data sets can be grouped separately. The data does, however, need to be filtered because the three vectors of interest (vectors holding the deck, superstructure, and substructure condition) have rows within their respective vectors thst include blank spaces or letters. The variable “filtered_data” is defined as the incoming data set (either data93 or data23). The following line uses the filter function to remove the blank rows from the data set and the next line filters out the rows with the letters. Each of the vectors of interest are of type “character”. Therefore, the function “str_detect” is used to read a row in a given vector to see if the row contains a letter using a regular expression that looks for only numerical enteries. If this row does include a letter, then the row is removed from the data set. Once the data is filtered, a new variable “working_column” is defined which is the current column of interest. For instance, if the vector for deck condition is being prepared, then the working column is “DECK_COND_058.” The working column is first unlisted and then is is converted from a character type to a double type. Then, the prepared data data frame is created so that the condition of the deck, superstructure, or substructure is sorted from least to greatest, the cumulative percentage is computed, and then the data is grouped according to the data set is comes from. Finally, the working column and prepared data are returned as a list.

#Make a function to prepare the data for plotting
prep_for_plotting <- function(dataset, column_name, group_name) {

  #Filtering out blank spaces and Letters
  filtered_data <- dataset %>%
    filter(!is.na(dataset[[column_name]]),
           str_detect(!!sym(column_name), "^[0-9]+$"))
  
  working_column <- unlist(filtered_data[column_name])
  working_column <- as.numeric(working_column)
  
  prepped_data <- data.frame(
    Condition = sort(working_column),
    CumulativePercentage = calc_cum_perc(working_column),
    Group = group_name
  )

  return(list(working_column = working_column, prepped_data=prepped_data))
  }

The next two functions in the code block below are the plotting functions. The first function plots the normal probability using a transformed scale and the second function plots the Q-Q plot. The first function first binds the two data sets for the 1993 and 2023 data for the given vector of interest. After this, the normal probability plot is created with the bound data set where the x axis is the condition of the surface and the y axis is the cumulative percentage. The x and y label are passed through the function so that they may be changed every time the function is called. The Q-Q plotting function is very similar to the normal probability in function, but this function requires the stat_qq function to convert the incoming data to quantile-quantile by computing the cumulative frequency and, from that, converting it to the standard normal z-score. Both of these functions end in returning the plot.

plotting_norm <- function(df1, df2, xLabel, yLabel) {
  # Combine the dataframes from 1993 and 2023
  df_combined <- rbind(df1, df2)
  
  # Normal probability plot with fitted lines
  plot <- ggplot(df_combined, aes(x = Condition, y = CumulativePercentage, color = Group)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
    labs(x = xLabel,
         y = yLabel) +
    theme_minimal()
  return(plot)
}

plotting_qq <- function(df1, df2, xLabel, yLabel) {

  #Plot the normal probability plot with fitted lines
  plot <- ggplot() +
  stat_qq(aes(sample = df1, color = Group), color = "blue") +
  stat_qq_line(aes(sample = df1), color = "blue") +
  stat_qq(aes(sample = df2, color = Group), color = "red") +
  stat_qq_line(aes(sample = df2), color = "red") +
  labs(x = xLabel,
       y = yLabel) +
  theme_minimal()
  return(plot)
}

The final step before plotting is to pass the data sets “data93,” “data23,” the vector of interest’s name, and the group name through the “prep-for-plotting” function.

##### Checking the Normality of the dataset for Deck Condition #####
data93_deck_cond <- prep_for_plotting(data93, "DECK_COND_058", "Deck_Condition_1993")
data23_deck_cond <- prep_for_plotting(data23, "DECK_COND_058", "Deck_Condition_2023")


data93_sup_cond <- prep_for_plotting(data93, "SUPERSTRUCTURE_COND_059", "Superstructure_Condition_1993")
data23_sup_cond <- prep_for_plotting(data23, "SUPERSTRUCTURE_COND_059", "Superstructure_Condition_2023")


data93_sub_cond <- prep_for_plotting(data93, "SUBSTRUCTURE_COND_060", "Subsutructure_Condition_1993")
data23_sub_cond <- prep_for_plotting(data23, "SUBSTRUCTURE_COND_060", "Subsutructure_Condition_2023")

Now that the data is ready for plotting, the new prepared data and the x and y label names are passed through the plotting functions as shown in the 6 code blocks below.

Figure 1 shows the normal probability plot of the deck condition for the 1993 and 2023 data sets where the former is orange and the latter is green. This plot shows a higher density of occurrences on the right hand side of the x-axis showing that this data is skewed right. This skew means that more of the deck conditions are in the fair to excellent condition. The slopes of thee plots show the variance of the plot which is proportional to the standard deviation. Is the slopes are very different from each other, then a pooled t-test is not advised as the variance are too different. These sloped are quite different, but a pooled t-test is to be performed for the interests of this study.

plotting_norm(data93_deck_cond$`prepped_data`, data23_deck_cond$`prepped_data`,"Bridges (Deck Condition Rating)", "Cumulative Percent")

Figure 1: Normal probability plot of deck condition using a transformed scale: 1993 vs 2023

Figure 2 shows similar results to Figure 1 where we see a right skew meaning that most superstructures are in fair to excellent condition. Furthermore, the slopes of these fitted lines are quite different, but a pooled t-test will be completed.

plotting_norm(data93_sup_cond$`prepped_data`, data23_sup_cond$`prepped_data`, "Bridges (Superstructure Rating)", "Cumulative Percent")

Figure 2: Normal probability plot of superstructure condition using a transformed scale: 1993 vs 2023

Figure 3, again, shows a right skewed data set meaning that most of the substructures are in poor to fair condition. Furthermore, the slopes of these line vary quite a bit, but a pooled t-test is performed for completion.

plotting_norm(data93_sub_cond$`prepped_data`, data23_sub_cond$`prepped_data`, "Bridges (Substructure Condition Rating)", "Cumulative Percent")

Figure 3: Normal probability plot of substructure condition using a transformed scale: 1993 vs 2023

Another way to observe the distribution of a data set is through a Q-Q plot. A Q-Q plot’s x-axis shows the normal distribution of quantiles where the y-axis show the sample quantiles. The closer these are to the fitted line, the closer the data set is to a normal distribution.

Figure 4 shows the Q-Q plot for the deck condition in 1993 (blue) and 2023 (red). Here, it is evident that the slope, or variance, of these plots are very different. In fact, they are worse than the normal probability and it is skewed left meaning that the mean deck condition is higher than the median deck condition.

plotting_qq(data93_deck_cond$`working_column`, data23_deck_cond$`working_column`, "Theoretical Quantiles", "Sample Quantiles (Superstructure Rating)")

Figure 4: Normal probability Q-Q plot of deck condition: 1993 (Blue) vs 2023 (Red)

Figure 5 Reveals a similar result to Figure 4 where the plot is skewed left and the slope of the lines are very dissimilar.

plotting_qq(data93_sup_cond$`working_column`, data23_sup_cond$`working_column`, "Theoretical Quantiles", "Sample Quantiles (Superstructure Rating)")

Figure 5: Normal probability Q-Q plot of superstructure condition: 1993 (Blue) vs 2023 (Red)

Figure 6, again shows a similar result to Figure 4 and Figure 5 where the slopes vary dramatically and the data is skewed left.

plotting_qq(data93_sub_cond$`working_column`, data23_sub_cond$`working_column`, "Theoretical Quantiles", "Sample Quantiles (Superstructure Rating)")

Figure 6: Normal probability Q-Q plot of substructure condition: 1993 (Blue) vs 2023 (Red)

Analysis of this data shows that this data follows a uniform distribution rather than a normal distribution as the trend lines more closely align with the data on the normal probability plots than the Q-Q plots. However, according to the central limit theorem, the tests that are run in this study do not require the data to be normally distributed because the sample size is so large (n>30). In this case, there is enough data that the distribution will have a negligible effect on the results.

Part V: Perform an F Test for Equality of Variances

The f-test shows how different the variances are with between two data sets for a given confidence interval. The code block below shows a simple function which takes in the vectors of interest for 1993 and 2023 and runs an f-test to compare the variance of these data sets with 95% confidence then returns the result of the test. This test is run for each vector of interest and then taken out of a list usinf the unlost function for visualization.

# default two tail test, conf.level=0.95 
f_test <- function(df1, df2) {
  test_result <- var.test(df1,df2,ratio=1,alternative="two.sided",conf.level=0.95)
  return(test_result)
}

deck_cond_F_test_result <- f_test(data93_deck_cond$`working_column`, data23_deck_cond$`working_column`)
deck_cond_F_test_result <- unlist(deck_cond_F_test_result)

sup_cond_F_test_result <- f_test(data93_sup_cond$`working_column`, data23_sup_cond$`working_column`)
sup_cond_F_test_result <- unlist(sup_cond_F_test_result)

sub_cond_F_test_result <- f_test(data93_sub_cond$`working_column`, data23_sub_cond$`working_column`)
sub_cond_F_test_result <- unlist(sub_cond_F_test_result)

The results for the f-test for the deck condition are shown below. Here, the f-statistic (1.673) is greater than the critical value 1 (1.666). Furthermore, because the p-value is zero, which is less than 0.05, we reject the null hypothesis and accept the alternative since the variances are indeed different.

print("F-Test Result for Deck Condition:")

[1] "F-Test Result for Deck Condition:"

print(deck_cond_F_test_result)

                      statistic.F                  parameter.num df 
               "1.67286188932872"                          "469711" 
               parameter.denom df                           p.value 
                         "469888"                               "0" 
                        conf.int1                         conf.int2 
               "1.66611682859832"                "1.67964818225032" 
      estimate.ratio of variances     null.value.ratio of variances 
               "1.67286188932872"                               "1" 
                      alternative                            method 
                      "two.sided" "F test to compare two variances" 
                        data.name 
                    "df1 and df2"

The f-test for the superstructure condition reveals the same conclusion. The f-statistic (1.581) is greater than critical value 1 (1.574) and the p-value is zero which is less than 0.05. Therefore, we reject the null hypothesis and select the alternative because the variances are significantly different.

print("F-Test Result for Superstructure Condition:")

[1] "F-Test Result for Superstructure Condition:"

print(sup_cond_F_test_result)

                      statistic.F                  parameter.num df 
               "1.58063043734317"                          "475265" 
               parameter.denom df                           p.value 
                         "474956"                               "0" 
                        conf.int1                         conf.int2 
               "1.57427953117701"                "1.58699395182095" 
      estimate.ratio of variances     null.value.ratio of variances 
               "1.58063043734317"                               "1" 
                      alternative                            method 
                      "two.sided" "F test to compare two variances" 
                        data.name 
                    "df1 and df2"

The final f-test for the substructure condition shows that the f-statistic (1.646) is greater than critical value 1 (1.639) and the p-value is still zero. Therefore, the null hypothesis is again rejected since the variance between the two data sets are significantly different.

print("F-Test Result for Substructure Condition:")

[1] "F-Test Result for Substructure Condition:"

print(sub_cond_F_test_result)

                      statistic.F                  parameter.num df 
                "1.6456888278034"                          "475788" 
               parameter.denom df                           p.value 
                         "474802"                               "0" 
                        conf.int1                         conf.int2 
               "1.63907544859456"                "1.65231533853686" 
      estimate.ratio of variances     null.value.ratio of variances 
                "1.6456888278034"                               "1" 
                      alternative                            method 
                      "two.sided" "F test to compare two variances" 
                        data.name 
                    "df1 and df2"

Part VI: Perform Hypothesis Test

To further understand this data, t-tests are performed on the 1993 and 2023 data sets to compare them. It is known that pooled t-tests are not good for data sets with significantly different variances, but this investigation will complete a pooled t-test for a deeper understanding. The function below reads in the two data sets passed to it, completes both pooled and Welch t-tests, the returns a list with the results of each test inside.

pooled_welch_test <- function(df1, df2) {
  pooled_t <- t.test(df1, df2, alternative="two.sided", var.equal = TRUE) 
  
  #assuming variances are different , Welch's or separate-t test
  welch_t <- t.test(df1, df2, alternative="two.sided",var.equal = FALSE) 
  
  return(list(pooled_t = pooled_t, welch_t = welch_t))
}

The code chunk below calls the function to complete the pooled and Welch t-test and stores as lists.

deck_cond_result <- pooled_welch_test(data93_deck_cond$`working_column`, data23_deck_cond$`working_column`)

sup_cond_result <- pooled_welch_test(data93_sup_cond$`working_column`, data23_sup_cond$`working_column`)

sub_cond_result <- pooled_welch_test(data93_sub_cond$`working_column`, data23_sub_cond$`working_column`)

The underlying assumptions for the pooled t-tests is that the populations are independent, the population variances are equal, and the set of data is either normally distributed or very large. The first assumption is met since the data comes from two different sets of deck, superstructure, and substructure condition rating from 30 years apart. One observation is that pooled t-tests are not ideal for populations with dramatically different variances. Based on the normality plots and the f-tests, it is shown the the variances are different. However, for the sake of experimentation and learning, the pooled t-test is still completed for a deeper investigation. The Welch t-test does not assume the variances are similar, but does assumes that the sets of data are independent, normally distributed, and there are no significant outliers. As previously stated, the data sets are independent and, based on the normality plots, are asymmetrically normally distributed. Finally, there are no significant outliers within the data since this is a categorical data set with set upper and lower bounds. It is important to note that since both of these tests for each vector is computed with 95% confidence. Therefore, the critical p-value is 0.05.

The results for both the pooled and Welch t-test for the deck condition are shown below. Both of these tests reveal the the p-value is greater than the critical p-value as 0.196 > 0.05. Therefore, the null hypothesis is not rejected. Based on this information, there is not enough evidence to prove that the deck condition in 2023 is different from the deck condition in 1993.

print("Pooled and Welch T-Test Results for Deck Condition:")

[1] "Pooled and Welch T-Test Results for Deck Condition:"

print(deck_cond_result$pooled_t)


    Two Sample t-test

data:  df1 and df2
t = -1.2928, df = 939599, p-value = 0.1961
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.008372583  0.001717379
sample estimates:
mean of x mean of y 
 6.550263  6.553590

print(deck_cond_result$welch_t)


    Welch Two Sample t-test

data:  df1 and df2
t = -1.2927, df = 883450, p-value = 0.1961
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.008372823  0.001717619
sample estimates:
mean of x mean of y 
 6.550263  6.553590

Below are the results for the pooled and Welch t-tests for the superstructure condition. Both of these tests show that the p-value is less than the critical p-value. The pooled t-test shows that p-value = 7.277e-9 < 0.05 and the Welch t-test shows that the p-value = 7.259e-9 < 0.05. Therefore, the null hypotheses for both of these tests are rejected and the alternative is selected meaning that there is a difference between the superstructure conditions in 1993 and 2023.

print("Pooled and Welch T-Test Results for Superstructure Condition:")

[1] "Pooled and Welch T-Test Results for Superstructure Condition:"

print(sup_cond_result$pooled_t)


    Two Sample t-test

data:  df1 and df2
t = -5.7845, df = 950221, p-value = 7.277e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02095648 -0.01034910
sample estimates:
mean of x mean of y 
 6.593533  6.609186

print(sup_cond_result$welch_t)


    Welch Two Sample t-test

data:  df1 and df2
t = -5.7849, df = 904681, p-value = 7.259e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02095609 -0.01034949
sample estimates:
mean of x mean of y 
 6.593533  6.609186

The results for the pooled and Welch t-tests for the substructure condition are shown below. Machine zero is the closest point a computer can get to zero through computation since computers are mathematically discrete rather than continuous in nature. Both p-values for each test of the substructure achieves double precision machine zero at 2.2e-16. Therefore, the p-value for the pooled and Welch t-test is zero which is less that the critical p-value of 0.05. Consequently, the null hypotheses for both tests are rejected and the alternative is selected meaning that there is a difference between the substructure condition in 1993 and 2023.

print("Pooled and Welch T-Test Results for Substructure Condition:")

[1] "Pooled and Welch T-Test Results for Substructure Condition:"

print(sub_cond_result$pooled_t)


    Two Sample t-test

data:  df1 and df2
t = -29.921, df = 950590, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.08747532 -0.07671973
sample estimates:
mean of x mean of y 
 6.431189  6.513287

print(sub_cond_result$welch_t)


    Welch Two Sample t-test

data:  df1 and df2
t = -29.928, df = 897983, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.08747396 -0.07672110
sample estimates:
mean of x mean of y 
 6.431189  6.513287

A paired t-test is a better test for these data sets because there are two condition ratings for the two different years the rating are recorded. Some of the ratings are for the for the same structure so it is possible to compare each structure directly. The paired t-test needs some data filtering and manipulation before it is completed. A paired t-test is a test where the investigator has two samples from the same bridge but are years apart. Therefore, the bridges must be correlated between the two data sets of 1993 and 2023 so that the same bridge is being compared 30 years apart. First, a data set along with a vector of interest’s header name is passed to the function where it is first filtered as before in the function “prep_for_plotting.” The vector we are using to correlate similar bridges is the “STRUCTURE_NUMBER_008” vector which is declared as the “structure_number” variable. The “working_column” variable is also unlisted and converted to numeric just as the function “prep_for_plotting” does. Finally, the structure number and working column are combined in a list and returned.

paired_test_filter <- function(dataset, column_name) {
  filtered_data <- dataset %>%
    filter(!is.na(dataset[[column_name]]),
             str_detect(!!sym(column_name), "^[0-9]+$"))
  
  structure_number <- filtered_data$`STRUCTURE_NUMBER_008`
  
  working_column <- unlist(filtered_data[column_name])
  working_column <- as.numeric(working_column)
  
  paired_test_data <- list(structure_number = structure_number, working_column = working_column)
    
    return(paired_test_data)
}

The code block below manipulates the data and runs the t-test. First, df1 and df2 (data93 and data23, respectively) along with the vector of interest’s header name are passed to the function. df1 and df2 are then filtered through the function in the previous code block. The 2023 and 1993 vectors of interest are not of the same length. These vectors need to be the same length so that the matching and paired t-test function can index these correctly and match the structure numbers. To make these lists containing the structure number and the working column the same length, the vectors of these lists are found first. Next, the script needs to know which vector is longer and this is completed through an if statement. This statement declares that if length of the 1993 data set is longer than the 2023 data set, the longer and shorter length variables are set as the lengths of 1993 data and 2023 data, respectively. This if statement also defines the longer structure and working vector as the 1993 data set. If this inequality is not true, then the else statement is used where the longer and shorter vector definition is switched and the longer structure and working vectors are from the 2023 data set. The next set of code is a while loop which parses the longer vector and removes rows at random indicies until the lengths of the longer and shorter data set are the same. There are so many samples from this data set that deleting a couple hundred of samples from the longer one is negligible. The final if statement replaces the working column from the df93 or df23 with the longer structure and working column manipulated in this function. To match the rows in the vector of interest to the structure number, a data frame for the 1993 and 2023 are created first. Next, the merge function is used to merge the two data frames by vector 1 (structure number) and then is converted back to a list to remain consistent with the rest of the script. Finally, the paired t-test is run and the results are returned.

paired_test <- function(df1, df2, column_name) {

  df93 <- paired_test_filter(df1, column_name)
  df23 <- paired_test_filter(df2, column_name)
  
  # Determine the longer vector and its length
  len1 <- length(df93$`structure_number`)
  len2 <- length(df23$`structure_number`)

  if (len1>len2){
      longer_length  <- len1
      shorter_length <- len2
      longer_structure_vector  <- df93$`structure_number`
      longer_working_vector   <- df93$`working_column`
  } else {
      longer_length  <- len2
      shorter_length <- len1
      longer_structure_vector  <- df23$`structure_number`
      longer_working_vector    <- df23$`working_column`
  }
  
  while (longer_length > shorter_length) {
    index_to_remove <- sample(1:longer_length, 1, replace = FALSE)
    longer_structure_vector <- longer_structure_vector[-index_to_remove]
    longer_working_vector <- longer_working_vector[-index_to_remove]
    longer_length <- length(longer_structure_vector)
  }
  
  if (len1>len2){
      df93$`structure_number` <- longer_structure_vector
      df93$`working_column`   <- longer_working_vector
  } else {
      df23$`structure_number` <- longer_structure_vector
      df23$`working_column`   <- longer_working_vector
  }
  

# Sample data
  df1 <- data.frame(
    vector1 = df93$`structure_number`,
    vector2 = df93$`working_column`
  )
  
  df2 <- data.frame(
    vector1 = df23$`structure_number`,
    vector2 = df23$`working_column`
  )
  
  # Merge the two data frames by 'vector1'
  merged_df <- merge(df1, df2, by = "vector1")
  
  # Convert back to a list
  new_list <- list(new_vector1 = merged_df$vector2.x, new_vector2 = merged_df$vector2.y)

  paired_t <- t.test(new_list$`new_vector1`, new_list$`new_vector2`, paired = TRUE)
    return(paired_t)
  
    }

The code blocks below passedsthe 1993 and 2023 data sets to the “paired_test” function as wells as the vector of interest’s name to obtain the paired t-test result. Each of these paired t-tests are run with a 95% confidence interval. Consequently, the critical p-value is 0.05.

Examining the results below for the paired t-test for the deck condition, it is shown that the p-value is 2.2e-16 which is machine zero using double precision. Since the p-value = 0 < 0.05, the null hypothesis is rejected and the alternative is selected meaning that there is a difference between the deck conditions between 1993 and 2023 when comparing identical structures.

print("Paired T-Test Results for Deck Condition:")

[1] "Paired T-Test Results for Deck Condition:"

paired_deck_cond_result <- paired_test(data93, data23, "DECK_COND_058")
print(paired_deck_cond_result)


    Paired t-test

data:  new_list$new_vector1 and new_list$new_vector2
t = 138.38, df = 311219, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.3860035 0.3970953
sample estimates:
mean difference 
      0.3915494

The paired t-test results below are for the superstructure vector. Like the deck condition result, the p-value is machine zero. Therefore, since the p-value is less than 0.05, the null hypothesis is rejected and the alternative is selected. This conclusion shows that there is a difference in superstructure condition between 1993 and 2023 when comparing identical bridges.

print("Paired T-Test Results for Superstructure Condition:")

[1] "Paired T-Test Results for Superstructure Condition:"

paired_sup_cond_result <- paired_test(data93, data23, "SUPERSTRUCTURE_COND_059")
print(paired_sup_cond_result)


    Paired t-test

data:  new_list$new_vector1 and new_list$new_vector2
t = 160.6, df = 316830, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.4544260 0.4656544
sample estimates:
mean difference 
      0.4600402

The final paired t-test examines the substructure vector which obtains the same conclusion as the previous two. The p-value is machine zero for this test as well and, consequently, the null hypothesis is rejected and the alternative hypothesis is selected. Therefore, it is concluded that there is a difference between the substructure conditions in 1993 and 2023 when comparing identical structures.

print("Paired T-Test Results for Substructure Condition:")

[1] "Paired T-Test Results for Substructure Condition:"

paired_sub_cond_result <- paired_test(data93, data23, "SUBSTRUCTURE_COND_060")
print(paired_sub_cond_result)


    Paired t-test

data:  new_list$new_vector1 and new_list$new_vector2
t = 142.82, df = 316579, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.4062808 0.4175867
sample estimates:
mean difference 
      0.4119338

Part VII: Conclusions

This analysis compares the condition of bridge decks, superstructures, and substructures. First, the data sets are plotted as a normal probability plot using a normally transformed sale and a Q-Q plot. The population variances are represented by the slope of the fitted line and the variances are closer on the normal probability plot than the Q-Q plot. From these plots, the data is asymmetrically uniformly distributed since the data matched the fitted lines on the normal probability plots closer than the Q-Q plots.

Each of the F-tests reject the null hypotheses since the p value is less than the critical p-value of 0.05. This shows that the variances of the data sets are not similar. From this, it is concluded that there is a difference between the deck, superstructure, and substructure conditions between 1993 and 2023. Next, the pooled and Welch t-tests and conducted. The deck condition does not reject the null hypothesis as the p-value for both the pooled and Welch t-tests are greater than the critical p-value of 0.05. This shows that there is not much of a difference between the deck conditions in 1993 compared to 2023. The superstructure and the substructure, however, do reject the null hypothesis meaning that there is a difference between the conditions of the bridges substructure and substructure. This discrepancy in hypothesis rejection is likely due to the fact that the deck of these bridges are repaved through out the years, but the superstructure and substructure are not maintained as well and, therefore, deteriorate over time.

Finally, a paired t-test is performed on the deck, superstructure, and substructure condition. After some data filtering and manipulation the structure numbers are matched between the 1993 and 2023 data sets for the paired t-test. For each data set, the null hypothesis is rejected as the p-value for each test is less than the critical p-value of 0.05. The paired test shows that for a given structure, there is a difference in the bridge’s deck, superstructure, and substructure condition.

References

Administration, Transportation Highway. 2024a. “NBI ASCII Files.” https://www.fhwa.dot.gov/bridge/nbi/ascii.cfm.

———. 2024b. “NBI Record Format.” https://www.fhwa.dot.gov/bridge/nbi/format.cfm.

———. 2024c. “Tables of Frequently Requested NBI Information.” https://www.fhwa.dot.gov/bridge/britab.cfm.