Welcome to the RMarkdown of my final project of my Biotechnology degree at the UAB. In this document you will find all the steps that were made to generate the figures and tables of the report.

The first thing I did was to open the excel that came from the Python processing of the XML files generated by JMrui. Of which, “tidyverse” allows you to do basic to complex tasks related to data management, “readxl” allows you to read excels, while “writexl” allows you to export them, and “janitor” allows you to clean certain values (like column titles) to be able to treat them more easily.

Read excel

Once the libraries are installed, I proceed to open the excel that came from the Python processing of the XML files generated by JMrui.

library(readxl)
excel_path = "C:/Users/User/Desktop/Python_Outputs_TFG.xlsx"
full_excel= read_excel(excel_path) 

Spectra analysis

Plotting all spectra superimposed

Once opened, I had to plot all the spectra according to MGMT status. This involved first clearing the column titles so that I could work with them, using “janitor”, followed by changing the “Positive” values to “Methylated” and “Negative” values “Unmethylated”, respectively.

library(janitor)
full_excel <- full_excel %>% clean_names()
full_excel$mgmt_status[full_excel$mgmt_status=="Negative"] <- "Unmethylated"
full_excel$mgmt_status[full_excel$mgmt_status=="Positive"] <- "Methylated"

Then, I grouped by methylation status by using the subset function, and with “ggplot2” the values with several aesthetic touches could be plotted.

mgmtpositiu <- full_excel %>% subset(mgmt_status == "Methylated")

library(ggplot2)
ggplot(data=mgmtpositiu, aes(x=x_axis, y=y_axis, group=id), lwd=1) +
  geom_line(color="#f8766d")+
  theme_minimal()+
  ylim(-0.2,0.4)+
  scale_x_reverse()+
  ggtitle("Overlaying of all methylated MGMT spectra")+
  xlab("Frequency (ppm)")+
  ylab("Intensity of signal (u.a)")

Afterwards, I did the same but with the color #00bfc4 to represent MGMT unmethylated patients.

Plotting the average intensity for each point spectra

This procedure was already more complicated than the previous ones. For this, I had to filter by frequency (in ppm) and calculate the average with its respective standard deviation.

With this data I was able to generate a clean spectrum that represented the sample, more similar to what I was used to from the literature.

library(tidyverse)
mitjana_pos <- full_excel %>% 
  subset(mgmt_status == "Methylated") %>% 
  group_by(x_axis) %>% 
  summarise(mean = mean(y_axis), sd=sd(y_axis))
mitjana_pos["lower"] <- mitjana_pos$mean - mitjana_pos$sd 
mitjana_pos["upper"] <- mitjana_pos$mean + mitjana_pos$sd 
head(mitjana_pos)
## # A tibble: 6 x 5
##   x_axis     mean      sd    lower    upper
##    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
## 1  -2.68  0       0        0       0       
## 2  -2.66 -0.00186 0.00313 -0.00499 0.00128 
## 3  -2.64 -0.00141 0.00367 -0.00508 0.00227 
## 4  -2.62 -0.00185 0.00347 -0.00532 0.00163 
## 5  -2.60 -0.00209 0.00301 -0.00511 0.000921
## 6  -2.58 -0.00196 0.00354 -0.00551 0.00158

In order to plot the standard deviation above and below, I had to add one standard deviation above (upper) and one standard deviation below (lower) each average point. This way, I could represent it with shading.

Once this is achieved, I plotted the data, adding a shading between the spectrum generated by the upper and lower standard deviation:

ggplot(data=mitjana_pos) +
  geom_line(aes(x=x_axis, y=mean), color="#f8766d", lwd=1)+  
  scale_x_reverse()+
  theme_minimal()+
  ggtitle("Average intensity per frequency of all methylated MGMT spectra")+
  xlab("Frequency (ppm)")+
  ylab("Intensity of signal (u.a)")+
  ylim(-0.05,0.3) +
  geom_line(aes(x=x_axis, y=upper), color="#f8766d", alpha = 0.3)+
  geom_line(aes(x=x_axis, y=lower), color="#f8766d", alpha = 0.3)+
  geom_ribbon(aes(x = x_axis,
                  ymin = upper,
                  ymax = lower),
              fill = "#f8766d",
              alpha = 0.3)

Plot of the superimposition of both average spectra

For this last spectra plot, in order to observe differences with the naked eye, I had to join the two previous plots, so I worked with the mean and sd data, but this time without filtering it by MGMT status.

comparison <- full_excel %>% 
  group_by(x_axis, mgmt_status) %>% 
  summarise(mean_point = mean(y_axis),sd=sd(y_axis))
comparison["lower"] <- comparison$mean_point - comparison$sd 
comparison["upper"] <- comparison$mean_point + comparison$sd 

Finally, the data was plotted:

ggplot(data=comparison) +
  geom_line(aes(x=x_axis, y=mean_point, col=mgmt_status),lwd=1)+
  geom_line(aes(x=x_axis, y=upper, col=mgmt_status), alpha = 0.3)+
  geom_line(aes(x=x_axis, y=lower, col=mgmt_status), alpha = 0.3)+
  geom_ribbon(aes(x = x_axis, ymin = upper, ymax = lower, fill = mgmt_status), alpha = 0.3)+
  scale_x_reverse()+
  theme_minimal()+
  ggtitle("Average intensity per frequency according to MGMT methylation status")+
  xlab("Frequency (ppm)")+
  ylab("Intensity of signal (u.a)")+
  ylim(-0.05,0.3) +
  scale_color_manual(values = c("#f8766d","#00bfc4"))+
  labs(color='MGMT status') +
  guides(fill = "none")

As we can appreciate, there are no visible differences, as I mentioned in the TFG article and poster.

Peak values and ratios distribution analysis

Metabolite peak distribution

Next, I moved on to check, from a perspective already of the important and typical peaks for MRS analysis, such as lipids, NAA, Creatine, Choline groups or Glycine/myo-Inositol.

For this purpose, I decided to use violin plots, as they make it easier to observe the distribution of values than boxplots.

I then looked for the highest point between two frequency values closest to the peak of these metabolites according to the literature, to find our frequency of choice.

Here there is an example of the methyl peak with frequency 0.9 according to the literature:

value_lipids_0_9_summary <- mitjana %>% 
  subset(x_axis<1 & x_axis>0.8) 
head(value_lipids_0_9_summary)
## # A tibble: 6 x 3
##   x_axis   mean     sd
##    <dbl>  <dbl>  <dbl>
## 1  0.803 0.0558 0.0178
## 2  0.822 0.0611 0.0201
## 3  0.841 0.0668 0.0220
## 4  0.86  0.0726 0.0213
## 5  0.879 0.0776 0.0193
## 6  0.898 0.0794 0.0225

The “x-axis” represents the frequency and the “mean” represents the intensity, of which I are interested in finding the maximum value.

After that, the maximum was measured with this function:

maximum <- max(value_lipids_0_9_summary$mean, na.rm = FALSE)
max_value <- value_lipids_0_9_summary %>% 
  filter(mean == maximum)
head(max_value)
## # A tibble: 1 x 3
##   x_axis   mean     sd
##    <dbl>  <dbl>  <dbl>
## 1  0.898 0.0794 0.0225

Finally, I filtered out the patient outliers and plot according to MGMT status.

value_lipids_0_9_summary <- full_excel %>% 
  subset(x_axis == 0.898, id!="16486790")
Methyl_plot <- ggplot(value_lipids_0_9_summary, aes(y = y_axis, x= mgmt_status, fill = mgmt_status))+
  geom_violin(trim=FALSE, color="black", lwd=1)+
  geom_boxplot(width=0.1,fill="white",color="black", lwd=1)+
  theme_minimal()+
  ylim(-0.05,0.4)+
  labs(title="Methyl peak at 0.898 ppms", y = "Intensity of signal (u.a)", fill="MGMT status")+
  theme(legend.position = "none", axis.title.x = element_blank())
plot(Methyl_plot)

This process, repeated for all metabolites, allows us to measure the maximum point of all peaks and to plot the whole by using the “patchwork” library:

library("patchwork")
(NAA_plot + Cr_plot + Cho_plot) + plot_layout(guides = 'collect', ncol = 3)

(Methyl_plot + Methylene_plot + mI_plot) + plot_layout(guides = 'collect', ncol = 3)

Ratios

It is also interesting to obtain the ratios, in order to obtain a uniform and standardized value between the ratio of these peaks.

First I made a database by selecting only the frequency values of the maximum peaks and calculated its mean values to obtain one value for each metabolite according to MGMT status.

df_metab <- full_excel %>% 
  clean_names() %>% 
  subset(x_axis == 1.281 | x_axis ==2.066 | x_axis == 0.898 | x_axis == 3.042 | x_axis ==3.214 | x_axis ==3.559)

df_metab["metab"] <- df_metab$x_axis
df_metab$metab[df_metab$metab == 0.898] <- "Methyl"
df_metab$metab[df_metab$metab == 1.281] <- "Methylene"
df_metab$metab[df_metab$metab == 2.066] <- "NAA"
df_metab$metab[df_metab$metab == 3.042] <- "Cr"
df_metab$metab[df_metab$metab == 3.214] <- "Cho"
df_metab$metab[df_metab$metab == 3.559] <- "mIo"

metab_plot <- df_metab %>% 
  subset(id!="16486790" & id!="12955113" & id!="12973765" & id!="12938551")

Once I had these data, I calculated the ratios between the peaks, such as NAA/Cr.

NAA <- metab_plot %>% 
  subset(metab == "NAA", select = -c(x_axis, metab)) %>% 
  group_by(id, mgmt_status) %>% 
  rename(NAA_values = y_axis)

Cr <- metab_plot %>% 
  subset(metab == "Cr", select = -c(x_axis, metab)) %>% 
  group_by(id, mgmt_status) %>% 
  rename(Cr_values = y_axis)

NAA_Cr <- inner_join(NAA, Cr, by=c("id"="id", "mgmt_status"="mgmt_status"))
NAA_Cr["values"] <- NAA_Cr$NAA_values/NAA_Cr$Cr_values
NAA_Cr <- NAA_Cr %>% 
  subset(select = -c(NAA_values,Cr_values)) %>% 
  subset(id != "10497431" & id!= "13046576" & id!="32652" & id!= "16342780" & id!="12951453" & id != "13024511" & id != "14300587" & id != "13395008")
NAA_Cr["ratio"] <- "NAA/Cr"

ggplot(NAA_Cr, aes(y = values, x= mgmt_status, fill = mgmt_status))+
  geom_violin(trim=FALSE, color="black", lwd=1)+
  geom_boxplot(width=0.1,fill="white",color="black", lwd=1)+
  theme_minimal()+
  ylim(-0.5,5)+
  labs(title="NAA/Cr ratio", y = "Intensity of signal (u.a)", fill="MGMT status")+
  theme(axis.text.x = element_blank())

This process, repeated for all metabolites, allows us to measure the maximum point of all peaks and to plot the whole by using the “patchwork” library:

(NAACr_plot + ChoCr_plot + NAACho_plot) + plot_layout(guides = 'collect', ncol = 3)

By analyzing all the plots, we can conclude that there seems to be an homogeneity in both groups.

Statistical analysis for metabolites between groups

Finally, to confirm that the spectra are significantly equal, I performed two statistical tests with p=0.05: Mann Whitney U test and t-student test. The reason for using these tests is in the TFG report.

The hypotheses to confirm for this analysis are:

Null hypothesis: Metabolite peak values from methylated and non-methylated MGMT patient are different.
Alternative hypothesis: Metabolite peak values from methylated and non-methylated MGMT patients are equal.

To do these calculations, I used the wilcox function along wit the t.test function with independent samples (paired = FALSE) and mu is a factor of the null hypothesis, which in independent cases is 0.

NAA_meth <- metab_plot %>% 
  filter(metab == "NAA", mgmt_status=="Methylated") %>% 
  subset(select = -c(x_axis, metab)) %>% 
  group_by(id, mgmt_status) %>% 
  rename(NAA_meth = y_axis)

NAA_unmeth <- metab_plot %>% 
  filter(metab == "NAA", mgmt_status=="Unmethylated") %>% 
  subset(select = -c(x_axis, metab)) %>% 
  group_by(id, mgmt_status) %>% 
  rename(NAA_unmeth = y_axis)

t_test_result = t.test(NAA_meth$NAA_meth, NAA_unmeth$NAA_unmeth, mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  NAA_meth$NAA_meth and NAA_unmeth$NAA_unmeth
## t = 0.58895, df = 185.13, p-value = 0.5566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.004180724  0.007739100
## sample estimates:
##  mean of x  mean of y 
## 0.07885887 0.07707968
wilcox_test_result = wilcox.test(NAA_meth$NAA_meth, NAA_unmeth$NAA_unmeth, mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
wilcox_test_result
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  NAA_meth$NAA_meth and NAA_unmeth$NAA_unmeth
## W = 4915, p-value = 0.6898
## alternative hypothesis: true location shift is not equal to 0

As we can see, the results of both tests have a p-value>0.05, so we would accept the null hypothesis and conclude that the two groups are significantly similar.

And this concludes the studies I did with Rstudio for the TFG report. The discussion and conclusions can be found in the report itself.

Thank you.