First, we need to know what the data look like. To do this we start with Exploratory Data Analysis (EDA); in R we can use the dim() function and glimpse() (from dplyr). The results are as follows:
dim(nusantics)
## [1] 708 18
glimpse(nusantics)
## Rows: 708
## Columns: 18
## $ OTU_ID <chr> "Unassigned;__;__;__;__;__", "d__Bacteria;__;__;__;__;__...
## $ V_H_T_S42 <dbl> 4.757438e-04, 1.603631e-04, 1.069087e-05, 0.000000e+00, ...
## $ V_H_U_S47 <dbl> 1.761104e-04, 1.806260e-05, 3.612521e-05, 0.000000e+00, ...
## $ V_I_T_S48 <dbl> 6.711493e-04, 4.682437e-05, 0.000000e+00, 0.000000e+00, ...
## $ V_I_U_S4 <dbl> 6.755967e-04, 3.897673e-05, 0.000000e+00, 0.000000e+00, ...
## $ V_M_T_S30 <dbl> 1.249270e-03, 0.000000e+00, 0.000000e+00, 0.000000e+00, ...
## $ V_M_U_S35 <dbl> 6.459123e-04, 6.391840e-04, 0.000000e+00, 0.000000e+00, ...
## $ V_R_T_S18 <dbl> 2.908848e-03, 0.000000e+00, 0.000000e+00, 0.000000e+00, ...
## $ V_R_U_S23 <dbl> 0.0023409331, 0.0000000000, 0.0000000000, 0.0000000000, ...
## $ V_S_T_S5 <dbl> 3.551619e-03, 0.000000e+00, 1.789229e-05, 0.000000e+00, ...
## $ V_S_U_S11 <dbl> 2.328064e-03, 0.000000e+00, 0.000000e+00, 0.000000e+00, ...
## $ V_Sh_T_S24 <dbl> 1.379366e-03, 0.000000e+00, 0.000000e+00, 5.034182e-05, ...
## $ V_Sh_U_S29 <dbl> 0.0010325008, 0.0000000000, 0.0000000000, 0.0000000000, ...
## $ V_SO_1_S3 <dbl> 0.000000e+00, 9.371896e-05, 0.000000e+00, 0.000000e+00, ...
## $ V_SO_2_S10 <dbl> 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, ...
## $ V_S0_3_S17 <dbl> 0.0000000000, 0.0000000000, 0.0000000000, 0.0000000000, ...
## $ V_V_T_S36 <dbl> 2.043031e-03, 0.000000e+00, 0.000000e+00, 0.000000e+00, ...
## $ V_V_U_S41 <dbl> 4.245535e-03, 5.493298e-05, 0.000000e+00, 0.000000e+00, ...
From the summary above we know there are 708 rows and 18 columns. The first column (OTU_ID) identifies the bacteria, each of the other 17 columns represents a sample, and each row holds the abundance of one bacteria in each sample. So in this dataset we have 708 bacteria, and not a single sample contains all of them; for instance, we can see that in sample V_I_T_S48 several bacteria have a value of zero. The next step is to find out how many types of bacteria each sample contains. The result is shown as a table and a bar chart here:
n_bacteria_sample %>% arrange(desc(n))
## Sample n
## 1 V_H_T_S42 410
## 2 V_H_U_S47 331
## 3 V_M_U_S35 237
## 4 V_Sh_T_S24 201
## 5 V_M_T_S30 198
## 6 V_Sh_U_S29 186
## 7 V_I_T_S48 162
## 8 V_V_T_S36 162
## 9 V_R_U_S23 140
## 10 V_V_U_S41 140
## 11 V_S_T_S5 130
## 12 V_I_U_S4 117
## 13 V_S_U_S11 113
## 14 V_R_T_S18 100
## 15 V_S0_3_S17 43
## 16 V_SO_1_S3 33
## 17 V_SO_2_S10 32
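The n_bacteria_sample table is used here without its definition. A minimal sketch of how such a table could be built, assuming that n is the number of rows with a non-zero abundance in each sample column. The toy data frame and the name n_bacteria_toy are made up for illustration, and base R is used instead of dplyr so the example runs on its own:

```r
# Toy stand-in for `nusantics`: an OTU_ID column plus one numeric column per sample.
# All values are invented for illustration.
toy <- data.frame(
  OTU_ID = c("otu_a", "otu_b", "otu_c", "otu_d"),
  S1 = c(0.5, 0.0, 0.3, 0.2),
  S2 = c(0.0, 0.0, 1.0, 0.0),
  S3 = c(0.1, 0.4, 0.0, 0.5),
  stringsAsFactors = FALSE
)

# Assumption: a bacteria "is present" in a sample when its abundance is > 0.
# colSums() over the logical matrix counts the TRUE values per sample column.
counts <- colSums(toy[, -1] > 0)

# Same layout as the table above: one row per sample, sorted by n descending.
n_bacteria_toy <- data.frame(Sample = names(counts), n = unname(counts))
n_bacteria_toy <- n_bacteria_toy[order(-n_bacteria_toy$n), ]
print(n_bacteria_toy)
```

Applied to the real data, the same idea would be colSums(nusantics[, -1] > 0), provided every column except OTU_ID is numeric.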
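library(plotly)   # plot_ly(), add_bars(), layout()
library(forcats)  # fct_reorder()

# Bar chart of the number of bacteria types per sample, tallest bar first
n_bacteria_sample %>%
  plot_ly(x = ~fct_reorder(Sample, n, .desc = TRUE), y = ~n) %>%
  add_bars() %>%
  layout(title = "Number of bacteria types per sample",
         xaxis = list(title = "Sample"),
         yaxis = list(title = "Number of bacteria types (n)"))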
Based on the bar chart, sample V_H_T_S42 contains the most types of bacteria (410), and sample V_SO_2_S10 contains the fewest (only 32). The bar chart still does not show how abundance is distributed among the bacteria, so the next visualization plots the abundance of each bacteria in each sample. Because there are 17 samples, the trelliscope package and faceting would be handy. The visualization for each sample looks like this:
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_H_T_S42) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_H_U_S47) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_I_T_S48) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_I_U_S4) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_M_T_S30) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_M_U_S35) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_R_T_S18) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_R_U_S23) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_S_T_S5) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_S_U_S11) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_Sh_T_S24) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_Sh_U_S29) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_SO_1_S3) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_SO_2_S10) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_S0_3_S17) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_V_T_S36) %>%
  add_markers()
nusantics_log %>%
  plot_ly(x = ~OTU_ID, y = ~V_V_U_S41) %>%
  add_markers()
All of the plots above use a log transformation of the raw data, which makes the points easier to tell apart. Even though the scale is logarithmic, the interpretation is straightforward: the higher a point sits on the y-axis, the more abundant that bacteria is in the sample. After plotting every sample separately, we can see which bacteria has the highest abundance in each sample.
Moving on to Question 2.
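The nusantics_log object is used in the plots without its definition. A minimal sketch of one way such a table could be derived and then queried for the most abundant bacteria per sample. It assumes a base-10 log with zero abundances turned into NA (the log of zero is undefined); the original transformation may differ. The toy data and the names toy_log and top_taxon are made up for illustration:

```r
# Toy stand-in for `nusantics` (invented values): OTU_ID plus two sample columns.
toy <- data.frame(
  OTU_ID = c("otu_a", "otu_b", "otu_c"),
  S1 = c(1e-4, 0, 1e-2),
  S2 = c(0, 5e-3, 5e-5),
  stringsAsFactors = FALSE
)

sample_cols <- setdiff(names(toy), "OTU_ID")

# Assumption: log10 of every non-zero abundance; zeros become NA so they
# simply drop out of the plots instead of showing up as -Inf.
toy_log <- toy
toy_log[sample_cols] <- lapply(toy[sample_cols], function(x) {
  ifelse(x > 0, log10(x), NA_real_)
})

# For each sample, the OTU with the largest (log) abundance.
# which.max() skips NA values.
top_taxon <- sapply(sample_cols, function(s) {
  toy_log$OTU_ID[which.max(toy_log[[s]])]
})
print(top_taxon)
```

Because the log is monotonic, the ranking of bacteria within a sample is the same on the raw and the log scale; only the spacing between points changes.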
“Principal Coordinates Analysis (PCoA, = Multidimensional scaling, MDS) is a method to explore and to visualize similarities or dissimilarities of data. It starts with a similarity matrix or dissimilarity matrix (= distance matrix) and assigns for each item a location in a low-dimensional space, e.g. as a 3D graphics.” (quote from here)
So here we build a distance matrix between the samples, run pcoa() on it, and then plot the result with biplot() and plot_ly() to see the dissimilarity between samples.
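library(vegan)  # vegdist()
library(ape)    # pcoa() and its biplot() method

# Transform data: transpose so each row is a sample and each column a bacteria
t_nusantics <- as.data.frame(t(nusantics[, -1]))

# Plotting PCoA: Bray-Curtis dissimilarity between samples, then PCoA
t_nusantics.bray <- vegdist(t_nusantics, method = "bray")
pcoa.t1 <- pcoa(t_nusantics.bray)
biplot(pcoa.t1)

# Interactive plot of the first two PCoA axes.
# Sample names are taken from the row names of t_nusantics.
plot_ly(x = ~pcoa.t1$vectors[, 1], y = ~pcoa.t1$vectors[, 2],
        hoverinfo = "text",
        text = ~paste("Sample :", rownames(t_nusantics))) %>%
  add_markers()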
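To make the two steps above concrete, here is a small self-contained sketch on made-up numbers. It is not the code used for the plots: the Bray-Curtis dissimilarity is written out by hand instead of calling vegdist(), and base R's cmdscale() (classical MDS, which is what PCoA computes) stands in for pcoa(). The abundance values and the helper name bray_curtis are assumptions for illustration only:

```r
# Toy abundance table (invented values): rows are samples, columns are bacteria.
abund <- rbind(
  A = c(10, 0, 5),
  B = c(0, 10, 5),
  C = c(5, 5, 0)
)

# Bray-Curtis dissimilarity between two abundance vectors:
# sum of absolute differences divided by the sum of all abundances.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill the full sample-by-sample dissimilarity matrix.
n <- nrow(abund)
d <- matrix(0, n, n, dimnames = list(rownames(abund), rownames(abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(abund[i, ], abund[j, ])
  }
}

# Classical MDS on the dissimilarity matrix; keep the first two axes.
coords <- cmdscale(as.dist(d), k = 2)

print(round(d, 3))
print(coords)
```

Each row of coords is one sample placed in two dimensions, so samples with similar composition end up close together. These three toy samples happen to fit exactly in two dimensions; with real data the first two axes usually capture only part of the dissimilarity.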
Next, the last question.
Please read the attached paper MISH.pdf about a random-forest (RF) implementation to predict cases of atopic dermatitis. Imagine that we are building a similar RF model to predict cases of a different disease, but we only have 120 samples. Among these, 90 of them are labelled as “non-healthy”, and 30 of them are labelled as “healthy” by a dermatologist. Given the imbalance and relatively low number of samples, how would you design the RF model so that it gives the best performance?
My Answer: In this context, unbalanced data refers to a classification problem where the classes have unequal numbers of instances, which can be problematic for machine learning classification. I got some help from this blog, which says: “Most machine learning classification algorithms are sensitive to unbalance in the predictor classes. With under-sampling, we randomly select a subset of samples from the class with more instances to match the number of samples coming from each class. In our example, we would randomly pick 30 out of the 90 non-healthy cases. The main disadvantage of under-sampling is that we lose potentially relevant information from the left-out samples.”
Here are some examples of building an RF model with unbalanced data:
source: https://shiring.github.io/machine_learning/2017/04/02/unbalanced
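The under-sampling step described in the answer can be sketched in base R as follows. This is only an illustration on a made-up label vector with the same 90/30 split, not code from the linked blog; the data frame toy and the names n_min, keep and balanced are assumptions. It covers only the balancing step, not the random-forest fit itself:

```r
set.seed(42)  # make the random draw reproducible

# Made-up stand-in for the 120 labelled samples: 90 "non-healthy", 30 "healthy".
toy <- data.frame(
  id = seq_len(120),
  label = rep(c("non-healthy", "healthy"), times = c(90, 30)),
  stringsAsFactors = FALSE
)

# Size of the smallest class; every class is cut down to this many rows.
n_min <- min(table(toy$label))

# Split the row indices by class and draw n_min of them at random from each.
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$label), function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

balanced <- toy[keep, ]
print(table(balanced$label))
```

The balanced table keeps all 30 healthy samples and a random 30 of the 90 non-healthy ones. In practice the under-sampling would usually be applied only to the training data (for example inside each cross-validation fold), so that the held-out samples keep the original class ratio.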