Segmenting RNA 3D structures using multiple clustering algorithms
Building on the success of Mean Shift, we further use its results to guide two parametric clustering algorithms: Agglomerative (hierarchical) clustering [3] and K-means [4]. Both methods require the number of clusters k to be specified in advance, which we set to the number of clusters detected by the Mean Shift algorithm. Additionally, we employ the Silhouette criterion as an alternative method for determining the optimal number of clusters, providing a complementary baseline [5].
The Silhouette score is a widely used internal evaluation metric for clustering quality, as it measures both the cohesion of points within the same cluster and the separation between different clusters. For a data point i, the Silhouette score is defined as follows:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where:
- a(i) is the mean distance between point i and all other points of its own cluster (cohesion);
- b(i) is the smallest mean distance between point i and the points of any other cluster (separation).
If s(i) is negative, it indicates that point i is closer, on average, to another cluster than to its own, suggesting a possible misassignment. If s(i) ≈ 0, the point lies on or near the boundary between two clusters. If s(i) > 0, the point is appropriately assigned to its cluster. The overall clustering quality is quantified by the mean Silhouette score across all data points:
S = (1/n) Σᵢ s(i)

where n is the total number of data points. As with the individual s(i), a negative S indicates that, on average, misassigned points outweigh well-assigned ones, while a positive S indicates the opposite. Optimizing the global Silhouette score therefore tends to yield a good partitioning. In this research, we tested agglomerative and k-means clustering with the number of clusters ranging from 2 to 5, then chose the value that yields the highest global Silhouette score.
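As an illustration of this selection step, the sketch below scans k from 2 to 5 with scikit-learn and keeps, for each algorithm, the k whose labels maximize the global Silhouette score (function and variable names are ours, not the server's):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 6)):
    # For each algorithm, fit every candidate k and keep the k whose
    # labels maximize the global Silhouette score.
    candidates = {
        "agglomerative": lambda k: AgglomerativeClustering(n_clusters=k),
        "kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    }
    best = {}
    for name, make in candidates.items():
        scores = {k: silhouette_score(X, make(k).fit_predict(X)) for k in k_range}
        best[name] = max(scores, key=scores.get)
    return best
```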
To implement graph-based algorithms, the input Cartesian coordinates from each PDB file must first be converted into a graph representation, where nodes represent residues and edges connect residues whose distance is below a certain threshold. Since the C3′ atom directly participates in the phosphodiester bond that orients the RNA backbone, we used the C3′ coordinates to represent each residue. In our previous work, we found that the C3′ atom offered a better coarse-grained representation than the C1′-only or C4′-only models. To determine the edge threshold, we analyzed the distribution of distances between consecutive C3′ atoms in the 132 structures of the RNA3DHub dataset. Although the vast majority of distances ranged from 6.5 to 8.0 Å, the longest one exceeded 14.0 Å. Therefore, we set the upper distance threshold to 15.0 Å for building the graph edges in all datasets.
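A minimal sketch of this conversion with NetworkX, assuming `c3_coords` is an (n, 3) array of C3′ coordinates extracted from a PDB file:

```python
import numpy as np
import networkx as nx

def structure_to_graph(c3_coords, threshold=15.0):
    # Nodes are residue indices; an (unweighted) edge links two residues
    # whose C3'-C3' distance is below the threshold in angstroms.
    coords = np.asarray(c3_coords, dtype=float)
    g = nx.Graph()
    g.add_nodes_from(range(len(coords)))
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i_idx, j_idx = np.where(np.triu(dists < threshold, k=1))
    g.add_edges_from((int(i), int(j)) for i, j in zip(i_idx, j_idx))
    return g
```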
This value ensures that backbone links are not interrupted and that the graph is dense enough to form communities. All three graph-based algorithms used here rely on maximizing the modularity gain, each in its own way, to achieve the optimal partitioning. Modularity is a criterion for measuring partitioning quality via the strength of connections within and between communities [7]. The modularity score is calculated according to this formula [7,8]:

Q = (1/2m) Σᵢⱼ [Aᵢⱼ − γ·kᵢkⱼ/(2m)]·δ(cᵢ, cⱼ)

where:
- Aᵢⱼ is the weight of the edge between nodes i and j (the adjacency matrix);
- kᵢ and kⱼ are the degrees of nodes i and j (the sums of their edge weights);
- m is the total edge weight of the graph;
- γ is the resolution parameter;
- δ(cᵢ, cⱼ) equals 1 if nodes i and j belong to the same community, and 0 otherwise.
This score ranges from −0.5 to 1.0. The essence of the modularity score is to measure the difference between the observed connection probability (Aᵢⱼ) and the expected connection probability (γkᵢ·kⱼ/2m) within communities of the graph. A negative score means the observed connections are weaker than expected, i.e., connections between communities are stronger (more edges) than within them. A zero score means there is no difference in strength between and within communities. A positive score means the observed connections are stronger than expected, indicating a significant division into communities, with more edges within communities than expected by chance.
A coefficient γ > 1 inflates the expected probability, so the observed proportion must be substantially larger to produce a notable difference (a large Q value), which yields small, tightly knit communities. Conversely, a coefficient γ < 1 deflates the expected probability, so the difference remains positive more easily, and the resulting communities tend to be large and less tightly knit. For this reason, γ is called a resolution parameter. It is adjustable for the modularity-based algorithms (Louvain, Leiden, CNM) on the server.
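NetworkX's modularity function exposes this resolution parameter directly; the toy example below, on two 5-node cliques joined by a single edge, shows how Q varies with γ:

```python
import networkx as nx

# Toy example: two 5-node cliques joined by a single edge.
g = nx.connected_caveman_graph(2, 5)
partition = [set(range(5)), set(range(5, 10))]

# gamma > 1 favors smaller communities, gamma < 1 favors larger ones.
for gamma in (0.5, 1.0, 2.0):
    q = nx.community.modularity(g, partition, resolution=gamma)
    print(f"gamma={gamma}: Q={q:.3f}")
```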
The Mean Shift algorithm [1] is a distance-based clustering approach that has demonstrated its potential in clustering 3D RNA domains from the RNA3DB and RNA3DHub datasets with pertinent parameter values [2]. In our previous study, we kept orphans (points outside the bandwidth of any centroid, considered outliers) to detect non-domain regions (linkers). However, some results showed that treating outliers as non-domain regions is still flawed. Therefore, in this study, we let the algorithm cluster all points, meaning that orphans are assigned the label of the nearest centroid. See more about Mean Shift and its parameters in Sklearn.
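A minimal scikit-learn sketch of this configuration; note that the flat kernel of `MeanShift` corresponds to the uniform kernel, and `estimate_bandwidth` with quantile=0.2 is used here as a stand-in for the 0.2 quantile of pairwise distances described in [2]:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def mean_shift_labels(coords):
    # Bandwidth from the 0.2 quantile heuristic; cluster_all=True assigns
    # orphans to the nearest centroid instead of labeling them -1.
    X = np.asarray(coords, dtype=float)
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    return MeanShift(bandwidth=bandwidth, cluster_all=True).fit_predict(X)
```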
Agglomerative (hierarchical) clustering is a bottom-up approach in which data points are progressively merged to form larger clusters [3]. Initially, each data point is treated as its own singleton cluster. At each iteration, the algorithm identifies the pair of clusters that are closest under a chosen linkage criterion and merges them into a new cluster. The linkage strategies determine how distances between clusters are computed [6]:
- single linkage: the minimum distance between points of the two clusters;
- complete linkage: the maximum distance between points of the two clusters;
- average linkage: the mean distance between points of the two clusters;
- Ward's linkage: the merge that minimizes the increase in total within-cluster variance.
During the bottom-up process, the algorithm implicitly builds a dendrogram that records the order and the distance at which clusters were combined. The algorithm stops when it reaches a predefined criterion, either a certain number of clusters or a specified distance threshold. In this study, we use Ward's linkage with the number of clusters k as the stopping criterion, where k is either the number of clusters returned by the Mean Shift algorithm with the parameters published in our previous work (uniform kernel and a bandwidth set to the 0.2 quantile of pairwise distances) [2], or the number of clusters that maximizes the Silhouette score. See more about Agglomerative clustering and its parameters in Sklearn.
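In scikit-learn terms, this amounts to the following sketch, where k would come from Mean Shift or from the Silhouette search shown earlier:

```python
from sklearn.cluster import AgglomerativeClustering

def ward_segmentation(X, k):
    # Ward linkage merges the pair of clusters that least increases the
    # total within-cluster variance, stopping at k clusters.
    model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    return model.fit_predict(X)
```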
K-means clustering is a partition-based algorithm that divides data into a predefined number of clusters by iteratively refining cluster centroids [4]. It begins by initializing k cluster centers (centroids), either by selecting them randomly or using the more robust k-means++ strategy [6]. The k-means++ initialization samples points with a probability proportional to their squared distance from the nearest existing centroid, which reduces the risk of poor local minima by spreading out the initial centers. The main iterative procedure alternates between two phases:
1. Assignment: each data point is assigned to the cluster whose centroid is nearest.
2. Update: each centroid is recomputed as the mean of the points assigned to it.

These phases repeat until convergence, which occurs when the positions of the centroids no longer change significantly between iterations, or when a predefined maximum number of iterations is reached. The objective function minimized by k-means is the inertia:

J = Σᵢ ‖xᵢ − μ_c(i)‖²

where:
- xᵢ is the i-th data point;
- c(i) is the index of the cluster to which xᵢ is assigned;
- μ_c(i) is the centroid of that cluster.
The K-means algorithm is efficient and scales well to large datasets. However, it assumes convex, roughly spherical clusters of similar size and requires k to be specified in advance. Therefore, in this study, we let k be either the number of clusters returned by the Mean Shift algorithm with a uniform kernel and a bandwidth set to the 0.2 quantile of pairwise distances, or the number of clusters that maximizes the Silhouette score. See more about K-Means and its parameters in Sklearn.
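A corresponding scikit-learn sketch; k-means++ is the library's default initialization, and `n_init` restarts mitigate poor local minima:

```python
from sklearn.cluster import KMeans

def kmeans_segmentation(X, k):
    # Alternates assignment and update steps; keeps the restart with the
    # lowest inertia and returns both labels and that inertia value.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    return labels, km.inertia_
```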
The Clauset-Newman-Moore (CNM) algorithm is a greedy method for detecting communities in networks by maximizing the modularity score [9]. Starting with each node as an individual community, the algorithm considers all possible pairs of communities and merges the pair that yields the greatest modularity gain:

ΔQᵢⱼ = eᵢⱼ/m − γ·KᵢKⱼ/(2m²)

where eᵢⱼ is the total edge weight between communities cᵢ and cⱼ, and Kᵢ, Kⱼ are the total degrees of the nodes in each community. This merging step is repeated iteratively until no further merge can increase modularity, yielding the final partition of the network. The CNM algorithm is straightforward and efficient for smaller networks, though it does not produce a hierarchical structure, which limits its ability to detect communities at multiple scales. For a graph with m edges and n nodes, the time complexity of this algorithm is O(m·n), or O(m·log(n)) with advanced data structures such as priority queues. See more about CNM and its parameters in NetworkX.
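In NetworkX, the CNM greedy merging is available as `greedy_modularity_communities`; a minimal sketch on a placeholder graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.karate_club_graph()  # placeholder for a residue graph
# Greedy CNM merging; `resolution` is the gamma of the modularity formula.
communities = greedy_modularity_communities(g, resolution=1.0)
print([sorted(c) for c in communities])
```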
The Louvain algorithm is a widely used hierarchical method for community detection in networks [10]. Like CNM, it is based on the idea of maximizing the modularity score. However, it operates locally by refining communities through individual node movements, and extends this process across multiple levels through iterative aggregation. On each iteration, the Louvain algorithm has two main phases:
1. Local moving: each node is moved to the neighboring community that yields the largest modularity gain, repeated until no individual move improves modularity.
2. Aggregation: each community is collapsed into a single node, producing a smaller network on which the first phase is applied again.

These two phases are repeated iteratively, allowing the algorithm to detect communities at multiple scales. The Louvain algorithm is efficient and scalable, making it suitable for large networks, and it often uncovers meaningful community structures while providing a hierarchical view of the data. Its time complexity, O(n·log(n)) for a graph with n nodes, is better than that of the aforementioned algorithms.
One limitation of the Louvain algorithm is that it can yield weakly connected communities: key nodes of a community may move to nearby communities to achieve a better modularity gain, leaving the remaining nodes of that community sparse or disconnected. To overcome this problem, Traag et al. developed the Leiden algorithm [11]. See more about Louvain and its parameters in NetworkX.
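NetworkX also provides a Louvain implementation; a minimal sketch on a placeholder graph:

```python
import networkx as nx

g = nx.karate_club_graph()  # placeholder for a residue graph
# Louvain local moving + aggregation; `seed` fixes the node ordering
# so the result is reproducible.
communities = nx.community.louvain_communities(g, resolution=1.0, seed=42)
print([sorted(c) for c in communities])
```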
The Leiden algorithm is a community detection method inspired by the Louvain algorithm, with additional refinements to improve partition quality and guarantee well-connected communities. It adds a refinement phase between the local optimization and aggregation phases: communities found by local moving are split and re-merged so that aggregation operates on well-connected subcommunities, thereby reducing the risk of poorly connected or singleton communities.
The three phases are then repeated iteratively until there is no more improvement. Its performance, scalability, and ability to provide strongly connected communities make it suitable for large networks. The time complexity of the Leiden algorithm is O(n·log(n)), like the Louvain algorithm. See more about Leiden and its parameters on the leidenalg library webpage.
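A minimal sketch with the leidenalg library; the `RBConfigurationVertexPartition` quality function generalizes modularity with the resolution parameter γ used elsewhere in this section:

```python
import igraph as ig
import leidenalg as la

g = ig.Graph.Famous("Zachary")  # placeholder for a residue graph
# Leiden with a resolution-parameterized modularity quality function.
partition = la.find_partition(
    g, la.RBConfigurationVertexPartition, resolution_parameter=1.0
)
print(partition.membership)
```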
DomGen is a graph-based algorithm specialized in predicting protein structural domains [12]. It first models the 3D structure of a protein as a graph, where each residue (amino acid) is a vertex, and an edge is formed between two vertices if the distance between their Cα atoms (Cαᵢ, Cαⱼ) or between the centers of their side chains (Sᵢ, Sⱼ) is below a distance threshold r, typically set to 4.5 Å.
After that, the algorithm colors vertices in order from the one with the most edges (the largest node degree) to the one with the fewest edges (the smallest node degree), following these rules: if a vertex and its neighbors are not colored, color the vertex with a new color; if a vertex is not colored but its neighbors are, color the vertex with the most common color among its neighbors; if a vertex is already colored, do nothing. After being colored, a vertex transmits its color to all its neighbors that are not yet colored and are not adjacent to it in the protein sequence. This process yields initial small clusters representing domain "cores". DomGen then evaluates the quality of each cluster Dᵢ through the ratio of out-cluster to in-cluster linkages, to refine the cluster so that it is biologically meaningful:

qᵢ = Wout(Dᵢ) / Win(Dᵢ)

where:
- Wout(Dᵢ) is the total edge weight between cluster Dᵢ and all other clusters;
- Win(Dᵢ) is the total edge weight inside Dᵢ.
If qᵢ > 1, the total edge weight between cluster Dᵢ and all other clusters is greater than the total edge weight inside Dᵢ itself; in other words, residues in Dᵢ are more connected to residues outside their cluster than to those inside it. In that case, cluster Dᵢ is considered unstable and is invalidated by recoloring all its vertices grey. These grey vertices are then returned to the pool of unassigned nodes and reassigned to neighboring clusters. To merge small clusters into larger, coherent domains, DomGen uses a merging ratio between pairs of clusters.
Clusters are merged if their merging ratio is greater than or equal to a certain threshold, set to 0.41. Additionally, edges between consecutive residues belonging to different clusters are strengthened by a weight increment wc (default wc = 5) to promote sequence continuity. In this work, we keep wc = 5 and tune both the invalidation threshold t for the quality score and the merging threshold from 0 to 1, with a discretization step of 0.1.
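As an illustration of the quality score, here is a minimal sketch (our own helper, not DomGen's code) computing qᵢ for one cluster of a residue graph:

```python
import networkx as nx

def quality_ratio(g, cluster):
    # q_i = total edge weight leaving the cluster / total edge weight
    # inside it; q_i > 1 means DomGen would invalidate the cluster
    # (recolor its vertices grey).
    inside = set(cluster)
    w_in = w_out = 0.0
    for u, v, w in g.edges(data="weight", default=1.0):
        if u in inside and v in inside:
            w_in += w
        elif (u in inside) != (v in inside):
            w_out += w
    return w_out / w_in if w_in > 0 else float("inf")
```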
UniDoc is a hierarchical algorithm with two processes, top-down and bottom-up [13]. The idea behind this algorithm is to minimize inter-domain interactions through the top-down process, while maximizing intra-domain interactions through the bottom-up process. First, UniDoc takes the matrix of distances between residues (Cα or Cβ) as input and transforms it into a contact probability matrix using a logistic function. From this matrix, UniDoc's scoring functions compute an inter-domain score, DISinter, between two segments, and an intra-domain score, DISintra, within a segment.
The algorithm performs the top-down process: divide the protein into fragments by splitting continuously or discontinuously, so that DISinter is minimized, while ensuring the fragments have a minimum size. For a continuous split, the objective is to find the position k on the parent domain D of length l such that:

k = argmin over k′ ∈ [1, l) of DISinter(D₁k′, D₂k′)

where D₁k′, D₂k′ are the new segments obtained from D by a single cut at position k′. For a discontinuous split, the objective is to find the positions t and s on the parent domain D such that:

(t, s) = argmin over t′, s′ of DISinter(D₁t′,s′, D₂t′,s′)
where D₁t′,s′, D₂t′,s′ are the new segments obtained from D by a double cut at positions t′ and s′. A split is only accepted if DISinter(D₁, D₂) is less than a certain fraction s (e.g., 0.5) of DISintra(D). Subsequently, a bottom-up merging process is applied: two fragments are merged if they exhibit a strong intra-domain similarity, defined by maximizing the difference between DISinter and DISintra:

S(i, j) = DISinter(Dᵢ, Dⱼ) − m·min(DISintra(Dᵢ), DISintra(Dⱼ))

where m is set to 1 in the original study [13]. The merge is only accepted if S(i, j) > 0. Finally, the algorithm applies an additional post-processing step to handle small fragments or domains that are not coherent: any fragment with weak internal cohesion (DISintra < 1) is merged with the fragment with which it interacts most strongly (i.e., the highest value of DISinter). In this chapter, we decided to tune the factor s, the ratio of the DISinter of the child fragments to the DISintra of the parent fragment, from 0.1 to 1, and the merge threshold factor m, the ratio of the DISinter of two fragments to the minimum DISintra of either of them, from 0.2 to 2.
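To illustrate the continuous single-cut search, here is a hypothetical sketch that approximates DISinter as the summed contact probability across the cut; both this approximation and the `min_size` value are our assumptions for illustration, not UniDoc's exact definitions:

```python
import numpy as np

def best_single_cut(P, idx, min_size=20):
    # P: contact-probability matrix; idx: residue indices of the parent
    # domain D. Scans every cut position k' and keeps the one minimizing
    # the inter-domain score between the two resulting segments
    # (DISinter approximated as the summed cross-cut contact probability).
    best_k, best_score = None, np.inf
    for k in range(min_size, len(idx) - min_size):
        left, right = idx[:k], idx[k:]
        score = P[np.ix_(left, right)].sum()
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score
```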
The literature contains several methods that decompose proteins into 3D domains following a top-down/bottom-up approach, such as DomainParser [14,15], PDP (Protein Domain Parser) [16], DDOMAIN [17], or SWORD [18,19]. The convenience of this bidirectional framework is that researchers can test it with different scoring functions for each of the top-down and bottom-up processes, i.e., the criteria for cutting the 3D structure can differ from those for merging the resulting elements into domains. In this study, we introduce a new graph-based bidirectional hierarchical clustering (BiHC) algorithm, which performs a top-down/bottom-up segmentation using two different scoring functions: the modularity (as seen in Section 2.1.2) for the divisive phase of the hierarchical clustering (top-down), and our new scoring function for the agglomerative phase (bottom-up).
First, for the top-down process, to be able to use the modularity function, we need to convert the input from Cartesian coordinates, the data type the UniDoc algorithm works on, into a graph. The new algorithm then proceeds similarly to UniDoc: at each iteration, it searches for the cut that yields the largest modularity gain, creating two subgraphs, and then examines each sub-segment for further cutting. The process stops when no remaining cut yields a valid modularity Q (i.e., greater than a given threshold).
The bottom-up process takes as input the segments resulting from the top-down process. Here we also use a graph-based scoring function, the inter-connection score φ, which measures the proportion of nodes involved in the inter-connection between two communities (i.e., the contact ratio between two domains):

φ(C₁, C₂) = min (or max) of (|B₁|/|C₁|, |B₂|/|C₂|)

where:
- C₁ and C₂ are the two communities (segments) being compared;
- Bᵢ is the set of nodes of Cᵢ that have at least one edge to a node of the other community.
We can see that, if φ is taken as the minimum, the algorithm prioritizes merging partitions of similar size: the number of nodes of a large subgraph in contact with a much smaller subgraph may be small relative to its total nodes, resulting in a low contact ratio. Conversely, if φ is taken as the maximum, the algorithm prioritizes merging small partitions into large ones, because a small subgraph may have a large fraction of its nodes in contact with a larger subgraph. In this study, we tested the bottom-up process with both the min and max variants of φ.
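A minimal sketch of this score (our own illustrative helper, following the min/max variants described above):

```python
import networkx as nx

def interconnection_score(g, c1, c2, mode="min"):
    # For each community, compute the fraction of its nodes adjacent to
    # the other community, then combine the two fractions with min or max.
    c1, c2 = set(c1), set(c2)
    r1 = sum(1 for u in c1 if any(v in c2 for v in g[u])) / len(c1)
    r2 = sum(1 for u in c2 if any(v in c1 for v in g[u])) / len(c2)
    return min(r1, r2) if mode == "min" else max(r1, r2)
```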