Segmenting RNA 3D structures using multiple clustering algorithms
Building on the success of Mean Shift, we further use its results to guide two parametric clustering algorithms: Agglomerative (hierarchical) clustering [3] and K-means [4]. Both methods require the number of clusters k to be specified in advance, which we set to the number of clusters detected by the Mean Shift algorithm. Additionally, we employ the Silhouette criterion as an alternative method for determining the optimal number of clusters, providing a complementary baseline [5].
The Silhouette score is a widely used internal evaluation metric for clustering quality, as it measures both the cohesion of points within the same cluster and the separation between different clusters. For a data point i, the Silhouette score is defined as follows:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where:
- a(i) is the mean distance between point i and all other points of its own cluster (cohesion);
- b(i) is the smallest mean distance between point i and the points of any other cluster (separation).
If s(i) is negative, it indicates that point i is closer, on average, to another cluster than to its own, suggesting a possible misassignment. If s(i) ≈ 0, the point lies on or near the boundary between two clusters. If s(i) > 0, the point is appropriately assigned to its cluster. The overall clustering quality is quantified by the mean Silhouette score across all data points:
S = (1/n) Σᵢ s(i)

where n is the total number of data points. As with the individual s(i), a negative S indicates that, on average, misassigned points outweigh well-assigned ones, while a positive S indicates the opposite. Optimizing the global Silhouette score therefore tends to yield a good partitioning. In this research, we tested agglomerative and k-means clustering with the number of clusters ranging from 2 to 5, then chose the value that yields the highest global Silhouette score.
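As an illustration of this selection step, the sketch below scans k from 2 to 5 with scikit-learn and keeps, for each algorithm, the k whose labels maximize the global Silhouette score (function and variable names are ours, not the server's):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 6)):
    # For each algorithm, fit every candidate k and keep the k whose
    # labels maximize the global Silhouette score.
    candidates = {
        "agglomerative": lambda k: AgglomerativeClustering(n_clusters=k),
        "kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    }
    best = {}
    for name, make in candidates.items():
        scores = {k: silhouette_score(X, make(k).fit_predict(X)) for k in k_range}
        best[name] = max(scores, key=scores.get)
    return best
```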
To implement graph-based algorithms, the input Cartesian coordinates from each PDB file must first be converted into a graph representation, where nodes represent residues and edges connect residues whose distance is below a certain threshold. Since the C3′ atom directly participates in the phosphodiester bond that orients the RNA backbone, we used the C3′ coordinates to represent each residue. In our previous work, we found that the C3′ atom offered a better coarse-grained representation than the C1′-only or C4′-only models. To determine the edge threshold, we analyzed the distribution of distances between consecutive C3′ atoms in the 132 structures of the RNA3DHub dataset. Although the vast majority of distances ranged from 6.5 to 8.0 Å, the longest one exceeded 14.0 Å. Therefore, we set the upper distance threshold to 15.0 Å for building the graph edges in all datasets.
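A minimal sketch of this conversion with NetworkX, assuming `c3_coords` is an (n, 3) array of C3′ coordinates extracted from a PDB file:

```python
import numpy as np
import networkx as nx

def structure_to_graph(c3_coords, threshold=15.0):
    # Nodes are residue indices; an (unweighted) edge links two residues
    # whose C3'-C3' distance is below the threshold in angstroms.
    coords = np.asarray(c3_coords, dtype=float)
    g = nx.Graph()
    g.add_nodes_from(range(len(coords)))
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i_idx, j_idx = np.where(np.triu(dists < threshold, k=1))
    g.add_edges_from((int(i), int(j)) for i, j in zip(i_idx, j_idx))
    return g
```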
This value ensures that backbone links are not interrupted and that the graph is dense enough to form communities. All three graph-based algorithms used here rely on maximizing the modularity gain, each in its own way, to achieve the optimal partitioning. Modularity is a criterion for measuring partitioning quality via the strength of connections within and between communities [7]. The modularity score is calculated according to this formula [7,8]:

Q = (1/2m) Σᵢⱼ [Aᵢⱼ − γ·kᵢkⱼ/(2m)]·δ(cᵢ, cⱼ)

where:
- Aᵢⱼ is the weight of the edge between nodes i and j (the adjacency matrix);
- kᵢ and kⱼ are the degrees of nodes i and j (the sums of their edge weights);
- m is the total edge weight of the graph;
- γ is the resolution parameter;
- δ(cᵢ, cⱼ) equals 1 if nodes i and j belong to the same community, and 0 otherwise.
This score ranges from −0.5 to 1.0. The essence of the modularity score is to measure the difference between the observed connection probability (Aᵢⱼ) and the expected connection probability (γkᵢ·kⱼ/2m) within communities of the graph. A negative score means the observed connections are weaker than expected, i.e., connections between communities are stronger (more edges) than within them. A zero score means there is no difference in strength between and within communities. A positive score means the observed connections are stronger than expected, indicating a significant division into communities, with more edges within communities than expected by chance.
A coefficient γ > 1 inflates the expected probability, so the observed proportion must be substantially larger to produce a notable difference (a large Q value), which yields small, tightly knit communities. Conversely, a coefficient γ < 1 deflates the expected probability, so the difference remains positive more easily, and the resulting communities tend to be large and less tightly knit. For this reason, γ is called a resolution parameter. It is adjustable for the modularity-based algorithms (Louvain, Leiden, CNM) on the server.
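NetworkX's modularity function exposes this resolution parameter directly; the toy example below, on two 5-node cliques joined by a single edge, shows how Q varies with γ:

```python
import networkx as nx

# Toy example: two 5-node cliques joined by a single edge.
g = nx.connected_caveman_graph(2, 5)
partition = [set(range(5)), set(range(5, 10))]

# gamma > 1 favors smaller communities, gamma < 1 favors larger ones.
for gamma in (0.5, 1.0, 2.0):
    q = nx.community.modularity(g, partition, resolution=gamma)
    print(f"gamma={gamma}: Q={q:.3f}")
```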
The Mean Shift algorithm [1] is a distance-based clustering approach that has demonstrated its potential in clustering 3D RNA domains from the RNA3DB and RNA3DHub datasets with pertinent parameter values [2]. In our previous study, we kept orphans (points outside the bandwidth of any centroid, considered outliers) to detect non-domain regions (linkers). However, some results showed that treating outliers as non-domain regions is still flawed. Therefore, in this study, we let the algorithm cluster all points, meaning that orphans are assigned the label of the nearest centroid. See more about Mean Shift and its parameters in Sklearn.
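A minimal scikit-learn sketch of this configuration; note that the flat kernel of `MeanShift` corresponds to the uniform kernel, and `estimate_bandwidth` with quantile=0.2 is used here as a stand-in for the 0.2 quantile of pairwise distances described in [2]:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def mean_shift_labels(coords):
    # Bandwidth from the 0.2 quantile heuristic; cluster_all=True assigns
    # orphans to the nearest centroid instead of labeling them -1.
    X = np.asarray(coords, dtype=float)
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    return MeanShift(bandwidth=bandwidth, cluster_all=True).fit_predict(X)
```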
Agglomerative (hierarchical) clustering is a bottom-up approach in which data points are progressively merged to form larger clusters [3]. Initially, each data point is treated as its own singleton cluster. At each iteration, the algorithm identifies the pair of clusters that are closest under a chosen linkage criterion and merges them into a new cluster. The linkage strategies determine how distances between clusters are computed [6]:
- single linkage: the minimum distance between points of the two clusters;
- complete linkage: the maximum distance between points of the two clusters;
- average linkage: the mean distance between points of the two clusters;
- Ward's linkage: the merge that minimizes the increase in total within-cluster variance.
During the bottom-up process, the algorithm implicitly builds a dendrogram that records the order and the distance at which clusters were combined. The algorithm stops when it reaches a predefined criterion, either a certain number of clusters or a specified distance threshold. In this study, we use Ward's linkage with the number of clusters k as the stopping criterion, where k is either the number of clusters returned by the Mean Shift algorithm with the parameters published in our previous work (uniform kernel and a bandwidth set to the 0.2 quantile of pairwise distances) [2], or the number of clusters that maximizes the Silhouette score. See more about Agglomerative clustering and its parameters in Sklearn.
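In scikit-learn terms, this amounts to the following sketch, where k would come from Mean Shift or from the Silhouette search shown earlier:

```python
from sklearn.cluster import AgglomerativeClustering

def ward_segmentation(X, k):
    # Ward linkage merges the pair of clusters that least increases the
    # total within-cluster variance, stopping at k clusters.
    model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    return model.fit_predict(X)
```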
K-means clustering is a partition-based algorithm that divides data into a predefined number of clusters by iteratively refining cluster centroids [4]. It begins by initializing k cluster centers (centroids), either by selecting them randomly or using the more robust k-means++ strategy [6]. The k-means++ initialization samples points with a probability proportional to their squared distance from the nearest existing centroid, which reduces the risk of poor local minima by spreading out the initial centers. The main iterative procedure alternates between two phases:
1. Assignment: each data point is assigned to the cluster whose centroid is nearest.
2. Update: each centroid is recomputed as the mean of the points assigned to it.

These phases repeat until convergence, which occurs when the positions of the centroids no longer change significantly between iterations, or when a predefined maximum number of iterations is reached. The objective function minimized by k-means is the inertia:

J = Σᵢ ‖xᵢ − μ_c(i)‖²

where:
- xᵢ is the i-th data point;
- c(i) is the index of the cluster to which xᵢ is assigned;
- μ_c(i) is the centroid of that cluster.
The K-means algorithm is efficient and scales well to large datasets. However, it assumes convex, roughly spherical clusters of similar size and requires k to be specified in advance. Therefore, in this study, we let k be either the number of clusters returned by the Mean Shift algorithm with a uniform kernel and a bandwidth set to the 0.2 quantile of pairwise distances, or the number of clusters that maximizes the Silhouette score. See more about K-Means and its parameters in Sklearn.
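A corresponding scikit-learn sketch; k-means++ is the library's default initialization, and `n_init` restarts mitigate poor local minima:

```python
from sklearn.cluster import KMeans

def kmeans_segmentation(X, k):
    # Alternates assignment and update steps; keeps the restart with the
    # lowest inertia and returns both labels and that inertia value.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    return labels, km.inertia_
```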
The Clauset-Newman-Moore (CNM) algorithm is a greedy method for detecting communities in networks by maximizing the modularity score [9]. Starting with each node as an individual community, the algorithm considers all possible pairs of communities and merges the pair that yields the greatest modularity gain:

ΔQᵢⱼ = eᵢⱼ/m − γ·KᵢKⱼ/(2m²)

where eᵢⱼ is the total edge weight between communities cᵢ and cⱼ, and Kᵢ, Kⱼ are the total degrees of the nodes in each community. This merging step is repeated iteratively until no further merge can increase modularity, yielding the final partition of the network. The CNM algorithm is straightforward and efficient for smaller networks, though it does not produce a hierarchical structure, which limits its ability to detect communities at multiple scales. For a graph with m edges and n nodes, the time complexity of this algorithm is O(m·n), or O(m·log(n)) with advanced data structures such as priority queues. See more about CNM and its parameters in NetworkX.
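In NetworkX, the CNM greedy merging is available as `greedy_modularity_communities`; a minimal sketch on a placeholder graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.karate_club_graph()  # placeholder for a residue graph
# Greedy CNM merging; `resolution` is the gamma of the modularity formula.
communities = greedy_modularity_communities(g, resolution=1.0)
print([sorted(c) for c in communities])
```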
The Louvain algorithm is a widely used hierarchical method for community detection in networks [10]. Like CNM, it is based on the idea of maximizing the modularity score. However, it operates locally by refining communities through individual node movements, and extends this process across multiple levels through iterative aggregation. On each iteration, the Louvain algorithm has two main phases:
1. Local moving: each node is moved to the neighboring community that yields the largest modularity gain, repeated until no individual move improves modularity.
2. Aggregation: each community is collapsed into a single node, producing a smaller network on which the first phase is applied again.

These two phases are repeated iteratively, allowing the algorithm to detect communities at multiple scales. The Louvain algorithm is efficient and scalable, making it suitable for large networks, and it often uncovers meaningful community structures while providing a hierarchical view of the data. Its time complexity, O(n·log(n)) for a graph with n nodes, is better than that of the aforementioned algorithms.
One limitation of the Louvain algorithm is that it can yield weakly connected communities: key nodes of a community may move to nearby communities to achieve a better modularity gain, leaving the remaining nodes of that community sparse or disconnected. To overcome this problem, Traag et al. developed the Leiden algorithm [11]. See more about Louvain and its parameters in NetworkX.
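NetworkX also provides a Louvain implementation; a minimal sketch on a placeholder graph:

```python
import networkx as nx

g = nx.karate_club_graph()  # placeholder for a residue graph
# Louvain local moving + aggregation; `seed` fixes the node ordering
# so the result is reproducible.
communities = nx.community.louvain_communities(g, resolution=1.0, seed=42)
print([sorted(c) for c in communities])
```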
The Leiden algorithm is a community detection method inspired by the Louvain algorithm, with additional refinements to improve partition quality and guarantee well-connected communities. It adds a refinement phase between the local optimization and aggregation phases: communities found by local moving are split and re-merged so that aggregation operates on well-connected subcommunities, thereby reducing the risk of poorly connected or singleton communities.
The three phases are then repeated iteratively until there is no more improvement. Its performance, scalability, and ability to provide strongly connected communities make it suitable for large networks. The time complexity of the Leiden algorithm is O(n·log(n)), like the Louvain algorithm. See more about Leiden and its parameters on the leidenalg library webpage.
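A minimal sketch with the leidenalg library; the `RBConfigurationVertexPartition` quality function generalizes modularity with the resolution parameter γ used elsewhere in this section:

```python
import igraph as ig
import leidenalg as la

g = ig.Graph.Famous("Zachary")  # placeholder for a residue graph
# Leiden with a resolution-parameterized modularity quality function.
partition = la.find_partition(
    g, la.RBConfigurationVertexPartition, resolution_parameter=1.0
)
print(partition.membership)
```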
DomGen is a graph-based algorithm specialized in predicting protein structural domains [12]. It first models the 3D structure of a protein as a graph, where each residue (amino acid) is a vertex, and an edge is formed between two vertices if the distance between their Cα atoms (Cαᵢ, Cαⱼ) or between the centers of their side chains (Sᵢ, Sⱼ) is below a distance threshold r, typically set to 4.5 Å.
After that, the algorithm colors vertices in order from the one with the most edges (the largest node degree) to the one with the fewest edges (the smallest node degree), following these rules: if a vertex and its neighbors are not colored, color the vertex with a new color; if a vertex is not colored but its neighbors are, color the vertex with the most common color among its neighbors; if a vertex is already colored, do nothing. After being colored, a vertex transmits its color to all its neighbors that are not yet colored and are not adjacent to it in the protein sequence. This process yields initial small clusters representing domain "cores". DomGen then evaluates the quality of each cluster Dᵢ through the ratio of out-cluster to in-cluster linkages, to refine the cluster so that it is biologically meaningful:

qᵢ = Wout(Dᵢ) / Win(Dᵢ)

where:
- Wout(Dᵢ) is the total edge weight between cluster Dᵢ and all other clusters;
- Win(Dᵢ) is the total edge weight inside Dᵢ.
If qᵢ > 1, the total edge weight between cluster Dᵢ and all other clusters is greater than the total edge weight inside Dᵢ itself; in other words, residues in Dᵢ are more connected to residues outside their cluster than to those inside it. In that case, cluster Dᵢ is considered unstable and is invalidated by recoloring all its vertices grey. These grey vertices are then returned to the pool of unassigned nodes and reassigned to neighboring clusters. To merge small clusters into larger, coherent domains, DomGen uses a merging ratio between pairs of clusters.
Clusters are merged if their merging ratio is greater than or equal to a certain threshold, set to 0.41. Additionally, edges between consecutive residues belonging to different clusters are strengthened by a weight increment wc (default wc = 5) to promote sequence continuity. In this work, we keep wc = 5 and tune both the invalidation threshold t for the quality score and the merging threshold from 0 to 1, with a discretization step of 0.1.
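As an illustration of the quality score, here is a minimal sketch (our own helper, not DomGen's code) computing qᵢ for one cluster of a residue graph:

```python
import networkx as nx

def quality_ratio(g, cluster):
    # q_i = total edge weight leaving the cluster / total edge weight
    # inside it; q_i > 1 means DomGen would invalidate the cluster
    # (recolor its vertices grey).
    inside = set(cluster)
    w_in = w_out = 0.0
    for u, v, w in g.edges(data="weight", default=1.0):
        if u in inside and v in inside:
            w_in += w
        elif (u in inside) != (v in inside):
            w_out += w
    return w_out / w_in if w_in > 0 else float("inf")
```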
UniDoc is a hierarchical algorithm with two processes, top-down and bottom-up [13]. The idea behind this algorithm is to minimize inter-domain interactions through the top-down process, while maximizing intra-domain interactions through the bottom-up process. First, UniDoc takes the matrix of distances between residues (Cα or Cβ) as input and transforms it into a contact probability matrix using a logistic function. From this matrix, UniDoc's scoring functions compute an inter-domain score, DISinter, between two segments, and an intra-domain score, DISintra, within a segment.
The algorithm performs the top-down process: divide the protein into fragments by splitting continuously or discontinuously, so that DISinter is minimized, while ensuring the fragments have a minimum size. For a continuous split, the objective is to find the position k on the parent domain D of length l such that:

k = argmin over k′ ∈ [1, l) of DISinter(D₁k′, D₂k′)

where D₁k′, D₂k′ are the new segments obtained from D by a single cut at position k′. For a discontinuous split, the objective is to find the positions t and s on the parent domain D such that:

(t, s) = argmin over t′, s′ of DISinter(D₁t′,s′, D₂t′,s′)
where D₁t′,s′, D₂t′,s′ are the new segments obtained from D by a double cut at positions t′ and s′. A split is only accepted if DISinter(D₁, D₂) is less than a certain fraction s (e.g., 0.5) of DISintra(D). Subsequently, a bottom-up merging process is applied: two fragments are merged if they exhibit a strong intra-domain similarity, defined by maximizing the difference between DISinter and DISintra:

S(i, j) = DISinter(Dᵢ, Dⱼ) − m·min(DISintra(Dᵢ), DISintra(Dⱼ))

where m is set to 1 in the original study [13]. The merge is only accepted if S(i, j) > 0. Finally, the algorithm applies an additional post-processing step to handle small fragments or domains that are not coherent: any fragment with weak internal cohesion (DISintra < 1) is merged with the fragment with which it interacts most strongly (i.e., the highest value of DISinter). In this chapter, we decided to tune the factor s, the ratio of the DISinter of the child fragments to the DISintra of the parent fragment, from 0.1 to 1, and the merge threshold factor m, the ratio of the DISinter of two fragments to the minimum DISintra of either of them, from 0.2 to 2.
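To illustrate the continuous single-cut search, here is a hypothetical sketch that approximates DISinter as the summed contact probability across the cut; both this approximation and the `min_size` value are our assumptions for illustration, not UniDoc's exact definitions:

```python
import numpy as np

def best_single_cut(P, idx, min_size=20):
    # P: contact-probability matrix; idx: residue indices of the parent
    # domain D. Scans every cut position k' and keeps the one minimizing
    # the inter-domain score between the two resulting segments
    # (DISinter approximated as the summed cross-cut contact probability).
    best_k, best_score = None, np.inf
    for k in range(min_size, len(idx) - min_size):
        left, right = idx[:k], idx[k:]
        score = P[np.ix_(left, right)].sum()
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score
```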
The literature contains several methods that decompose proteins into 3D domains following a top-down/bottom-up approach, such as DomainParser [14,15], PDP (Protein Domain Parser) [16], DDOMAIN [17], or SWORD [18,19]. The convenience of this bidirectional framework is that researchers can test it with different scoring functions for each of the top-down and bottom-up processes, i.e., the criteria for cutting the 3D structure can differ from those for merging the resulting elements into domains. In this study, we introduce a new graph-based bidirectional hierarchical clustering (BiHC) algorithm, which performs a top-down/bottom-up segmentation using two different scoring functions: the modularity (as seen in Section 2.1.2) for the divisive phase of the hierarchical clustering (top-down), and our new scoring function for the agglomerative phase (bottom-up).
First, for the top-down process, to be able to use the modularity function, we need to convert the input from Cartesian coordinates, the data type the UniDoc algorithm works on, into a graph. The new algorithm then proceeds similarly to UniDoc: at each iteration, it searches for the cut that yields the largest modularity gain, creating two subgraphs, and then examines each sub-segment for further cutting. The process stops when no remaining cut yields a valid modularity Q (i.e., greater than a given threshold).
The bottom-up process takes as input the segments resulting from the top-down process. Here we also use a graph-based scoring function, the inter-connection score φ, which measures the proportion of nodes involved in the inter-connection between two communities (i.e., the contact ratio between two domains):

φ(C₁, C₂) = min (or max) of (|B₁|/|C₁|, |B₂|/|C₂|)

where:
- C₁ and C₂ are the two communities (segments) being compared;
- Bᵢ is the set of nodes of Cᵢ that have at least one edge to a node of the other community.
We can see that, if φ is taken as the minimum, the algorithm prioritizes merging partitions of similar size: the number of nodes of a large subgraph in contact with a much smaller subgraph may be small relative to its total nodes, resulting in a low contact ratio. Conversely, if φ is taken as the maximum, the algorithm prioritizes merging small partitions into large ones, because a small subgraph may have a large fraction of its nodes in contact with a larger subgraph. In this study, we tested the bottom-up process with both the min and max variants of φ.
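A minimal sketch of this score (our own illustrative helper, following the min/max variants described above):

```python
import networkx as nx

def interconnection_score(g, c1, c2, mode="min"):
    # For each community, compute the fraction of its nodes adjacent to
    # the other community, then combine the two fractions with min or max.
    c1, c2 = set(c1), set(c2)
    r1 = sum(1 for u in c1 if any(v in c2 for v in g[u])) / len(c1)
    r2 = sum(1 for u in c2 if any(v in c1 for v in g[u])) / len(c2)
    return min(r1, r2) if mode == "min" else max(r1, r2)
```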