EvryRNA : RNANet

RNANet

 

  A major part of any data-science project is finding appropriate data that contain enough signal to tackle the problem at hand. Cleaning the data, to ensure uniform measurements, compatibility between the various data sources and protocols, and a reasonable amount of noise, is then sometimes the most time-consuming step.

  Here we propose a first attempt at a standardized, automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (nucleotide frequencies for every position in a 3D chain), and information derived from the annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions, and backbone torsion angles). We hope this dataset will speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction by removing the need to spend time on data gathering and cleaning.

 

RNANet pipeline schema

RNANet Downloads

 


SQLite3 Database

Text files (CSV)

Git repository

  • RNANet.db : A SQLite3 database containing all the information. You might want to query it to build your own sub-datasets.
  • flat-text-files : CSV files summarizing the information for every RNA 3D chain (one file per 3D chain mapped to an Rfam family)

Extract the archives with the commands gunzip RNANet.db.gz (to recreate RNANet.db) and tar -xvzf RNANET_datapoints_latest.tar.gz (to recreate a folder of text files).
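Where the command-line tools are not available, the same extraction could be done from Python with the standard library. The following is only a minimal sketch: the helper names gunzip_file and untar_gz are hypothetical, and the output folder name in the usage comment is an assumption.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip_file(archive, destination):
    """Decompress a .gz file to `destination`, streaming to keep memory use low."""
    archive, destination = Path(archive), Path(destination)
    with gzip.open(archive, "rb") as src, open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return destination


def untar_gz(archive, target_dir):
    """Extract a .tar.gz archive into `target_dir` and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names


# Example usage (archive names as given above, "datapoints" is an assumed folder name):
# gunzip_file("RNANet.db.gz", "RNANet.db")
# untar_gz("RNANET_datapoints_latest.tar.gz", "datapoints")
```

Unlike gunzip, this sketch leaves the original .gz archive in place.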

Additional Metadata:

  • summary.csv : Additional information about the previous RNA chains (date of publication, resolution, basepair types counts)
  • families.csv : Additional information about the Rfam RNA families used (number of 3D chains, number of homologous sequences)
  • frequencies.csv : Nucleotide frequencies by RNA family, including modified bases
  • pair_types.csv : Basepair-type frequencies by RNA family, including only intra-chain base-base interactions, in Leontis-Westhof nomenclature

The metadata provided in the database tables, and pre-extracted in the supplementary CSV files, might be useful to assess the dataset quality or to perform further filtering.
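As an illustration of such filtering, the sketch below keeps the rows of a metadata CSV file whose numeric column falls below a threshold. It uses only the standard library. The function name rows_below_threshold is hypothetical, and the column name "resolution" in the usage comment is an assumption: check the header of the file for the actual column names.

```python
import csv


def rows_below_threshold(csv_path, column, threshold):
    """Return the rows of a CSV file whose numeric `column` is below `threshold`.

    Rows where the value is missing or not numeric are skipped.
    """
    kept = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                value = float(row[column])
            except (KeyError, TypeError, ValueError):
                continue
            if value < threshold:
                kept.append(row)
    return kept


# Example usage (the column name is an assumption):
# good_chains = rows_below_threshold("summary.csv", "resolution", 4.0)
```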

You can also browse all past releases of the flat-text files (approx. one per month)

Descriptors

For each RNA chain available in 3D and mapped to an RNA family, we provide the following list of descriptors:

 

Descriptor | Label | Type
Index of the residue in the chain (from 1 to N) | index_chain | int ≥ 1
Index of the residue in the source mmCIF file | nt_resnum | int > 1
Position of the nucleotide in the chain, normalized by its length (value between 0 and 1) | nt_position | float
Nucleotide name, including modified bases (like 5MC) | nt_name | str
One-letter name. Lowercase "acgtu" letters are used for modified "ACGTU" bases | nt_code | char
Letter used for sequence alignment. Gaps are replaced by the consensus base at the end. | nt_align_code | char
One-hot encoded sequence. 'other' contains gaps, unknown and modified nucleotides | is_A, is_C, is_G, is_U, is_other | 0 or 1
Nucleotide frequencies (PSSM) at the current position in this RNA family | freq_A, freq_C, freq_G, freq_U, freq_other | float
Secondary structure in dot-bracket notation of this position | dbn | char
Zero, or comma-separated values of index_chain of the nucleotide(s) which is(are) paired with this one. Canonical (Watson-Crick or Wobble) basepairs are first in the list. | paired | int, int, ...
Number of bases interacting with this one | nb_interact | int ≥ 0
Type of basepair in Leontis-Westhof nomenclature (comma-separated list) | pair_type_LW | str, str, ...
Type of basepair in DSSR nomenclature (comma-separated list) | pair_type_DSSR | str, str, ...
The six torsion angles of the backbone, from 5' to 3', between 0 and 2pi | alpha, beta, gamma, delta, epsilon, zeta | float (rad)
Difference between epsilon and zeta torsion angles | epsilon_zeta | float (rad)
Conformation of the backbone | bb_type | BI, BII, '..', or 'n/a'
Chi torsion angle (between ribose and base) | chi | float (rad)
Conformation of the sugar with respect to the base (depends on Chi) | glyco_bond | syn or anti
Torsion angles of the ribose cycle | v0, v1, v2, v3, v4 | float (rad)
If the nucleotide is involved in a stem, the stem type | form | A, B, Z or '.'
Z-coordinate of the 3' phosphorus atom with reference to the 5' base plane | ssZp | float
Perpendicular distance of the 3' P atom to the glycosidic bond | Dp | float
Pseudotorsions between P and C1' | eta, theta | float (rad)
Pseudotorsions between P and C4' | eta_prime, theta_prime | float (rad)
Pseudotorsions between P and the base center | eta_base, theta_base | float (rad)
Conformation of the ribose cycle | phase_angle | float (rad)
Amplitude of the sugar puckering | amplitude | float
Conformation of the ribose cycle (10 classes corresponding to specific ranges of phase) | puckering | str
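
Some of the descriptors above need light decoding before they can be fed to a model. The sketch below shows two possible helpers, both hypothetical and not part of RNANet itself: parse_paired turns the list of pairing partners into Python integers, and encode_angle maps an angle in radians to its sine and cosine so that values near 0 and near 2pi end up close together. Treating an empty field as "unpaired" is an assumption.

```python
import math


def parse_paired(value):
    """Decode the pairing-partner field.

    "0" means unpaired; otherwise the field is a comma-separated list of
    index_chain values, with canonical partners listed first.
    An empty field is treated as unpaired (assumption).
    """
    text = str(value).strip()
    if text in ("", "0"):
        return []
    return [int(token) for token in text.split(",") if token.strip()]


def encode_angle(angle_rad):
    """Map an angle in radians to (sin, cos), so that 0 and 2*pi coincide."""
    return (math.sin(angle_rad), math.cos(angle_rad))


# Example usage:
# partners = parse_paired("12,45")   # two partners, canonical one first
# sin_a, cos_a = encode_angle(1.57)  # periodic encoding of a torsion angle
```

The sine/cosine encoding is a common choice for periodic quantities such as torsion angles; whether it suits a given model is left to the user.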

Quick example

It is possible to build sub-datasets by querying the results/RNANet.db file. We provide an example using Python 3 with the sqlite3 and pandas packages:

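import sqlite3
import pandas as pd

with sqlite3.connect("results/RNANet.db") as connection:
    # Select chains from structures with a resolution better than 4.0 Å, oldest first.
    # The ON clause matters: without it, SQLite pairs every chain with every structure.
    # The key names (chain.structure_id, structure.pdb_id) are assumed here;
    # check them against the database schema.
    df = pd.read_sql("""SELECT structure_id, chain_name
                        FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
                        WHERE resolution < 4.0 ORDER BY date ASC;""", con=connection)

# Save the selection as a CSV file.
df.to_csv("my_custom_results.csv")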

A description of the database tables and fields, along with more examples of SQL queries, can be found in the README.md file on the IBISC forge (see the section "How to further filter the dataset"). The database schema is illustrated below:
RNANet database schema
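
The table and column names can also be listed directly from the database file, which helps when writing new queries. The sketch below relies on two standard SQLite features, the sqlite_master table and PRAGMA table_info. The function name describe_database is hypothetical.

```python
import sqlite3


def describe_database(connection):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;")]
    schema = {}
    for table in tables:
        # Table names cannot be bound as query parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = connection.execute(f"PRAGMA table_info({quoted});").fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Example usage:
# connection = sqlite3.connect("results/RNANet.db")
# for table, columns in describe_database(connection).items():
#     print(table, columns)
# connection.close()
```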

How to cite RNANet:
  • Becquey, L., Angel, E., & Tahi, F. Towards a reference dataset of RNA structures for machine learning applications. (upcoming)
Additional references:
  • The "ProteinNet" philosophy which inspired this work:
      • AlQuraishi, M. (2019). ProteinNet: A standardized data set for machine learning of protein structure. BMC Bioinformatics, 20(1), 311.
  • If you use our annotations by DSSR, you might want to cite:
      • Lu, X.-J. et al. (2015). DSSR: An integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Research, 43(21), e142.
  • If you use our multiple sequence alignments and homology data, you might want to cite:
      • Pruesse, E. et al. (2012). SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics, 28(14), 1823–1829.
      • Nawrocki, E. P. and Eddy, S. R. (2013). Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29(22), 2933–2935.
For any questions, comments or suggestions about RNANet, please feel free to contact: fariza.tahi@ibisc.univ-evry.fr