JOURNAL PUBLICATIONS

View in journal's website Access Software

PANDA-3D: protein function prediction based on AlphaFold models

Abstract

Previous protein function predictors primarily make predictions from amino acid sequences instead of tertiary structures because of the limited number of experimentally determined structures and the unsatisfactory quality of predicted structures. AlphaFold recently achieved promising performance when predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is fast-expanding. Therefore, we aimed to develop a deep-learning tool that is specifically trained with AlphaFold models and predicts gene ontology (GO) terms from AlphaFold models. We developed an advanced learning architecture by combining geometric vector perceptron graph neural networks and variant transformer decoder layers for multi-label classification. PANDA-3D predicts GO terms from the predicted structures of AlphaFold and the embeddings of amino acid sequences based on a large language model. Compared with an existing deep-learning method that makes predictions based on experimentally determined tertiary structures and an existing deep-learning method with amino acid sequences as input, our method outperformed them by significant margins. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB currently contains over 200 million predicted protein structures (as of May 1st, 2023), making PANDA-3D a useful tool that can accurately annotate the functions of a large number of proteins. PANDA-3D can be freely accessed as a web server from http://dna.cs.miami.edu/PANDA-3D/ and as a repository from https://github.com/zwang-bioinformatics/PANDA-3D/.

by Chenguang Zhao, Tong Liu, and Zheng Wang*

View in journal's website Access Software

EGG: Accuracy Estimation of Individual Multimeric Protein Models Using Deep Energy-Based Models and Graph Neural Networks

Abstract

Reliable and accurate methods of estimating the accuracy of predicted protein models are vital to understanding their respective utility. Discerning how the quaternary structure conforms can significantly improve our collective understanding of cell biology, systems biology, disease formation, and disease treatment. Accurately determining the quality of multimeric protein models is still computationally challenging, as the space of possible conformations is significantly larger when proteins form in complex with one another. Here, we present EGG (energy and graph-based architectures) to assess the accuracy of predicted multimeric protein models. We implemented message-passing and transformer layers to infer the overall fold and interface accuracy scores of predicted multimeric protein models. When evaluated with CASP15 targets, our methods achieved promising results against single model predictors: fourth and third place for determining the highest-quality model when estimating overall fold accuracy and overall interface accuracy, respectively, and first place for determining the top three highest quality models when estimating both overall fold accuracy and overall interface accuracy.

by Andrew Jordan Siciliano, Chenguang Zhao, Tong Liu, and Zheng Wang*

View in journal's website Access Software

Learning Micro-C from Hi-C with diffusion models

Abstract

In the last few years, Micro-C has shown itself as an improved alternative to Hi-C. It replaced the restriction enzymes in Hi-C assays with micrococcal nuclease (MNase), resulting in capturing nucleosome-resolution chromatin interactions. The signal-to-noise improvement of Micro-C allows it to detect more chromatin loops than high-resolution Hi-C. However, compared with the massive Hi-C datasets available in the literature, there are only a limited number of Micro-C datasets. To take full advantage of these Hi-C datasets, we present HiC2MicroC, a computational method that learns and then predicts Micro-C from Hi-C based on denoising diffusion probabilistic models (DDPM). We trained our DDPM and other regression models in the human foreskin fibroblast (HFFc6) cell line and evaluated these methods in six different cell types at 5-kb and 1-kb resolution. Our evaluations demonstrate that both HiC2MicroC and regression methods can markedly improve Hi-C towards Micro-C, and our DDPM-based HiC2MicroC outperforms regression in various terms. First, HiC2MicroC successfully recovers most of the Micro-C loops, even those not detected in Hi-C maps. Second, a majority of the HiC2MicroC-recovered loops anchor CTCF binding sites in a convergent orientation. Third, HiC2MicroC loops share genomic and epigenetic properties with Micro-C loops, including linking promoters and enhancers, and their anchors are enriched for structural proteins (CTCF and cohesin) and histone modifications. Lastly, we find our recovered loops are also consistent with the loops identified from promoter capture Micro-C (PCMicro-C) and Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET). Overall, HiC2MicroC is an effective tool for further studying Hi-C data with Micro-C as a template. HiC2MicroC is publicly available at https://github.com/zwang-bioinformatics/HiC2MicroC/.

by Tong Liu, Hao Zhu, and Zheng Wang*
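
For readers unfamiliar with the DDPM family named in the HiC2MicroC abstract above, the following is a minimal sketch of the standard forward (noising) step, in which a clean sample is mixed with Gaussian noise according to a variance schedule. The linear schedule, its endpoints, the number of steps, and all function names are illustrative assumptions; none of them is taken from HiC2MicroC, whose actual configuration is not described here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    x0 is a flat list of values (for example a flattened contact-map
    patch, a hypothetical input here); t is a 0-based step index.
    """
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x_t = forward_noise([0.2, 0.5, 0.9], t=999, alpha_bars=alpha_bars,
                    rng=random.Random(0))
```

At the last step almost no signal remains, which is why a model trained to reverse this process can start from pure noise and, when conditioned on an input such as a Hi-C map, generate a matching target.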

View in journal's website Access Software

C2c: Predicting Micro-C from Hi-C

Abstract

Motivation: High-resolution Hi-C data, capable of detecting chromatin features below the level of Topologically Associating Domains (TADs), significantly enhance our understanding of gene regulation. Micro-C, a variant of Hi-C incorporating a micrococcal nuclease (MNase) digestion step to examine interactions between nucleosome pairs, has been developed to overcome the resolution limitations of Hi-C. However, Micro-C experiments pose greater technical challenges compared to Hi-C, owing to the need for precise MNase digestion control and higher-resolution sequencing. Therefore, developing computational methods to derive Micro-C data from existing Hi-C datasets could lead to better usage of a large amount of existing Hi-C data in the scientific community and cost savings. Results: We developed C2c (high or upper case C to micro or lower case c), a computational tool based on a residual neural network to learn the mapping between Hi-C and Micro-C contact matrices and then predict Micro-C contact matrices based on Hi-C contact matrices. Our evaluation results show that the predicted Micro-C contact matrices reveal more chromatin loops than the input Hi-C contact matrices, and more of the loops detected from predicted Micro-C match the promoter–enhancer interactions. Furthermore, we found that the mutual loops from real and predicted Micro-C better match the ChIA-PET data compared to Hi-C and real Micro-C loops, and the predicted Micro-C leads to more TAD-boundaries detected compared to the Hi-C data. The website URL of C2c can be found in the Data Availability Statement.

by Hao Zhu, Tong Liu, and Zheng Wang*

View in journal's website

Methylation of histone H3 lysine 36 is a barrier for therapeutic interventions of head and neck squamous cell carcinoma

Abstract

Approximately 20% of head and neck squamous cell carcinomas (HNSCCs) exhibit reduced methylation on lysine 36 of histone H3 (H3K36me) due to mutations in histone methylase NSD1 or a lysine-to-methionine mutation in histone H3 (H3K36M). Whether such alterations of H3K36me can be exploited for therapeutic interventions is still unknown. Here, we show that HNSCC models expressing H3K36M can be divided into two groups: those that display aberrant accumulation of H3K27me3 and those that maintain steady levels of H3K27me3. The former group exhibits reduced proliferation, genome instability, and heightened sensitivity to genotoxic agents like PARP1/2 inhibitors. Conversely, H3K36M HNSCC models with constant H3K27me3 levels lack these characteristics unless H3K27me3 is elevated by DNA hypomethylating agents or inhibiting H3K27me3 demethylases KDM6A/B. Mechanistically, H3K36M reduces H3K36me by directly impeding the activities of the histone methyltransferase NSD3 and the histone demethylase LSD2. Notably, aberrant H3K27me3 levels induced by H3K36M expression are not a bona fide epigenetic mark because they require continuous expression of H3K36M to be inherited. Moreover, increased sensitivity to PARP1/2 inhibitors in H3K36M HNSCC models depends solely on elevated H3K27me3 levels and diminishing BRCA1- and FANCD2-dependent DNA repair. Finally, a PARP1/2 inhibitor alone reduces tumor burden in a H3K36M HNSCC xenograft model with elevated H3K27me3, whereas in a model with consistent H3K27me3, a combination of PARP1/2 inhibitors and agents that up-regulate H3K27me3 proves to be successful. These findings underscore the crucial balance between H3K36 and H3K27 methylation in maintaining genome instability, offering new therapeutic options for patients with H3K36me-deficient tumors.

by Lucas D. Caeiro, Yuichiro Nakata, Rodrigo L. Borges, Mengsheng Zha, Liliana Garcia-Martinez, Carolina P. Baños, Stephanie Stransky, Tong Liu, Ho Lam Chan, John Brabson, Diana Domínguez, Yusheng Zhang, Peter W. Lewis, Salvador Aznar Benitah, Luisa Cimmino, Daniel Bilbao, Simone Sidoli, Zheng Wang, Ramiro E. Verdun* and Lluis Morey*

View in journal's website Access Software

HiC4D: forecasting spatiotemporal Hi-C data with residual ConvLSTM

Abstract

Hi-C experiments have been extensively used for the study of genomic structures. In the last few years, spatiotemporal Hi-C has largely contributed to the investigation of genome dynamic reorganization. However, computational modeling and forecasting of spatiotemporal Hi-C data have not yet been seen in the literature. We present HiC4D for dealing with the problem of forecasting spatiotemporal Hi-C data. We designed and benchmarked a novel network and named it residual ConvLSTM (ResConvLSTM), which is a combination of residual network and convolutional long short-term memory (ConvLSTM). We evaluated our new ResConvLSTM networks and compared them with five other methods, including a naïve network (NaiveNet) that we designed as a baseline method and four outstanding video-prediction methods from the literature: ConvLSTM, spatiotemporal LSTM (ST-LSTM), self-attention LSTM (SA-LSTM) and simple video prediction (SimVP). We used eight different spatiotemporal Hi-C datasets for the blind test, including two from mouse embryogenesis, one from somatic cell nuclear transfer (SCNT) embryos, three embryogenesis datasets from different species and two non-embryogenesis datasets. Our evaluation results indicate that our ResConvLSTM networks almost always outperform the other methods on the eight blind-test datasets in terms of accurately predicting the Hi-C contact matrices at future time-steps. Our benchmarks also indicate that all of the methods that we benchmarked can successfully recover the boundaries of topologically associating domains called on the experimental Hi-C contact matrices. Taken together, our benchmarks suggest that HiC4D is an effective tool for predicting spatiotemporal Hi-C data. HiC4D is publicly available at both http://dna.cs.miami.edu/HiC4D/ and https://github.com/zwang-bioinformatics/HiC4D/.

by Tong Liu and Zheng Wang*

View in journal's website Access Software

DeepChIA-PET: Accurately predicting ChIA-PET from Hi-C and ChIP-seq with deep dilated networks

Abstract

Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) can capture genome-wide chromatin interactions mediated by a specific DNA-associated protein. ChIA-PET experiments have been applied to explore the key roles of different protein factors in chromatin folding and transcription regulation. However, compared with widely available Hi-C and ChIP-seq data, there are not many ChIA-PET datasets available in the literature. A computational method that accurately predicts ChIA-PET interactions from Hi-C and ChIP-seq data is needed to save the effort of performing wet-lab experiments. Here we present DeepChIA-PET, a supervised deep learning approach that can accurately predict ChIA-PET interactions by learning the latent relationships between ChIA-PET and two widely used data types: Hi-C and ChIP-seq. We trained our deep models with CTCF-mediated ChIA-PET of GM12878 as ground truth, and the deep network contains 40 dilated residual convolutional blocks. We first showed that DeepChIA-PET with only Hi-C as input significantly outperforms Peakachu, another computational method for predicting ChIA-PET from Hi-C but using random forests. We next proved that adding ChIP-seq as one extra input does improve the classification performance of DeepChIA-PET, but Hi-C plays a more prominent role in DeepChIA-PET than ChIP-seq. Our evaluation results indicate that our learned models can accurately predict not only CTCF-mediated ChIA-PET in GM12878 and HeLa but also non-CTCF ChIA-PET interactions, including RNA polymerase II (RNAPII) ChIA-PET of GM12878, RAD21 ChIA-PET of GM12878, and RAD21 ChIA-PET of K562. In total, DeepChIA-PET is an accurate tool for predicting the ChIA-PET interactions mediated by various chromatin-associated proteins from different cell types.

by Tong Liu and Zheng Wang*

View in journal's website Access Software

scHiMe: predicting single-cell DNA methylation levels based on single-cell Hi-C data

Abstract

Recently a biochemistry experiment named methyl-3C was developed to simultaneously capture the chromosomal conformations and DNA methylation levels on individual single cells. However, the number of data sets generated from this experiment is still small in the scientific community compared with the greater amount of single-cell Hi-C data generated from separate single cells. Therefore, a computational tool to predict single-cell methylation levels based on single-cell Hi-C data on the same individual cells is needed. We developed a graph transformer named scHiMe to accurately predict the base-pair-specific (bp-specific) methylation levels based on both single-cell Hi-C data and DNA nucleotide sequences. We benchmarked scHiMe for predicting the bp-specific methylation levels on all of the promoters of the human genome, all of the promoter regions together with the corresponding first exon and intron regions, and random regions on the whole genome. Our evaluation showed a high consistency between the predicted and methyl-3C-detected methylation levels. Moreover, the predicted DNA methylation levels resulted in accurate classifications of cells into different cell types, which indicated that our algorithm successfully captured the cell-to-cell variability in the single-cell Hi-C data. scHiMe is freely available at http://dna.cs.miami.edu/scHiMe/.

by Hao Zhu, Tong Liu and Zheng Wang*

View in journal's website Access Software MASS2 Access Software LAW

Predicting residue-specific qualities of individual protein models using residual neural networks and graph neural networks

Abstract

The estimation of protein model accuracy (EMA) or model quality assessment (QA) is important for protein structure prediction. An accurate EMA algorithm can guide the refinement of models or pick the best model or best parts of models from a pool of predicted tertiary structures. We developed two novel methods: MASS2 and LAW, for predicting residue-specific or local qualities of individual models, which incorporate residual neural networks and graph neural networks, respectively. These two methods use similar features extracted from protein models but different architectures of neural networks to predict the local accuracies of single models. MASS2 and LAW participated in the QA category of CASP14, and according to our evaluations based on CASP14 official criteria, MASS2 and LAW are the best and second-best methods based on the Z-scores of ASE/100, AUC, and ULR-1.F1. We also evaluated MASS2, LAW, and the residue-specific predicted deviations (between model and native structure) generated by AlphaFold2 on CASP14 AlphaFold2 tertiary structure (TS) models. LAW achieved comparable or better performances compared to the predicted deviations generated by AlphaFold2 on AlphaFold2 TS models, even though LAW was not trained on any AlphaFold2 TS models. Specifically, LAW performed better on AUC and ULR scores, and AlphaFold2 performed better on ASE scores. This means that AlphaFold2 is better at predicting deviations, but LAW is better at classifying accurate and inaccurate residues and detecting unreliable local regions. MASS2 and LAW can be freely accessed from http://dna.cs.miami.edu/MASS2-CASP14/ and http://dna.cs.miami.edu/LAW-CASP14/, respectively.

by Chenguang Zhao, Tong Liu and Zheng Wang*

View in journal's website Access Software

scHiCEmbed: bin-specific embeddings of single-cell Hi-C data using graph auto-encoders

Abstract

Most publicly accessible single-cell Hi-C data are sparse and cannot reach a higher resolution. Therefore, learning latent representations (bin-specific embeddings) of sparse single-cell Hi-C matrices would provide us with a novel way of mining valuable information hidden in the limited number of single-cell Hi-C contacts. We present scHiCEmbed, an unsupervised computational method for learning bin-specific embeddings of single-cell Hi-C data, and the computational system is applied to the tasks of 3D structure reconstruction of whole genomes and detection of topologically associating domains (TAD). The only input of scHiCEmbed is a raw or scHiCluster-imputed single-cell Hi-C matrix. The main process of scHiCEmbed is to embed each node/bin in a higher dimensional space using graph auto-encoders. The learned n-by-3 bin-specific embedding/latent matrix is considered the final reconstructed 3D genome structure. For TAD detection, we use constrained hierarchical clustering on the latent matrix to classify bins: S_Dbw is used to determine the optimal number of clusters, and each cluster is considered as one potential TAD. Our reconstructed 3D structures for individual chromatins at different cell stages reveal the expanding process of chromatins during the cell cycle. We observe that the TADs called from single-cell Hi-C data are not shared across individual cells and that the TAD boundaries called from raw or imputed single-cell Hi-C are significantly different from those called from bulk Hi-C, confirming the cell-to-cell variability in terms of TAD definitions. The source code for scHiCEmbed is publicly available, and the URL can be found in the conclusion section.

by Tong Liu and Zheng Wang*

View in journal's website

Functional similarities of protein-coding genes in topologically associating domains and spatially-proximate genomic regions

Abstract

Topologically associating domains (TADs) are the structural and functional units of the genome. However, the functions of protein-coding genes existing in the same or different TADs have not been fully investigated. We compared the functional similarities of protein-coding genes existing in the same TAD and between different TADs, and also in the same gap region (the region between two consecutive TADs) and between different gap regions. We found that the protein-coding genes from the same TAD or gap region are more likely to share similar protein functions, and this trend is more obvious with TADs than the gap regions. We further created two types of gene–gene spatial interaction networks: the first type is based on Hi-C contacts, whereas the second type is based on both Hi-C contacts and the relationship of being in the same TAD. A graph auto-encoder was applied to learn the network topology, reconstruct the two types of networks, and predict the functions of the central genes/nodes based on the functions of the neighboring genes/nodes. It was found that better performance was achieved with the second type of network. Furthermore, we detected long-range spatially-interactive regions based on Hi-C contacts and calculated the functional similarities of the gene pairs from these regions.

by Chenguang Zhao, Tong Liu, and Zheng Wang*

View in journal's website Access Software

PANDA2: protein function prediction using graph neural networks

Abstract

High-throughput sequencing technologies have generated massive numbers of protein sequences, but the annotation of protein sequences relies heavily on low-throughput and expensive biological experiments. Therefore, accurate and fast computational alternatives are needed to infer functional knowledge from protein sequences. The gene ontology (GO) directed acyclic graph (DAG) contains the hierarchical relationships between GO terms but is hard to integrate into machine learning algorithms for functional predictions. We developed a deep learning system named PANDA2 to predict protein functions, which used the cutting-edge graph neural network to model the topology of the GO DAG and integrated the features generated by transformer protein language models. Compared with the top 10 methods in CAFA3, PANDA2 ranked first in cellular component ontology (CCO), tied first in biological process ontology (BPO) but had a higher coverage rate, and second in molecular function ontology (MFO). Compared with other recently-developed cutting-edge predictors DeepGOPlus, GOLabeler, and DeepText2GO, and benchmarked on another independent dataset, PANDA2 ranked first in CCO, first in BPO, and second in MFO. PANDA2 can be freely accessed from http://dna.cs.miami.edu/PANDA2/.

by Chenguang Zhao, Tong Liu, and Zheng Wang*

View in journal's website

Vitamin D modulation of mitochondrial oxidative metabolism and mTOR enforces stress adaptations and anti-cancer responses

Abstract

The relationship between the active form of vitamin D3 (1,25-dihydroxyvitamin D, 1,25(OH)2D) and reactive oxygen species (ROS), two integral signaling molecules of the cell, is poorly understood. This is striking, given that both factors are involved in cancer cell regulation and metabolism. Mitochondria (mt) dysfunction is one of the main drivers of cancer, producing more mitochondria, higher cellular energy and ROS that can enhance oxidative stress and stress tolerance responses. To study the effects of 1,25(OH)2D on metabolic and mt dysfunction, we used the vitamin D receptor (VDR)-sensitive MG-63 osteosarcoma cell model. Using biochemical approaches, 1,25(OH)2D decreased mt ROS levels, membrane potential, biogenesis, and translation, while enforcing endoplasmic reticulum/mitohormetic stress adaptive responses. Using a mitochondria-focused transcriptomic approach, gene set enrichment and pathway analyses show that 1,25(OH)2D lowered mt fusion/fission and oxidative phosphorylation (OXPHOS). By contrast, mitophagy, ROS defense, and epigenetic gene regulation were enhanced after 1,25(OH)2D treatment, as well as key metabolic enzymes that regulate fluxes of substrates for cellular architecture and a shift toward non-oxidative energy metabolism. ATACseq revealed putative oxi-sensitive and tumor-suppressing transcription factors that may regulate important mt functional genes such as the mTORC1 inhibitor, DDIT4/REDD1. DDIT4/REDD1 was predominantly localized to the outer mt membrane in untreated MG-63 cells yet sequestered in the cytoplasm after 1,25(OH)2D and rotenone treatments, suggesting a level of control by membrane depolarization to facilitate its cytoplasmic mTORC1 inhibitory function. The results show that 1,25(OH)2D activates distinct adaptive metabolic responses involving mitochondria to regain redox balance and control the growth of osteosarcoma cells.

by Mikayla Quigley, Sandra Rieger, Enrico Capobianco, Zheng Wang, Hengguang Zhao, Martin Hewison, and Thomas S. Lisse*

View in journal's website

The Polycomb protein RING1B enables estrogen-mediated gene expression by promoting enhancer-promoter interaction and R-loop formation

Abstract

Polycomb complexes have traditionally been ascribed roles as transcriptional repressors, though increasing evidence demonstrates they can also activate gene expression. However, the mechanisms underlying positive gene regulation mediated by Polycomb proteins are poorly understood. Here, we show that RING1B, a core component of Polycomb Repressive Complex 1, regulates enhancer–promoter interaction of the bona fide estrogen-activated GREB1 gene. Systematic characterization of RNA:DNA hybrid formation (R-loops), nascent transcription and RNA Pol II activity upon estrogen administration revealed a key role of RING1B in gene activation by regulating R-loop formation and RNA Pol II elongation. We also found that the estrogen receptor alpha (ERα) and RNA are both necessary for full RING1B recruitment to estrogen-activated genes. Notably, RING1B recruitment was mostly unaffected upon RNA Pol II depletion. Our findings delineate the functional interplay between RING1B, RNA and ERα to safeguard chromatin architecture perturbations required for estrogen-mediated gene regulation and highlight the crosstalk between steroid hormones and Polycomb proteins to regulate oncogenic programs.

by Yusheng Zhang, Tong Liu, Fenghua Yuan, Liliana Garcia-Martinez, Kyutae D Lee, Stephanie Stransky, Simone Sidoli, Ramiro E Verdun, Yanbin Zhang, Zheng Wang, and Lluis Morey*

View in journal's website Access Software

Inferring single-Cell 3D chromosomal structures based on the Lennard-Jones potential

Abstract

Reconstructing three-dimensional (3D) chromosomal structures based on single-cell Hi-C data is a challenging scientific problem due to the extreme sparseness of the single-cell Hi-C data. In this research, we used the Lennard-Jones potential to reconstruct both 500 kb and high-resolution 50 kb chromosomal structures based on single-cell Hi-C data. A chromosome was represented by a string of 500 kb or 50 kb DNA beads and put into a 3D cubic lattice for simulations. A 2D Gaussian function was used to impute the sparse single-cell Hi-C contact matrices. We designed a novel loss function based on the Lennard-Jones potential, in which the ε value, i.e., the well depth, was used to indicate how stable the binding of every pair of beads is. For the bead pairs that have single-cell Hi-C contacts and their neighboring bead pairs, the loss function assigns them stronger binding stability. The Metropolis–Hastings algorithm was used to try different locations for the DNA beads, and simulated annealing was used to optimize the loss function. We proved the correctness and validity of the reconstructed 3D structures by evaluating the models according to multiple criteria and comparing the models with 3D-FISH data.

by Mengsheng Zha, Nan Wang, Chaoyang Zhang, and Zheng Wang*
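
As a rough illustration of two ingredients named in the abstract above, the sketch below evaluates the standard 12-6 Lennard-Jones potential and applies a Metropolis acceptance test of the kind used inside simulated annealing. The function names, parameter values, and temperature handling are illustrative assumptions; this is not the paper's loss function, which the abstract describes only at a high level.

```python
import math
import random


def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones potential for a pair at distance r > 0.

    epsilon is the well depth and sigma is the distance at which the
    potential crosses zero (default values are assumptions).
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def metropolis_accept(delta_energy, temperature, rng=random):
    """Metropolis criterion: always accept downhill moves, and accept
    uphill moves with probability exp(-delta_energy / temperature)."""
    if delta_energy <= 0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)


# The minimum sits at r = 2**(1/6) * sigma, where the energy equals -epsilon.
r_min = 2 ** (1 / 6)
well = lennard_jones(r_min, epsilon=2.0)
```

A larger epsilon deepens the well, so a bead pair assigned a larger epsilon is penalized more strongly for drifting away from its preferred separation. In simulated annealing the temperature passed to the acceptance test is lowered gradually, so uphill moves become progressively rarer.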

View in journal's website Access Software

MASS: predict the global qualities of individual protein models using random forests and novel statistical potentials

Background

Protein model quality assessment (QA) is an essential procedure in protein structure prediction. QA methods can predict the qualities of protein models and identify good models from decoys. Clustering-based methods need a certain number of models as input. However, if a pool of models is not available, methods that only need a single model as input are indispensable.

Results

We developed MASS, a QA method to predict the global qualities of individual protein models using random forests and various novel energy functions. We designed six novel energy functions or statistical potentials that can capture the structural characteristics of a protein model, which can also be used in other protein-related bioinformatics research. MASS potentials demonstrated higher importance than the energy functions of RWplus, GOAP, DFIRE and Rosetta when the scores they generated are used as machine learning features. MASS outperforms almost all of the four CASP11 top-performing single-model methods for global quality assessment in terms of all of the four evaluation criteria officially used by CASP, which measure the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS has also achieved comparable performances with the leading QA methods in CASP12 and CASP13.

Conclusion

MASS and the source code for all MASS potentials are publicly available at http://dna.cs.miami.edu/MASS/.

by Tong Liu and Zheng Wang*

View in journal's website Access Software

normGAM: an R package to remove systematic biases in genome architecture mapping data

Background

The genome architecture mapping (GAM) technique can capture genome-wide chromatin interactions. However, besides the known systematic biases in the raw GAM data, we have found a new type of systematic bias. It is necessary to develop and evaluate effective normalization methods to remove all systematic biases in the raw GAM data.

Results

We have detected a new type of systematic bias, the fragment length bias, in the genome architecture mapping (GAM) data, which is significantly different from the bias of window detection frequency previously mentioned in the paper introducing the GAM method but is similar to the bias of distances between restriction sites existing in raw Hi-C data. We have found that the normalization method (a normalized variant of the linkage disequilibrium) used in the GAM paper is not able to effectively eliminate the new fragment length bias at 1 Mb resolution (slightly better at 30 kb resolution). We have developed an R package named normGAM for eliminating the new fragment length bias together with the other three biases existing in raw GAM data, which are the biases related to window detection frequency, mappability, and GC content. Five normalization methods have been implemented and included in the R package: Knight-Ruiz 2-norm (KR2, newly designed by us), normalized linkage disequilibrium (NLD), vanilla coverage (VC), sequential component normalization (SCN), and iterative correction and eigenvector decomposition (ICE).

Conclusion

Based on our evaluations, the five normalization methods can eliminate the four biases existing in raw GAM data, with VC and KR2 performing better than the others. We have observed that the KR2-normalized GAM data have a higher correlation with the KR-normalized Hi-C data on the same cell samples indicating that the KR-related methods are better than the others for keeping the consistency between the GAM and Hi-C experiments. Compared with the raw GAM data, the normalized GAM data are more consistent with the normalized distances from the fluorescence in situ hybridization (FISH) experiments. The source code of normGAM can be freely downloaded from http://dna.cs.miami.edu/normGAM/.

by Tong Liu and Zheng Wang*
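
To make one of the five normalization methods listed in the normGAM entry above more concrete, here is a minimal sketch of vanilla coverage (VC) in one common formulation, where each matrix entry is divided by the product of its row sum and column sum. This is written in Python for illustration only and is not normGAM's R implementation; the function name, the zero-sum handling, and the toy matrix are assumptions.

```python
def vanilla_coverage(matrix):
    """Divide each entry by the product of its row sum and column sum.

    Entries whose row or column sums to zero are set to 0.0 to avoid
    division by zero (a choice made for this sketch).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    normalized = []
    for i in range(n_rows):
        out_row = []
        for j in range(n_cols):
            denom = row_sums[i] * col_sums[j]
            out_row.append(matrix[i][j] / denom if denom else 0.0)
        normalized.append(out_row)
    return normalized


# Toy symmetric contact matrix (made-up values).
contacts = [
    [4.0, 2.0, 0.0],
    [2.0, 6.0, 2.0],
    [0.0, 2.0, 8.0],
]
vc = vanilla_coverage(contacts)
```

VC is a single-pass correction, whereas iterative methods such as ICE repeat a similar rescaling until the row and column sums converge.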

View in journal's website Access Software

Exploring the 2D and 3D structural properties of topologically associating domains

Background

Topologically associating domains (TADs) are genomic regions with varying lengths. The interactions within TADs are more frequent than those between different TADs. TADs or sub-TADs are considered the structural and functional units of the mammalian genomes. Although TADs are important for understanding how genomes function, we have limited knowledge about their 3D structural properties.

Results

In this study, we designed and benchmarked three metrics for capturing the three-dimensional and two-dimensional structural signatures of TADs, which can help better understand TADs' structural properties and the relationships between structural properties and genetic and epigenetic features. The first metric for capturing 3D structural properties is radius of gyration, which in this study is used to measure the spatial compactness of TADs. The mass value of each DNA bead in a 3D structure is novelly defined as one or more genetic or epigenetic feature(s). The second metric is folding degree. The last metric is exponent parameter, which is used to capture the 2D structural properties based on TADs' Hi-C contact matrices. In general, we observed significant correlations between the three metrics and the genetic and epigenetic features. We made the same observations when using H3K4me3, transcription start sites, and RNA polymerase II to represent the mass value in the modified radius-of-gyration metric. Moreover, we have found that the TADs in the clusters of depleted chromatin states apparently correspond to smaller exponent parameters and larger radii of gyration. In addition, a new objective function of multidimensional scaling for modelling chromatin or TADs 3D structures was designed and benchmarked, which can handle the DNA bead-pairs with zero Hi-C contact values.

Conclusion

The web server for reconstructing chromatin 3D structures using multiple different objective functions and the related source code are publicly available at http://dna.cs.miami.edu/3DChrom/.

by Tong Liu and Zheng Wang*

View in journal's website

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Background

The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function.

Results

Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory.

Conclusion

We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

by Naihui Zhou, Yuxiang Jiang, Timothy R. Bergquist, Alexandra J. Lee, Balint Z. Kacsoh, Alex W. Crocker, Kimberley A. Lewis, George Georghiou, Huy N. Nguyen, Md Nafiz Hamid, Larry Davis, Tunca Dogan, Volkan Atalay, Ahmet S. Rifaioglu, Alperen Dalkiran, Rengul Cetin Atalay, Chengxin Zhang, Rebecca L. Hurto, Peter L. Freddolino, Yang Zhang, Prajwal Bhat, Fran Supek, José M. Fernández, Branislava Gemovic, Vladimir R. Perovic, Radoslav S. Davidovic, Neven Sumonja, Nevena Veljkovic, Ehsaneddin Asgari, Mohammad R.K. Mofrad, Giuseppe Profiti, Castrense Savojardo, Pier Luigi Martelli, Rita Casadio, Florian Boecker, Heiko Schoof, Indika Kahanda, Natalie Thurlby, Alice C. McHardy, Alexandre Renaux, Rabie Saidi, Julian Gough, Alex A. Freitas, Magdalena Antczak, Fabio Fabris, Mark N. Wass, Jie Hou, Jianlin Cheng, Zheng Wang, Alfonso E. Romero, Alberto Paccanaro, Haixuan Yang, Tatyana Goldberg, Chenguang Zhao, Liisa Holm, Petri Törönen, Alan J. Medlar, Elaine Zosa, Itamar Borukhov, Ilya Novikov, Angela Wilkins, Olivier Lichtarge, Po-Han Chi, Wei-Cheng Tseng, Michal Linial, Peter W. Rose, Christophe Dessimoz, Vedrana Vidulin, Saso Dzeroski, Ian Sillitoe, Sayoni Das, Jonathan Gill Lees, David T. Jones, Cen Wan, Domenico Cozzetto, Rui Fa, Mateo Torres, Alex Warwick Vesztrocy, Jose Manuel Rodriguez, Michael L. Tress, Marco Frasca, Marco Notaro, Giuliano Grossi, Alessandro Petrini, Matteo Re, Giorgio Valentini, Marco Mesiti, Daniel B. Roche, Jonas Reeb, David W. Ritchie, Sabeur Aridhi, Seyed Ziaeddin Alborzi, Marie-Dominique Devignes, Da Chen Emily Koo, Richard Bonneau, Vladimir Gligorijevic, Meet Barot, Hai Fang, Stefano Toppo, Enrico Lavezzo, Marco Falda, Michele Berselli, Silvio C.E. Tosatto, Marco Carraro, Damiano Piovesan, Hafeez Ur Rehman, Qizhong Mao, Shanshan Zhang, Slobodan Vucetic, Gage S. Black, Dane Jo, Erica Suh, Jonathan B. Dayton, Dallas J. Larsen, Ashton R. Omdahl, Liam J. McGuffin, Danielle A. Brackenridge, Patricia C. Babbitt, Jeffrey M. Yunes, Paolo Fontana, Feng Zhang, Shanfeng Zhu, Ronghui You, Zihan Zhang, Suyang Dai, Shuwei Yao, Weidong Tian, Renzhi Cao, Caleb Chandler, Miguel Amezola, Devon Johnson, Jia-Ming Chang, Wen-Hung Liao, Yi-Wei Liu, Stefano Pascarelli, Yotam Frank, Robert Hoehndorf, Maxat Kulmanov, Imane Boudellioua, Gianfranco Politano, Stefano Di Carlo, Alfredo Benso, Kai Hakala, Filip Ginter, Farrokh Mehryary, Suwisa Kaewphan, Jari Björne, Hans Moen, Martti E.E. Tolvanen, Tapio Salakoski, Daisuke Kihara, Aashish Jain, Tomislav Šmuc, Adrian Altenhoff, Asa Ben-Hur, Burkhard Rost, Steven E. Brenner, Christine A. Orengo, Constance J. Jeffery, Giovanni Bosco, Deborah A. Hogan, Maria J. Martin, Claire O'Donovan, Sean D. Mooney, Casey S. Greene, Predrag Radivojac* and Iddo Friedberg*

View in journal's website Access Software

HiCNN2: Enhancing the resolution of Hi-C data using an ensemble of convolutional neural networks

Abstract

We present a deep-learning package named HiCNN2 to learn the mapping between low-resolution and high-resolution Hi-C (a technique for capturing genome-wide chromatin interactions) data, which can enhance the resolution of Hi-C interaction matrices. The HiCNN2 package includes three methods each with a different deep learning architecture: HiCNN2-1 is based on one single convolutional neural network (ConvNet); HiCNN2-2 consists of an ensemble of two different ConvNets; and HiCNN2-3 is an ensemble of three different ConvNets. Our evaluation results indicate that HiCNN2-enhanced high-resolution Hi-C data achieve smaller mean squared error and higher Pearson's correlation coefficients with experimental high-resolution Hi-C data compared with existing methods HiCPlus and HiCNN. Moreover, all of the three HiCNN2 methods can recover more significant interactions detected by Fit-Hi-C compared to HiCPlus and HiCNN. Based on our evaluation results, we would recommend using HiCNN2-1 and HiCNN2-3 if recovering more significant interactions from Hi-C data is of interest, and HiCNN2-2 and HiCNN if the goal is to achieve higher reproducibility scores between the enhanced Hi-C matrix and the real high-resolution Hi-C matrix.

by Tong Liu and Zheng Wang*

View in journal's website Read the news

Epigenomic signatures underpin the axonal regenerative ability of dorsal root ganglia sensory neurons

Abstract

Axonal injury results in regenerative success or failure, depending on whether the axon lies in the peripheral or the CNS, respectively. The present study addresses whether epigenetic signatures in dorsal root ganglia discriminate between regenerative and non-regenerative axonal injury. Chromatin immunoprecipitation for the histone 3 (H3) post-translational modifications H3K9ac, H3K27ac and H3K27me3; an assay for transposase-accessible chromatin; and RNA sequencing were performed in dorsal root ganglia after sciatic nerve or dorsal column axotomy. Distinct histone acetylation and chromatin accessibility signatures correlated with gene expression after peripheral, but not central, axonal injury. DNA-footprinting analyses revealed new transcriptional regulators associated with regenerative ability. Machine-learning algorithms inferred the direction of most of the gene expression changes. Neuronal conditional deletion of the chromatin remodeler CCCTC-binding factor impaired nerve regeneration, implicating chromatin organization in the regenerative competence. Altogether, the present study offers the first epigenomic map providing insight into the transcriptional response to injury and the differential regenerative ability of sensory neurons.

by Ilaria Palmisano*, Matt C. Danzi, Thomas H. Hutson, Luming Zhou, Eilidh McLachlan, Elisabeth Serger, Kirill Shkura, Prashant K. Srivastava, Arnau Hervera, Nick O'Neill, Tong Liu, Hassen Dhrif, Zheng Wang, Miroslav Kubat, Stefan Wuchty, Matthias Merkenschlager, Liron Levi, Evan Elliott, John L. Bixby, Vance P. Lemmon and Simone Di Giovanni*

View in journal's website Access Software

Inferring the three-dimensional structures of the X-chromosome during X-inactivation

Abstract

The Hi-C experiment can capture the genome-wide spatial proximities of the DNA, based on which it is possible to computationally reconstruct the three-dimensional (3D) structures of chromosomes. The transcripts of the long non-coding RNA (lncRNA) Xist spread throughout the entire X-chromosome and alter the 3D structure of the X-chromosome, which also inactivates one copy of the two X-chromosomes in a cell. The Hi-C experiments are expensive and time-consuming to conduct, but the Hi-C data of the active and inactive X-chromosomes are available. However, the Hi-C data of the X-chromosome during the process of X-chromosome inactivation (XCI) are not available. Therefore, the 3D structure of the X-chromosome during XCI remains unknown. We have developed a new approach to reconstruct the 3D structure of the X-chromosome during XCI, in which the chain of DNA beads representing a chromosome is stored and simulated inside a 3D cubic lattice. A 2D Gaussian function is used to model the zero values in the 2D Hi-C contact matrices. By applying simulated annealing and Metropolis-Hastings simulations, we first generated the 3D structures of the X-chromosome before and after XCI. Then, we used Xist localization intensities on the X-chromosome (RAP data) to model the traveling speeds or acceleration between all bead pairs during the process of XCI. The 3D structures of the X-chromosome at 3 hours, 6 hours, and 24 hours after the start of the Xist expression, which initiates the XCI process, have been reconstructed. The source code and the reconstructed 3D structures of the X-chromosome can be downloaded from http://dna.cs.miami.edu/3D-XCI/.

by Hao Zhu, Nan Wang, Jonathan Z. Sun, Ras B. Pandey and Zheng Wang*

View in journal's website Access Software

HiCNN: A very deep convolutional neural network to better enhance the resolution of Hi-C data

Motivations

High-resolution Hi-C data are indispensable for the studies of three-dimensional (3D) genome organization at kilobase level. However, generating high-resolution Hi-C data (e.g., 5 kb) by conducting Hi-C experiments needs millions of mammalian cells, which may eventually generate billions of paired-end reads with a high sequencing cost. Therefore, it will be important and helpful if we can enhance the resolutions of Hi-C data by computational methods.

Results

We developed a new computational method named HiCNN that used a 54-layer very deep convolutional neural network to enhance the resolutions of Hi-C data. The network contains both global and local residual learning with multiple speedup techniques included, resulting in fast convergence. We used mean squared errors and Pearson's correlation coefficients between real high-resolution and computationally predicted high-resolution Hi-C data to evaluate the method. The evaluation results show that HiCNN consistently outperforms HiCPlus, the only existing tool in the literature, when training and testing data are extracted from the same cell type (i.e., GM12878) and from two different cell types in the same or different species (i.e., GM12878 as training with K562 as testing, and GM12878 as training with CH12-LX as testing). We further found that the HiCNN-enhanced high-resolution Hi-C data are more consistent with real experimental high-resolution Hi-C data than HiCPlus-enhanced data in terms of indicating statistically significant interactions. Moreover, HiCNN can efficiently enhance low-resolution Hi-C data, which eventually helps recover two chromatin loops that were confirmed by 3D-FISH.
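The two evaluation measures named above, mean squared error and Pearson's correlation coefficient, can be sketched for a pair of contact matrices as follows. This is a minimal illustration on small hypothetical matrices, assuming both are flattened element by element; it is not the evaluation script used in the paper.

```python
import math

def flatten(matrix):
    """Turn a 2D list into a flat list of values."""
    return [value for row in matrix for value in row]

def mean_squared_error(a, b):
    x, y = flatten(a), flatten(b)
    if len(x) != len(y):
        raise ValueError("matrices must have the same number of entries")
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)

def pearson(a, b):
    x, y = flatten(a), flatten(b)
    if len(x) != len(y):
        raise ValueError("matrices must have the same number of entries")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    spread_x = math.sqrt(sum((p - mean_x) ** 2 for p in x))
    spread_y = math.sqrt(sum((q - mean_y) ** 2 for q in y))
    return cov / (spread_x * spread_y)  # undefined if either matrix is constant

# Hypothetical 2x2 contact matrices: experimental vs. computationally enhanced.
real = [[4, 2], [2, 6]]
enhanced = [[5, 2], [2, 5]]
print(mean_squared_error(real, enhanced), pearson(real, enhanced))
```

Lower mean squared error and higher correlation both indicate that the enhanced matrix is closer to the experimental one.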

Availability

HiCNN is freely available at http://dna.cs.miami.edu/HiCNN/.

by Tong Liu and Zheng Wang*

View in journal's website Access database

TADKB: Family classification and a knowledge base of topologically associating domains

Background

Topologically associating domains (TADs) are considered the structural and functional units of the genome. However, there is a lack of an integrated resource for TADs in the literature where researchers can obtain family classifications and detailed information about TADs.

Results

We built an online knowledge base TADKB integrating knowledge for TADs in eleven cell types of human and mouse. For each TAD, TADKB provides the predicted three-dimensional (3D) structures of chromosomes and TADs, and detailed annotations about the protein-coding genes and long non-coding RNAs (lncRNAs) existent in each TAD. Besides the 3D chromosomal structures inferred by population Hi-C, the single-cell haplotype-resolved chromosomal 3D structures of 17 GM12878 cells are also integrated in TADKB. A user can submit query gene/lncRNA ID/sequence to search for the TAD(s) that contain(s) the query gene or lncRNA. We also classified TADs into families. To achieve that, we used the TM-scores between reconstructed 3D structures of TADs as structural similarities and the Pearson's correlation coefficients between the fold enrichment of chromatin states as functional similarities. All of the TADs in one cell type were clustered based on structural and functional similarities respectively using the spectral clustering algorithm with various predefined numbers of clusters. We have compared the overlapping TADs from structural and functional clusters and found that most of the TADs in the functional clusters with depleted chromatin states are clustered into one or two structural clusters. This novel finding indicates a connection between the 3D structures of TADs and their DNA functions in terms of chromatin states.

Conclusion

TADKB is available at http://dna.cs.miami.edu/TADKB/.

by Tong Liu, Jacob Porter, Chenguang Zhao, Hao Zhu, Nan Wang, Zheng Sun, Yin-Yuan Mo and Zheng Wang*

View in journal's website Access software

Predicting protein residue-residue contacts using random forests and deep networks

Background

The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins. For example, contact prediction can be used to reduce the computational complexity of predicting the structure of proteins and even to help identify functionally important regions of proteins. These predictions are becoming especially important given the relatively low number of experimentally determined protein structures compared to the amount of available protein sequence data.

Results

Here we have developed and benchmarked a set of machine learning methods for performing residue-residue contact prediction, including random forests, direct-coupling analysis, support vector machines, and deep networks (stacked denoising autoencoders). These methods are able to predict contacting residue pairs given only the amino acid sequence of a protein. According to our own evaluations performed at a resolution of +/- two residues, the predictors we trained with the random forest algorithm were our top performing methods with average top 10 prediction accuracy scores of 85.13% (short range), 74.49% (medium range), and 54.49% (long range). Our ensemble models (stacked denoising autoencoders combined with support vector machines) were our best performing deep network predictors and achieved top 10 prediction accuracy scores of 75.51% (short range), 60.26% (medium range), and 43.85% (long range) using the same evaluation. These tests were blindly performed on targets from the CASP11 dataset; and the results suggested that our models achieved comparable performance to contact predictors developed by groups that participated in CASP11.
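One way to read "top 10 prediction accuracy at a resolution of +/- two residues" is sketched below: the ten highest-scoring predicted pairs are checked against the true contacts, and a prediction counts as correct if a true contact lies within two residues of it at both positions. That reading, along with the function name and the toy data, is an assumption for illustration and may differ from the evaluation actually used.

```python
def top_k_accuracy(predictions, true_contacts, k=10, tolerance=2):
    """predictions: list of (i, j, score); true_contacts: set of (i, j) pairs."""
    # Keep the k predictions with the highest scores.
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:k]
    if not ranked:
        return 0.0

    def is_hit(i, j):
        # Correct if some true contact is within `tolerance` residues at both ends.
        return any(abs(i - a) <= tolerance and abs(j - b) <= tolerance
                   for a, b in true_contacts)

    hits = sum(1 for i, j, _ in ranked if is_hit(i, j))
    return hits / len(ranked)

# Hypothetical predictions and native contacts.
predicted = [(1, 10, 0.9), (3, 20, 0.8), (5, 40, 0.7)]
native = {(2, 12), (30, 50)}
print(top_k_accuracy(predicted, native, k=2))
```

Setting `tolerance=0` reduces this to exact-match accuracy.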

Conclusions

Due to the challenging nature of contact prediction, it is beneficial to develop and benchmark a variety of different prediction methods. Our work has produced useful tools with a simple interface that can provide contact predictions to users without requiring a lengthy installation process. In addition to this, we have released our C++ implementation of the direct-coupling analysis method as a standalone software package. Both this tool and our RFcon web server are freely available to the public at http://dna.cs.miami.edu/RFcon.

by Joseph Luttrell IV, Tong Liu, Chaoyang Zhang and Zheng Wang*

View in journal's website Access software

SCL: a lattice-based approach to infer three-dimensional chromosome structures from single-cell Hi-C data

Motivation

In contrast to population-based Hi-C data, single-cell Hi-C data are zero-inflated and do not indicate the frequency of proximate DNA segments. There are a limited number of computational tools that can model the three-dimensional structures of chromosomes based on single-cell Hi-C data.

Results

We developed SCL (Single-Cell Lattice), a computational method to reconstruct three-dimensional (3D) structures of chromosomes based on single-cell Hi-C data. We designed a loss function and a 2D Gaussian function specifically for the characteristics of single-cell Hi-C data. A chromosome is represented as beads-on-a-string and stored in a 3D cubic lattice. Metropolis-Hastings simulation and simulated annealing are used to simulate the structure and minimize the loss function. We evaluated the SCL-inferred 3D structures (at both 500 kb and 50 kb resolutions) using multiple criteria and compared them with the ones generated by another modeling software program. The results indicate that the 3D structures generated by SCL closely fit single-cell Hi-C data. We also found similar patterns of trans-chromosomal contact beads, Lamin-B1 enriched topological domains, and H3K4me3 enriched domains by mapping data from previous studies onto the SCL-inferred 3D structures.
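The combination of Metropolis-Hastings moves and simulated annealing on a cubic lattice can be illustrated with a toy sketch. This is not the SCL implementation: the loss function (sum of squared distances between contacting beads), the move set, the cooling schedule, and all names below are assumptions chosen for illustration.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loss(coords, contacts):
    # Hypothetical loss: bead pairs with a contact should end up close together.
    return sum(sq_dist(coords[i], coords[j]) for i, j in contacts)

def chain_intact(coords, i):
    # Neighbours on the string must occupy adjacent lattice cells
    # (squared distance 1, 2, or 3, so diagonals are allowed).
    for j in (i - 1, i + 1):
        if 0 <= j < len(coords) and not 0 < sq_dist(coords[i], coords[j]) <= 3:
            return False
    return True

def anneal(n_beads, contacts, steps=5000, t_start=5.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    coords = [(i, 0, 0) for i in range(n_beads)]  # start as a straight line
    current = loss(coords, contacts)
    best, best_loss = list(coords), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n_beads)
        axis, shift = rng.randrange(3), rng.choice((-1, 1))
        old = coords[i]
        new = tuple(c + shift if k == axis else c for k, c in enumerate(old))
        if new in coords:  # at most one bead per lattice site
            continue
        coords[i] = new
        if not chain_intact(coords, i):
            coords[i] = old
            continue
        candidate = loss(coords, contacts)
        delta = candidate - current
        # Metropolis criterion: accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
            if current < best_loss:
                best, best_loss = list(coords), current
        else:
            coords[i] = old
    return best, best_loss

# Ten beads with two hypothetical contacts between distant beads.
structure, final_loss = anneal(10, [(0, 9), (2, 7)])
print(final_loss)
```

Accepting some worse moves at high temperature lets the chain escape local minima; as the temperature falls, the search settles into low-loss conformations.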

Availability

The C++ source code of SCL is freely available at http://dna.cs.miami.edu/SCL/.

by Hao Zhu and Zheng Wang*

View in journal's website Access software

Reconstructing high-resolution chromosome three-dimensional structures by Hi-C complex networks

Background

Hi-C data have been widely used to reconstruct chromosomal three-dimensional (3D) structures. One of the key limitations of Hi-C is the unclear relationship between spatial distance and the number of Hi-C contacts. Many methods used a fixed parameter when converting the number of Hi-C contacts to wish distances. However, a single parameter cannot properly explain the relationship between wish distances and genomic distances or the locations of topologically associating domains (TADs).
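The fixed-parameter conversion mentioned above is commonly written as an inverse power law, d = 1 / c^alpha, where c is the number of Hi-C contacts between two beads. The sketch below shows that conventional conversion only; the function name and the example values of alpha are illustrative, and this is the baseline the paper argues against, not the HiCNet method.

```python
def contacts_to_wish_distance(contact_count, alpha=1.0):
    """Convert a Hi-C contact count to a wish distance with one fixed exponent."""
    if contact_count <= 0:
        return None  # zero contacts give no distance estimate under this rule
    return 1.0 / (contact_count ** alpha)

# The same contact count maps to different distances depending on alpha,
# which is why the choice of a single fixed value matters.
for alpha in (0.5, 1.0):
    print(alpha, contacts_to_wish_distance(4, alpha))
```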

Results

We have addressed one of the key issues of using Hi-C data, that is, the unclear relationship between spatial distances and the number of Hi-C contacts, which is crucial to understand significant biological functions, such as the enhancer-promoter interactions. Specifically, we developed a new method to infer this converting parameter and pairwise Euclidean distances based on the topology of the Hi-C complex network (HiCNet). The inferred distances were modeled by clustering coefficient and multiple other types of constraints. We found that our inferred distances between bead-pairs within the same TAD were apparently smaller than those distances between bead-pairs from different TADs. Our inferred distances had a higher correlation with fluorescence in situ hybridization (FISH) data, fitted the localization patterns of Xist transcripts on DNA, and better matched 156 pairs of protein-enabled long-range chromatin interactions detected by ChIA-PET. Using the inferred distances and another round of optimization, we further reconstructed 40 kb high-resolution 3D chromosomal structures of mouse male ES cells. The high-resolution structures successfully illustrate TADs and DNA loops (peaks in Hi-C contact heatmaps) that usually indicate enhancer-promoter interactions.

Conclusions

We developed a novel method to infer the wish distances between DNA bead-pairs from Hi-C contacts. High-resolution 3D structures of chromosomes were built based on the newly-inferred wish distances. This whole process has been implemented as a tool named HiCNet, which is publicly available at http://dna.cs.miami.edu/HiCNet/.

by Tong Liu and Zheng Wang*

View in journal's website Access software

GOGO: An improved algorithm to measure the semantic similarity between gene ontology terms

Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented a software tool named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik's and Wang's methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus but still has the advantage of using IC. This is achieved by considering the number of children nodes in the GO directed acyclic graphs when calculating the semantic contribution that an ancestor node gives to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution of the tool for a small number of GO terms or integration of the tool into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.

by Chenguang Zhao and Zheng Wang*

View in journal's website Access software

SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity

Background

The segment overlap score (SOV) has been used to evaluate the predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV's advantage is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments instead of only judging from the percentage of overlapping individual positions as Q3 score does. However, we have found a drawback from its previous definition, that is, it cannot ensure increasing allowance assignment when more residues in a segment are further predicted accurately.
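The limitation of the Q3 score noted above can be seen in a short sketch. Q3 is the percentage of positions where the predicted and reference states agree, so it cannot tell a prediction with one continuous helix from one with a fragmented helix. The example sequences are hypothetical, and the sketch does not implement SOV itself.

```python
def q3_score(predicted, reference):
    """Percentage of positions with identical secondary structure states."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)

reference = "CCHHHHHHCC"
continuous = "CCHHHHCCCC"   # one helix segment, two residues too short
fragmented = "CCHCHHCHCC"   # helix broken into three pieces

# Both predictions get the same Q3 although the second breaks the helix apart.
print(q3_score(continuous, reference), q3_score(fragmented, reference))
```

A segment-based measure such as SOV is designed to separate these two cases by rewarding longer continuous overlaps.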

Results

A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV has achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating its better abilities to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures respectively). We provided two further example applications: using the score as a machine learning feature for protein model quality assessment, and comparing different definitions of topologically associating domains. We showed that our newly defined SOV score resulted in better performance.

Conclusions

The SOV score can be widely used in bioinformatics research and other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that it can work for sequences composed of more than three states (e.g., it can work for the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.

by Tong Liu and Zheng Wang*

View in journal's website Access software

PANDA: Protein function prediction using domain architecture and affinity propagation

We developed PANDA (Propagation of Affinity and Domain Architecture) to predict protein functions in the format of Gene Ontology (GO) terms. PANDA first executes a profile-profile alignment algorithm to search against the PfamA, KOG, COG, and SwissProt databases, and then launches PSI-BLAST against UniProt for homologue search. PANDA integrates a domain architecture inference algorithm based on Bayesian statistics that calculates the probability of having a GO term. All the candidate GO terms are pooled and filtered based on Z-score. After that, the remaining GO terms are clustered using an affinity propagation algorithm based on the GO directed acyclic graph, followed by a second round of filtering on the clusters of GO terms. We benchmarked the performance of all the baseline predictors PANDA integrates and also of every pooling and filtering step of PANDA. We found that PANDA achieves better performance in terms of area under the precision-recall curve compared to the baseline predictors. PANDA can be accessed from http://dna.cs.miami.edu/PANDA/.

by Zheng Wang*, Chenguang Zhao, Yiheng Wang, Zheng Sun and Nan Wang

View in journal's website Access software

scHiCNorm: a software package to eliminate systematic biases in single-cell Hi-C data

Summary

We build a software package scHiCNorm that uses zero-inflated and hurdle models to remove biases from single-cell Hi-C data. Our evaluations prove that our models can effectively eliminate systematic biases for single-cell Hi-C data, which better reveal cell-to-cell variances in terms of chromosomal structures.

Availability and implementation

scHiCNorm is available at http://dna.cs.miami.edu/scHiCNorm/. Perl scripts are provided that can generate bias features. Pre-built bias features for human (hg19 and hg38) and mouse (mm9 and mm10) are available to download. R scripts can be downloaded to remove biases.

by Tong Liu and Zheng Wang

View in journal's website

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Background

A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging.

Results

We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2.

Conclusions

The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.

by Yuxiang Jiang et al.

View in journal's website

Long non-coding RNAs as prognostic markers in human breast cancer

Long non-coding RNAs (lncRNAs) have recently been shown to play an important role in gene regulation, normal cellular functions, and disease processes. However, despite the overwhelming number of lncRNAs identified to date, little is known about the role in cancer of the vast majority of them. The present study aims to determine whether lncRNAs can serve as prognostic markers in human breast cancer. We interrogated the breast invasive carcinoma dataset of the Cancer Genome Atlas (TCGA) at the cBioPortal consisting of ~1,000 cases. Among 2,730 lncRNAs analyzed, 577 lncRNAs had alterations ranging from 1% to 32% frequency, which include mutations, alterations of copy number and RNA expression. We found that deregulation of 11 lncRNAs, primarily due to copy number alteration, is associated with poor overall survival. At RNA expression level, upregulation of 4 lncRNAs (LINC00657, LINC00346, LINC00654 and HCG11) was associated with poor overall survival. A third signature consists of 9 lncRNAs (LINC00705, LINC00310, LINC00704, LINC00574, FAM74A3, UMODL1-AS1, ARRDC1-AS1, HAR1A, and LINC00323) and their upregulation can predict recurrence. Finally, we selected LINC00657 to determine its role in breast cancer, and found that LINC00657 knockout significantly suppresses tumor cell growth and proliferation, suggesting that it plays an oncogenic role. Together, these results highlight the clinical significance of lncRNAs, and thus, these lncRNAs may serve as prognostic markers for breast cancer.

by Hairong Liu et al.

View in journal's website

Benchmarking deep networks for predicting residue-specific quality of individual protein models in CASP11

Quality assessment of a protein model is to predict the absolute or relative quality of a protein model using computational methods before the native structure is available. Single-model methods only need one model as input and can predict the absolute residue-specific quality of an individual model. Here, we have developed four novel single-model methods (Wang_deep_1, Wang_deep_2, Wang_deep_3, and Wang_SVM) based on stacked denoising autoencoders (SdAs) and support vector machines (SVMs). We evaluated these four methods along with six other methods participating in CASP11 at the global and local levels using Pearson's correlation coefficients and ROC analysis. As for residue-specific quality assessment, our four methods achieved better performance than most of the six other CASP11 methods in distinguishing the reliably modeled residues from the unreliable measured by ROC analysis; and our SdA-based method Wang_deep_1 has achieved the highest accuracy, 0.77, compared to SVM-based methods and our ensemble of an SVM and SdAs. However, we found that Wang_deep_2 and Wang_deep_3, both based on an ensemble of multiple SdAs and an SVM, performed slightly better than Wang_deep_1 in terms of ROC analysis, indicating that integrating an SVM with deep networks works well in terms of certain measurements.

by Tong Liu et al.

View in journal's website Access software

Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks

The hypo- or hyper-methylation of the human genome is one of the epigenetic features of leukemia. However, experimental approaches have only determined the methylation state of a small portion of the human genome. We developed deep learning based (stacked denoising autoencoders, or SdAs) software named DeepMethyl to predict the methylation state of DNA CpG dinucleotides using features inferred from three-dimensional genome topology (based on Hi-C) and DNA sequence patterns. We used the experimental data from immortalised myelogenous leukemia (K562) and healthy lymphoblastoid (GM12878) cell lines to train the learning models and assess prediction performance. We have tested various SdA architectures with different configurations of hidden layer(s) and amount of pre-training data and compared the performance of deep networks relative to support vector machines (SVMs). Using the methylation states of sequentially neighboring regions as one of the learning features, an SdA achieved a blind test accuracy of 89.7% for GM12878 and 88.6% for K562. When the methylation states of sequentially neighboring regions are unknown, the accuracies are 84.82% for GM12878 and 72.01% for K562. We also analyzed the contribution of genome topological features inferred from Hi-C. DeepMethyl can be accessed at http://dna.cs.usm.edu/deepmethyl/.

by Yiheng Wang et al.

View in journal's website Access software

PCP-ML: Protein characterization package for machine learning

Background

Machine Learning (ML) has a number of demonstrated applications in protein prediction tasks such as protein structure prediction. To speed further development of machine learning based tools and their release to the community, we have developed a package which characterizes several aspects of a protein commonly used for protein prediction tasks with machine learning.

Findings

A number of software libraries and modules exist for handling protein related data. The package we present in this work, PCP-ML, is unique in its small footprint and emphasis on machine learning. Its primary focus is on characterizing various aspects of a protein through sets of numerical data. The generated data can then be used with machine learning tools and/or techniques. PCP-ML is very flexible in how the generated data is formatted and as a result is compatible with a variety of existing machine learning packages. Given its small size, it can be directly packaged and distributed with community developed tools for protein prediction tasks.

Conclusions

Source code and example programs are available under a BSD license at http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. The package is implemented in C++ and accessible as a Python module.

by Yiheng Wang et al.

View in journal's website Access software

SMOQ: a tool for predicting the absolute residue-specific quality of a single protein model with support vector machines

Background

It is important to predict the quality of a protein structural model before its native structure is known. The method that can predict the absolute local quality of individual residues in a single protein model is rare, yet particularly needed for using, ranking and refining protein models.

Results

We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts to make predictions. We also trained an SVM model with two new additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them can only improve the performance when real deviations between native and model are higher than 5 Å. The SMOQ tool finally released uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implemented a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637 Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark.

Conclusions

SMOQ is a useful tool for protein single model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.

by Renzhi Cao et al.

View in journal's website

Designing and evaluating the MULTICOM protein local and global model quality prediction methods in the CASP10 experiment

Background

Protein model quality assessment is an essential component of generating and using protein structural models. During the Tenth Critical Assessment of Techniques for Protein Structure Prediction (CASP10), we developed and tested four automated methods (MULTICOM-REFINE, MULTICOM-CLUSTER, MULTICOM-NOVEL, and MULTICOM-CONSTRUCT) that predicted both local and global quality of protein structural models.

Results

MULTICOM-REFINE was a clustering approach that used the average pairwise structural similarity between models to measure the global quality and the average Euclidean distance between a model and several top ranked models to measure the local quality. MULTICOM-CLUSTER and MULTICOM-NOVEL were two new support vector machine-based methods of predicting both the local and global quality of a single protein model. MULTICOM-CONSTRUCT was a new weighted pairwise model comparison (clustering) method that used the weighted average similarity between models in a pool to measure the global model quality. Our experiments showed that the pairwise model assessment methods worked better when a large portion of models in the pool were of good quality, whereas single-model quality assessment methods performed better on some hard targets when only a small portion of models in the pool were of reasonable quality.

Conclusions

Since digging out a few good models from a large pool of low-quality models is a major challenge in protein structure prediction, single model quality assessment methods appear to be poised to make important contributions to protein structure modeling. The other interesting finding was that single-model quality assessment scores could be used to weight the models by the consensus pairwise model comparison method to improve its accuracy.

by Renzhi Cao et al.

View in journal's website

Aberrant Epigenetic Gene Regulation in Lymphoid Malignancies

In lymphoid malignancies, aberrant epigenetic mechanisms such as DNA methylation and histone modifications influence chromatin architecture and can result in altered gene expression. These alterations commonly affect genes that play important roles in the cell cycle, apoptosis, and DNA repair in non-Hodgkin lymphoma (NHL). The ability to identify epigenetic modifications to these important genes has increased exponentially due to advances in technology. As a result, there are well-defined, gene-specific epigenetic aberrations associated with NHL comprising follicular lymphoma (FL), mantle cell lymphoma (MCL), chronic lymphocytic leukemia (CLL), and diffuse large B-cell lymphoma (DLBCL). The identification of these genes is important because they may be used as biomarkers for prognosis, diagnosis and in developing improved treatment strategies. Also important, in the control of gene expression, is the packaging of DNA within the nucleus of a cell. This packaging can be distorted by epigenetic alterations and may alter the accessibility of certain regions of the genome in cancer cells. This review discusses the impact of known epigenetic aberration on the regulation of gene expression in NHL and provides insight into the spatial conformation of the genome (DNA packaging) in acute lymphoblastic leukemia.

by Kristen H. Taylor et al.

View in journal's website

The Properties of Genome Conformation and Spatial Gene Interaction and Regulation Networks of Normal and Malignant Human Cell Types

The spatial conformation of a genome plays an important role in the long-range regulation of genome-wide gene expression and methylation, but has not been extensively studied due to lack of genome conformation data. The recently developed chromosome conformation capturing techniques such as the Hi-C method empowered by next generation sequencing can generate unbiased, large-scale, high-resolution chromosomal interaction (contact) data, providing an unprecedented opportunity to investigate the spatial structure of a genome and its applications in gene regulation, genomics, epigenetics, and cell biology. In this work, we conducted a comprehensive, large-scale computational analysis of this new stream of genome conformation data generated for three different human leukemia cells or cell lines by the Hi-C technique. We developed and applied a set of bioinformatics methods to reliably generate spatial chromosomal contacts from high-throughput sequencing data and to effectively use them to study the properties of the genome structures in one-dimension (1D) and two-dimension (2D). Our analysis demonstrates that Hi-C data can be effectively applied to study tissue-specific genome conformation, chromosome-chromosome interaction, chromosomal translocations, and spatial gene-gene interaction and regulation in a three-dimensional genome of primary tumor cells. Particularly, for the first time, we constructed genome-scale spatial gene-gene interaction network, transcription factor binding site (TFBS) – TFBS interaction network, and TFBS-gene interaction network from chromosomal contact information. Remarkably, all these networks possess the properties of scale-free modular networks.

by Zheng Wang et al.

View in journal's website

A large-scale evaluation of computational protein function prediction

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

by Predrag Radivojac et al.

View in journal's website

Three-Level Prediction of Protein Function by Combining Profile-Sequence Search, Profile-Profile Search, and Domain Co-Occurrence Networks

Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. Particularly, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).

by Zheng Wang et al.

View in journal's website

RECURSIVE PROTEIN MODELING: A DIVIDE AND CONQUER STRATEGY FOR PROTEIN STRUCTURE PREDICTION AND ITS CASE STUDY IN CASP9

After decades of research, protein structure prediction remains a very challenging problem. In order to address the different levels of complexity of structural modeling, two types of modeling techniques, template-based modeling and template-free modeling, have been developed. Template-based modeling can often generate a moderate- to high-resolution model when a similar, homologous template structure is found for a query protein but fails if no template or only incorrect templates are found. Template-free modeling, such as fragment-based assembly, may generate models of moderate resolution for small proteins of low topological complexity. Seldom have the two techniques been integrated together to improve protein modeling. Here we develop a recursive protein modeling approach to selectively and collaboratively apply template-based and template-free modeling methods to model template-covered (i.e. certain) and template-free (i.e. uncertain) regions of a protein. A preliminary implementation of the approach was tested on a number of hard modeling cases during the 9th Critical Assessment of Techniques for Protein Structure Prediction (CASP9) and successfully improved the quality of modeling in most of these cases. Recursive modeling can significantly reduce the complexity of protein structure modeling and integrate template-based and template-free modeling to improve the quality and efficiency of protein structure prediction.

by Jianlin Cheng et al.

View in journal's website Access software

The MULTICOM toolbox for protein structure prediction

Background

As genome sequencing is becoming routine in biomedical research, the total number of protein sequences is increasing exponentially, recently reaching over 108 million. However, only a tiny portion of these proteins (i.e. ~75,000 or less than 0.07%) have solved tertiary structures determined by experimental techniques. The gap between protein sequence and structure continues to enlarge rapidly as the throughput of genome sequencing techniques is much higher than that of protein structure determination techniques. Computational software tools for predicting protein structure and structural features from protein sequences are crucial to make use of this vast repository of protein resources.

Results

To meet the need, we have developed a comprehensive MULTICOM toolbox consisting of a set of protein structure and structural feature prediction tools. These tools include secondary structure prediction, solvent accessibility prediction, disorder region prediction, domain boundary prediction, contact map prediction, disulfide bond prediction, beta-sheet topology prediction, fold recognition, multiple template combination and alignment, template-based tertiary structure modeling, protein model quality assessment, and mutation stability prediction.

Conclusions

These tools have been rigorously tested by many users in the last several years and/or during the last three rounds of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7-9) from 2006 to 2010, achieving state-of-the-art or near performance. In order to facilitate bioinformatics research and technological development in the field, we have made the MULTICOM toolbox freely available as web services and/or software packages for academic use and scientific research. It is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.

by Jianlin Cheng et al.

View in journal's website

Evolutionary dynamics of protein domain architecture in plants

Background

Protein domains are the structural, functional and evolutionary units of the protein. Protein domain architectures are the linear arrangements of domain(s) in individual proteins. Although the evolutionary history of protein domain architecture has been extensively studied in microorganisms, the evolutionary dynamics of domain architecture in the plant kingdom remains largely undefined. To address this question, we analyzed the lineage-based protein domain architecture content in 14 completed green plant genomes.

Results

Our analyses show that all 14 plant genomes maintain similar distributions of species-specific, single-domain, and multi-domain architectures. Approximately 65% of plant domain architectures are universally present in all plant lineages, while the remaining architectures are lineage-specific. Clear examples are seen of both the loss and gain of specific protein architectures in higher plants. There has been a dynamic, lineage-wise expansion of domain architectures during plant evolution. The data suggest that this expansion can be largely explained by changes in nuclear ploidy resulting from rounds of whole genome duplications. Indeed, there has been a decrease in the number of unique domain architectures when the genomes were normalized into a presumed ancestral genome that has not undergone whole genome duplications.

Conclusions

Our data show the conservation of universal domain architectures in all available plant genomes, indicating the presence of an evolutionarily conserved, core set of protein components. However, the occurrence of lineage-specific domain architectures indicates that domain architecture diversity has been maintained beyond these core components in plant genomes. Although several features of genome-wide domain architecture content are conserved in plants, the data clearly demonstrate lineage-wise, progressive changes and expansions of individual protein domain architectures, reinforcing the notion that plant genomes have undergone dynamic evolution.

by Xuecheng Zhang et al.

View in journal's website

An iterative self-refining and self-evaluating approach for protein model quality estimation

Evaluating or predicting the quality of protein models (i.e., predicted protein tertiary structures) without knowing their native structures is important for selecting and appropriately using protein models. We describe an iterative approach that improves the performances of protein Model Quality Assurance Programs (MQAPs). Given the initial quality scores of a list of models assigned by a MQAP, the method iteratively refines the scores until the ranking of the models does not change. We applied the method to the model quality assessment data generated by 30 MQAPs during the Eighth Critical Assessment of Techniques for Protein Structure Prediction. To various degrees, our method increased the average correlation between predicted and real quality scores of 25 out of 30 MQAPs and reduced the average loss (i.e., the difference between the top ranked model and the best model) for 28 MQAPs. Particularly, for MQAPs with low average correlations (less than 0.4), the correlation can be increased by several times. Similar experiments conducted on the CASP9 MQAPs also demonstrated the effectiveness of the method. Our method is a hybrid method that combines the original method of a MQAP and the pair-wise comparison clustering method. It can achieve a high accuracy similar to a full pair-wise clustering method, but with much less computation time when evaluating hundreds of models. Furthermore, without knowing native structures, the iterative refining method can evaluate the performance of a MQAP by analyzing its model quality predictions.
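
The iterate-until-the-ranking-is-stable idea can be illustrated with a minimal Python sketch. The abstract does not give the update rule, so the one used here is an assumption: each model's new score is the average of its similarity to every other model, weighted by those models' current scores. The names `refine_scores`, `rank_order`, and the toy similarity matrix are hypothetical and not taken from the published software.

```python
def rank_order(scores):
    """Return model indices sorted from best to worst (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


def refine_scores(initial_scores, similarity, max_iter=100):
    """Refine non-negative quality scores until the model ranking stops changing.

    initial_scores: one score per model, e.g. from a hypothetical MQAP.
    similarity: square matrix; similarity[i][j] in [0, 1] is the structural
        similarity between models i and j (how it is computed is left open).
    Assumed update rule (not from the paper): new score of model i is the
    score-weighted average of its similarity to all other models.
    Returns (refined_scores, iterations_used).
    """
    scores = list(initial_scores)
    n = len(scores)
    for iteration in range(1, max_iter + 1):
        new_scores = []
        for i in range(n):
            weight_sum = sum(scores[j] for j in range(n) if j != i)
            if weight_sum == 0:
                # No informative reference models: keep the current score.
                new_scores.append(scores[i])
                continue
            weighted = sum(scores[j] * similarity[i][j] for j in range(n) if j != i)
            new_scores.append(weighted / weight_sum)
        converged = rank_order(new_scores) == rank_order(scores)
        scores = new_scores
        if converged:
            return scores, iteration
    return scores, max_iter


# Toy data: model 0 resembles both other models, model 2 is an outlier
# that the initial scores wrongly place first.
similarity = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]
initial = [0.3, 0.5, 0.8]
refined, iterations = refine_scores(initial, similarity)
print(rank_order(initial), rank_order(refined), iterations)
```

Stopping on a stable ranking rather than on a score tolerance follows the stopping condition stated in the abstract. The `max_iter` cap is an added safeguard, since this simplified update rule is not guaranteed to settle on every input.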

by Zheng Wang and Jianlin Cheng

View in journal's website

APOLLO: a quality assessment service for single and multiple protein models

Summary

We built a web server named APOLLO, which can evaluate the absolute global and local qualities of a single protein model using machine learning methods or the global and local qualities of a pool of models using a pair-wise comparison approach. Based on our evaluations on 107 CASP9 (Critical Assessment of Techniques for Protein Structure Prediction) targets, the predicted quality scores generated from our machine learning and pair-wise methods have an average per-target correlation of 0.671 and 0.917, respectively, with the true model quality scores. Based on our test on 92 CASP9 targets, our predicted absolute local qualities have an average difference of 2.60 Å with the actual distances to native structure.

Availability

http://sysbio.rnet.missouri.edu/apollo/. Single and pair-wise global quality assessment software is also available at the site.

by Zheng Wang et al.

View in journal's website

A protein domain co-occurrence network approach for predicting protein function and inferring species phylogeny

Protein Domain Co-occurrence Network (DCN) is a biological network that has not been fully studied. We analyzed the properties of the DCNs of H. sapiens, S. cerevisiae, C. elegans, D. melanogaster, and 15 plant genomes. These DCNs have the hallmark features of scale-free networks. We investigated the possibility of using DCNs to predict protein and domain functions. Based on our experiment conducted on 66 randomly selected proteins, the best of top 3 predictions made by our DCN-based aggregated neighbor-counting method achieved a semantic similarity score of 0.81 to the actual Gene Ontology terms of the proteins. Moreover, the top 3 predictions using neighbor-counting, χ², and an SVM-based method achieved an accuracy of 66%, 59%, and 61%, respectively, when used to predict specific Gene Ontology terms of human target domains. These predictions on average had a semantic similarity score of 0.82, 0.80, and 0.79 to the actual Gene Ontology terms, respectively. We also used DCNs to predict whether a domain is an enzyme domain, and our SVM-based and neighbor-inference method correctly classified 79% and 77% of the target domains, respectively. When using DCNs to classify a target domain into one of the six enzyme classes, we found that, as long as there is one EC number available in the neighboring domains, our SVM-based and neighboring-counting method correctly classified 92.4% and 91.9% of the target domains, respectively. Furthermore, we benchmarked the performance of using DCNs to infer species phylogenies on six different combinations of 398 single-chromosome prokaryotic genomes. The phylogenetic tree of 54 prokaryotic taxa generated by our DCNs-alignment-based method achieved a 93.45% similarity score compared to the Bergey's taxonomy. In summary, our studies show that genome-wide DCNs contain rich information that can be effectively used to decipher protein function and reveal the evolutionary relationship among species.
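
As a rough illustration of neighbor counting on a domain co-occurrence network, the Python sketch below tallies the Gene Ontology terms annotated to a target domain's neighbors and returns the three most frequent. It is a simplified reading of the idea, not the published aggregated method; the function name `predict_go_terms`, the domain identifiers, and the toy annotations are all hypothetical.

```python
from collections import Counter


def predict_go_terms(target, network, annotations, top_k=3):
    """Rank GO terms for `target` by how many of its DCN neighbors carry them.

    network: dict mapping a domain ID to the domains it co-occurs with.
    annotations: dict mapping a domain ID to its known GO term IDs.
    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter()
    for neighbor in network.get(target, ()):
        # set() ensures each neighbor votes at most once per term.
        counts.update(set(annotations.get(neighbor, ())))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


# Toy network: one unannotated target domain with three annotated neighbors.
network = {"D_target": ["D1", "D2", "D3"]}
annotations = {
    "D1": ["GO:0005524", "GO:0004672"],
    "D2": ["GO:0005524"],
    "D3": ["GO:0003677", "GO:0005524", "GO:0004672"],
}
predictions = predict_go_terms("D_target", network, annotations)
print(predictions)
```

A production method would likely weight neighbors (for example by co-occurrence frequency) rather than count them equally; equal votes are used here only to keep the sketch short.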

by Zheng Wang et al.

View in journal's website Access software

Soybean Knowledge Base (SoyKB): a web resource for soybean translational genomics

Background

Soybean Knowledge Base (SoyKB) is a comprehensive all-inclusive web resource for soybean translational genomics. SoyKB is designed to handle the management and integration of soybean genomics, transcriptomics, proteomics and metabolomics data along with annotation of gene function and biological pathway. It contains information on four entities, namely genes, microRNAs, metabolites and single nucleotide polymorphisms (SNPs).

Methods

SoyKB has many useful tools such as Affymetrix probe ID search, gene family search, multiple gene/metabolite search supporting co-expression analysis, and protein 3D structure viewer as well as download and upload capacity for experimental data and annotations. It has four tiers of registration, which control different levels of access to public and private data. It allows users of certain levels to share their expertise by adding comments to the data. It has a user-friendly web interface together with genome browser and pathway viewer, which display data in an intuitive manner to the soybean researchers, producers and consumers.

Conclusions

SoyKB addresses the increasing need of the soybean research community to have a one-stop-shop functional and translational omics web resource for information retrieval and analysis in a user-friendly way. SoyKB can be publicly accessed at http://soykb.org/.

by Zheng Wang et al.

View in journal's website

A conformation ensemble approach to protein residue-residue contact

Background

Protein residue-residue contact prediction is important for protein model generation and model evaluation. Here we develop a conformation ensemble approach to improve residue-residue contact prediction. We collect a number of structural models stemming from a variety of methods and implementations. The various models capture slightly different conformations and contain complementary information which can be pooled together to capture recurrent, and therefore more likely, residue-residue contacts.

Methods

We applied our conformation ensemble approach to free modeling targets from both CASP8 and CASP9. Given a diverse ensemble of models, the method is able to achieve accuracies of 0.48 for the top L/5 medium range contacts and 0.36 for the top L/5 long range contacts for CASP8 targets (L being the target domain length). When applied to targets from CASP9, the accuracies of the top L/5 medium and long range contact predictions were 0.34 and 0.30, respectively.

Conclusions

When operating on a moderately diverse ensemble of models, the conformation ensemble approach is an effective means to identify medium and long range residue-residue contacts. An immediate benefit of the method is that when tied with a scoring scheme, it can be used to successfully rank models.
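
The "top L/5" accuracy reported in the Methods can be computed as in the following Python sketch. The separation ranges are an assumption based on the commonly used CASP convention (medium range: 12 to 23 residues apart; long range: 24 or more), since the abstract does not define them. The function name and the toy predictions are hypothetical.

```python
def top_l5_accuracy(predicted, true_contacts, length, min_sep, max_sep=None):
    """Fraction of the top L/5 predicted contacts that are true contacts.

    predicted: list of (i, j, score) tuples with residue indices i < j.
    true_contacts: set of (i, j) pairs in contact in the native structure.
    length: L, the target domain length.
    min_sep, max_sep: inclusive bounds on sequence separation j - i
        (max_sep=None means no upper bound).
    If fewer than L/5 predictions fall in the range, accuracy is computed
    over the predictions that are available.
    """
    in_range = [
        (i, j, score)
        for i, j, score in predicted
        if j - i >= min_sep and (max_sep is None or j - i <= max_sep)
    ]
    in_range.sort(key=lambda contact: contact[2], reverse=True)
    top = in_range[: max(1, length // 5)]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (i, j) in true_contacts)
    return hits / len(top)


# Toy example for a 25-residue domain, so L/5 = 5.
predicted = [
    (1, 14, 0.9), (2, 16, 0.8), (3, 20, 0.7),
    (5, 18, 0.6), (4, 17, 0.5), (6, 19, 0.4),
    (1, 25, 0.95),
]
true_contacts = {(1, 14), (3, 20), (4, 17), (6, 19), (1, 25)}
medium = top_l5_accuracy(predicted, true_contacts, 25, min_sep=12, max_sep=23)
long_range = top_l5_accuracy(predicted, true_contacts, 25, min_sep=24)
print(medium, long_range)
```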

by Jesse Eickholt et al.

View in journal's website Access software

MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8

Motivation

Protein structure prediction is one of the most important problems in structural bioinformatics. Here we describe MULTICOM, a multi-level combination approach to improve the various steps in protein structure prediction. In contrast to those methods which look for the best templates, alignments and models, our approach tries to combine complementary and alternative templates, alignments and models to achieve on average better accuracy.

Results

The multi-level combination approach was implemented via five automated protein structure prediction servers and one human predictor which participated in the eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. The MULTICOM servers and human predictor were consistently ranked among the top predictors on the CASP8 benchmark. The methods can predict moderate- to high-resolution models for most template-based targets and low-resolution models for some template-free targets. The results show that the multi-level combination of complementary templates, alternative alignments and similar models aided by model quality assessment can systematically improve both template-based and template-free protein modeling.

Availability

The MULTICOM server is freely available at http://casp.rnet.missouri.edu/multicom_3d.html

by Zheng Wang et al.

View in journal's website

SeqRate: sequence-based protein folding type classification and rates prediction

Background

Protein folding rate is an important property of a protein. Predicting protein folding rate is useful for understanding the protein folding process and guiding protein design. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input, and most do not distinguish the different kinetic nature (two-state folding or multi-state folding) of the proteins. Here we developed a method, SeqRate, to predict both protein folding kinetic type (two-state versus multi-state) and real-value folding rate using sequence length, amino acid composition, contact order, contact number, and secondary structure information predicted from only protein sequence with support vector machines.

Results

We systematically studied the contributions of individual features to folding rate prediction. On a standard benchmark dataset, the accuracy of folding kinetic type classification is 80%. The Pearson correlation coefficient and the mean absolute difference between predicted and experimental folding rates (sec⁻¹) in the base-10 logarithmic scale are 0.81 and 0.79 for two-state protein folders, and 0.80 and 0.68 for three-state protein folders. SeqRate is the first sequence-based method for protein folding type classification and its accuracy of fold rate prediction is improved over previous sequence-based methods. Its performance can be further enhanced with additional information, such as structure-based geometric contacts, as inputs.

Conclusions

Both the web server and software of predicting folding rate are publicly available at http://casp.rnet.missouri.edu/fold_rate/index.html.

by Guan Ning Lin et al.

View in journal's website

SoyDB: a knowledge database of soybean transcription factors

Background

Transcription factors play the crucial role of regulating gene expression and influence almost all biological processes. Systematically identifying and annotating transcription factors can greatly aid further understanding of their functions and mechanisms. In this article, we present SoyDB, a user-friendly database containing comprehensive knowledge of soybean transcription factors.

Description

The soybean genome was recently sequenced by the Department of Energy-Joint Genome Institute (DOE-JGI) and is publicly available. Mining of this sequence identified 5,671 soybean genes as putative transcription factors. These genes were comprehensively annotated as an aid to the soybean research community. We developed SoyDB - a knowledge database for all the transcription factors in the soybean genome. The database contains protein sequences, predicted tertiary structures, putative DNA binding sites, domains, homologous templates in the Protein Data Bank (PDB), protein family classifications, multiple sequence alignments, consensus protein sequence motifs, web logo of each family, and web links to the soybean transcription factor database PlantTFDB, known EST sequences, and other general protein databases including Swiss-Prot, Gene Ontology, KEGG, EMBL, TAIR, InterPro, SMART, PROSITE, NCBI, and Pfam. The database can be accessed via an interactive and convenient web server, which supports full-text search, PSI-BLAST sequence search, database browsing by protein family, and automatic classification of a new protein sequence into one of 64 annotated transcription factor families by hidden Markov models.

Conclusions

A comprehensive soybean transcription factor database was constructed and made publicly accessible at http://casp.rnet.missouri.edu/soydb/.

by Zheng Wang et al.

View in journal's website

Evaluating the absolute quality of a single protein model using structural features and support vector machines

Knowing the quality of a protein structure model is important for its appropriate usage. We developed a model evaluation method to assess the absolute quality of a single protein model using only structural features with support vector machine regression. The method assigns an absolute quantitative score (i.e. GDT-TS) to a model by comparing its secondary structure, relative solvent accessibility, contact map, and beta sheet structure with their counterparts predicted from its primary sequence. We trained and tested the method on the CASP6 dataset using cross-validation. The correlation between predicted and true scores is 0.82. On the independent CASP7 dataset, the correlation averaged over 95 protein targets is 0.76; the average correlation for template-based and ab initio targets is 0.82 and 0.50, respectively. Furthermore, the predicted absolute quality scores can be used to rank models effectively. The average difference (or loss) between the scores of the top-ranked models and the best models is 5.70 on the CASP7 targets. This method performs favorably when compared with the other methods used on the same dataset. Moreover, the predicted absolute quality scores are comparable across models for different proteins. These features make the method a valuable tool for model quality assurance and ranking.
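
The two evaluation measures quoted above, the correlation between predicted and true scores and the loss of the top-ranked model, can be computed per target as in this Python sketch. Loss is taken here to mean the true GDT-TS of the best model minus the true GDT-TS of the model ranked first by predicted score, which is one reading of the abstract. The function names and the toy scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranking_loss(predicted, true):
    """True score of the best model minus true score of the top-ranked model.

    The top-ranked model is the one with the highest predicted score;
    a loss of 0 means the predictor selected the best model in the pool.
    """
    top_ranked = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(true) - true[top_ranked]


# Toy pool of four models for one target (scores on the GDT-TS 0-100 scale).
predicted_scores = [70.0, 45.0, 60.0, 30.0]
true_scores = [55.0, 50.0, 62.0, 35.0]
correlation = pearson(predicted_scores, true_scores)
loss = ranking_loss(predicted_scores, true_scores)
print(correlation, loss)
```

Averaging these two per-target values over all targets yields dataset-level figures of the kind reported in the abstract.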

by Zheng Wang et al.

View in journal's website Access software

NNcon: improved protein contact map prediction using 2D-recursive neural networks

Protein contact map prediction is useful for protein folding rate prediction, model selection and 3D structure prediction. Here we describe NNcon, a fast and reliable contact map prediction server and software. NNcon was ranked among the most accurate residue contact predictors in the Eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. Both NNcon server and software are available at http://casp.rnet.missouri.edu/nncon.html.

by Allison N. Tegge et al.

View in journal's website

Prediction of global and local quality of CASP8 models by MULTICOM series

Evaluating the quality of protein structure models is important for selecting and using models. Here, we describe the MULTICOM series of model quality predictors, which contains three predictors tested in the CASP8 experiments. We evaluated these predictors on 120 CASP8 targets. The average correlations between predicted and real GDT-TS scores of the two semi-clustering methods (MULTICOM and MULTICOM-CLUSTER) and the one single-model ab initio method (MULTICOM-CMFR) are 0.90, 0.89, and 0.74, respectively, and the average differences (or GDT-TS losses) between the global GDT-TS scores of the top-ranked models and the best models are 0.05, 0.06, and 0.07, respectively. The average correlation between predicted and real local quality scores of the semi-clustering methods is above 0.64. Our results show that the novel semi-clustering approach, which compares a model with top-ranked reference models, can improve the initial quality scores generated by the ab initio method and a simple meta approach.

by Jianlin Cheng et al.

CONFERENCE PROCEEDINGS

View in conference's website Access Software ST-ChIP

ST-ChIP: Accurate prediction of spatiotemporal ChIP-seq data with recurrent neural networks

Abstract

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a powerful method for locating protein-DNA binding sites. Spatiotemporal ChIP-seq data greatly contribute to the studies of dynamic biological processes as they contain information from both spatial and temporal dimensions. However, computational methods for forecasting spatiotemporal ChIP-seq data are scarce in the literature. Here we present ST-ChIP, a supervised method using Long Short-Term Memory (LSTM) for predicting coverage or peaks of spatiotemporal ChIP-seq data. We benchmarked three recurrent neural networks and found that two of them achieved higher predictive performance in recovering the coverage or peaks of the forecast time steps. Our results demonstrate that enhancer regions are enriched with our predicted H3K4me1 coverage, and promoter regions are enriched with our predicted H3K4me3 peaks, which matches the findings from other studies. In summary, ST-ChIP is an effective method for accurately predicting spatiotemporal ChIP-seq data. ST-ChIP is publicly available at http://dna.cs.miami.edu/ST-ChIP/.

by Tong Liu and Zheng Wang*

CITATIONS

SPONSORS

The following sponsors have provided funding to support our research: