Parallel and Scalable Precise Clustering

Byma, Stuart; Dhasade, Akash; Altenhoff, Adrian; Dessimoz, Christophe; Larus, James R.

doi:10.1145/3410463.3414646

2020

Abstract

This paper describes a new technique for parallelizing protein clustering, an important bioinformatics computation for the analysis of protein sequences. Protein clustering identifies groups of proteins that are similar because they share long sequences of similar amino acids. Given a collection of protein sequences, clustering can significantly reduce the computational effort required to identify all similar sequences by avoiding many negative comparisons. The challenge, however, is to build a clustering that misses as few similar sequences (or elements, more generally) as possible.

In this paper, we introduce precise clustering, a property that requires each pair of similar elements to appear together in at least one cluster. We show that transitivity in the data can be leveraged to merge clusters while maintaining a precise clustering, providing a basis for independently forming clusters. This allows us to reformulate clustering as a bottom-up merge of independent clusters in a new algorithm called ClusterMerge. ClusterMerge exposes parallelism, enabling fast and scalable implementations.
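The precise clustering property defined above can be illustrated with a small check. The following is a minimal sketch, not the paper's ClusterMerge algorithm: the function names `is_precise` and `toy_similar`, the prefix-based similarity rule, and the sample strings are all hypothetical and chosen only to make the definition concrete.

```python
from itertools import combinations


def toy_similar(a, b):
    # Hypothetical similarity rule for illustration only:
    # two strings are "similar" if their first three characters match.
    return a[:3] == b[:3]


def is_precise(elements, clusters, similar):
    """Return True if every similar pair appears together in at least one cluster."""
    for a, b in combinations(elements, 2):
        if similar(a, b) and not any(a in c and b in c for c in clusters):
            return False
    return True


# Hypothetical sample data (not real protein sequences).
elements = ["ABCD", "ABCE", "XYZ1", "XYZ2", "QQQ"]

# Every similar pair shares a cluster, so this clustering is precise.
clusters_ok = [{"ABCD", "ABCE"}, {"XYZ1", "XYZ2"}, {"QQQ"}]

# "ABCD" and "ABCE" are similar but never share a cluster, so this one is not.
clusters_bad = [{"ABCD"}, {"ABCE", "XYZ1", "XYZ2"}, {"QQQ"}]

result_ok = is_precise(elements, clusters_ok, toy_similar)
result_bad = is_precise(elements, clusters_bad, toy_similar)
```

Because the property only asks for "at least one" shared cluster, the check accepts overlapping clusters. The check itself compares all pairs and is meant only for validating small examples.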

We apply ClusterMerge to find similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full O(n^2) comparison, with only half as many comparisons. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604 times on 768 cores (1400 times faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.

Details

Title Parallel and Scalable Precise Clustering

Author(s) Byma, Stuart ; Dhasade, Akash ; Altenhoff, Adrian ; Dessimoz, Christophe ; Larus, James R.

Published in PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Series International Conference on Parallel Architectures and Compilation Techniques

Pages 217-228

Conference ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct 03-07, 2020, ELECTR NETWORK

Date 2020-01-01

Publisher New York, Association for Computing Machinery

ISSN 1089-795X

ISBN 978-1-4503-8075-1

Keywords

bioinformatics; protein clustering; parallel algorithms; algorithm; protein; web

DOI https://doi.org/10.1145/3410463.3414646

Laboratories UPLARUS

Record creation date 2021-12-18