Generalizing Bulk-Synchronous Parallel Processing for Data Science: From Data to Threads and Agent-Based Simulations

Tian, Zilu

doi:10.5075/epfl-thesis-8865

2023

Abstract

Agent-based simulations are widely applied across many disciplines by scientists and engineers alike. Scientists use them to tackle global problems, including alleviating poverty, reducing violence, and predicting the impact of pandemics. In industry, engineers use them to reduce cost and improve efficiency by creating virtual worlds in which to model different scenarios and explore various designs with fast feedback at low cost. Agent-based simulations thus play an increasingly prominent role in modern society. Despite their significance, they have benefited little from recent progress in computer science, especially in parallel computing and data management. While there is a growing need to simulate at ever-increasing scale and in finer detail, the development of systems that support fast execution of large-scale simulations and efficient integration of simulations with existing data science pipeline operators is lagging behind. This creates new challenges and opportunities for computer scientists. In this work, we make the first foray into defining a clean semantics that serves as the foundation of agent-based simulations, an abstraction that helps users integrate simulations into data science pipelines, a scalable system architecture with efficient optimizations, and a high-level, user-friendly programming model. In particular, we generalize the bulk-synchronous parallel (BSP) processing model to better support agent-based simulations. Such simulations frequently exhibit hierarchical structure in their communication patterns, which can be exploited to improve performance. We allow for the creation of temporary artificial network partitions, during which agents synchronize only locally within their group, in a way that does not compromise the correctness of a simulation.
We also propose to encapsulate simulations via a Simulate operator, which enables users to compose and nest simulations just like other data science pipeline operators. In addition, we have designed and developed CloudCity, an open-source distributed system for large-scale agent-based simulations, which implements our semantics to improve the locality of computation, communication, and synchronization in simulations. The system contains optimizations that allow fast execution and efficient querying of simulation results. To accommodate users from different backgrounds, we have also developed a user-friendly domain-specific language (DSL) embedded in the programming language Scala, which allows users to write parallel agent programs easily, even with little or no background in distributed computing. We experimentally evaluate the performance of our system on a benchmark suite of agent-based simulations and compare it against existing state-of-the-art BSP-like distributed systems, including Spark, GraphX, Giraph, and Flink Gelly, obtaining insights into the impact of various system design choices and optimizations on simulation engine performance.
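
The idea of synchronizing locally within temporary groups can be illustrated with a minimal Python sketch. It is not CloudCity's API and not the thesis's DSL: the function names (`run_hierarchical_bsp`, `make_ring_step`), the fixed group assignment, and the policy of holding cross-group messages until the next global barrier are all assumptions made for illustration. The thesis itself defines the conditions under which local synchronization preserves correctness.

```python
from collections import defaultdict


def run_hierarchical_bsp(agents, groups, step, global_rounds, local_rounds):
    """Toy two-level BSP loop (illustrative only).

    agents: dict mapping agent id -> state
    groups: dict mapping agent id -> group id (hypothetical partition)
    step:   function (agent_id, state, inbox) -> (new_state, [(dest_id, msg), ...])
    """
    inbox = defaultdict(list)  # messages visible in the current local round
    for _ in range(global_rounds):
        # Cross-group messages are held back until the global barrier
        # (an assumption of this sketch).
        deferred = defaultdict(list)
        for _ in range(local_rounds):
            next_inbox = defaultdict(list)
            for aid in sorted(agents):
                agents[aid], outgoing = step(aid, agents[aid], inbox[aid])
                for dest, msg in outgoing:
                    if groups[dest] == groups[aid]:
                        next_inbox[dest].append(msg)  # delivered after the local barrier
                    else:
                        deferred[dest].append(msg)    # delivered after the global barrier
            inbox = next_inbox  # local barrier: only group-internal messages move
        for dest, msgs in deferred.items():  # global barrier: release cross-group messages
            inbox[dest].extend(msgs)
    return agents


def make_ring_step(n):
    """Hypothetical agent behaviour: count received messages, ping the next agent in a ring."""
    def step(aid, state, inbox):
        return state + len(inbox), [((aid + 1) % n, "ping")]
    return step


if __name__ == "__main__":
    states = {i: 0 for i in range(4)}
    partition = {0: "a", 1: "a", 2: "b", 3: "b"}
    print(run_hierarchical_bsp(states, partition, make_ring_step(4),
                               global_rounds=2, local_rounds=2))
```

With `local_rounds=1` and all agents in one group, the loop reduces to ordinary BSP with a single global barrier per superstep; larger `local_rounds` values trade global barriers for cheaper group-local ones.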

Details

Title Generalizing Bulk-Synchronous Parallel Processing for Data Science: From Data to Threads and Agent-Based Simulations

Author(s) Tian, Zilu

Advisor(s) Koch, Christoph

Pagination 182

Date 2023

Publisher Lausanne, EPFL

Keywords agent-based simulations; distributed systems; bulk-synchronous parallel processing; compilation; query languages

Language English

DOI https://doi.org/10.5075/epfl-thesis-8865

Laboratories DATA (Data Analysis Theory and Applications Laboratory), IINFCOM, I&C - School of Computer and Communication Sciences

Record creation date 2023-08-24
