Interactive-time Exploration, Querying, and Analysis of Large High-dimensional Datasets

Basil John, Sachin

doi:10.5075/epfl-thesis-9017

2023

Abstract

Aggregation queries on high-dimensional datasets are widely used to uncover patterns, trends, and correlations that inform business decisions. Data cubes speed up such queries through pre-computation, but traditional data cube techniques struggle with hundreds of dimensions because the storage and time required for pre-computation grow exponentially with the number of dimensions. This thesis presents Sudokube, a data cube system designed for efficient querying of high-dimensional data. Sudokube supports high-dimensional data cubes with interactive query speeds and moderate storage costs. It does so by materializing only a judiciously chosen subset of the cuboids of a binary-domain data cube and quickly reconstructing the missing cuboids using statistical or linear programming techniques. The thesis describes how Sudokube loads data, selects cuboids for materialization, and stores them in formats that optimize space and projection time. It investigates the solvers used to reconstruct non-materialized cuboids and compares them in depth with respect to speed, accuracy, and resource requirements. It also describes the queries and aggregation functions that Sudokube supports, backed by extensive experiments on real-world and synthetic datasets. In conclusion, the thesis provides a comprehensive examination of Sudokube and positions it as an effective solution to the complexities of high-dimensional data exploration. The research advances high-dimensional data analysis by enabling users to carry out exploratory data analysis for feature engineering, removing the need to compromise when loading data into a data cube, and improving the performance of queries with hierarchical dimensions. These insights underline Sudokube's potential to advance data science methodologies and to open new avenues in big data analysis.

Details

Title Interactive-time Exploration, Querying, and Analysis of Large High-dimensional Datasets

Author(s) Basil John, Sachin

Advisor(s)

Koch, Christoph

Pagination 170

Date 2023

Publisher Lausanne, EPFL

Keywords (free)

data cubes; online aggregation; high-dimensional; query approximation; online analytical processing; moments; linear programming; iterative proportional fitting; data exploration

Language English

DOI https://doi.org/10.5075/epfl-thesis-9017

Laboratories DATA

Appears in Scientific output and expertise > I&C - School of Computer and Communication Sciences > IINFCOM > DATA - Data Analysis Theory and Applications Laboratory
Scientific output and expertise > EPFL Theses
Work produced at EPFL
Published
Theses

Record created 2023-08-24
