Abstract

In the era of big data, aggregation queries on high-dimensional datasets are frequently used to uncover hidden patterns, trends, and correlations that are critical for business decision-making. Data cubes speed up such queries through pre-computation, but traditional data cube techniques struggle with hundreds of dimensions because the storage and time required for pre-computation grow exponentially with the number of dimensions. This thesis presents Sudokube, a data cube system designed for efficient querying of high-dimensional data. Sudokube supports high-dimensional data cubes with interactive query speeds and moderate storage costs. It is based on judiciously chosen, partially materialized binary-domain data cubes and on quickly reconstructing missing cuboids using statistical or linear programming techniques. The thesis details Sudokube's functionality: data loading, the selection of cuboids for materialization, and storage formats that optimize space and projection time. It investigates the solvers used to reconstruct non-materialized cuboids and compares them in depth with respect to speed, accuracy, and resource requirements. It also describes the queries and aggregation functions that Sudokube supports, backed by extensive experiments on real-world and synthetic datasets that demonstrate Sudokube's capabilities. In conclusion, this thesis provides a comprehensive examination of Sudokube and presents it as an effective solution to the inherent complexities of high-dimensional data exploration. The research is a substantial advance for high-dimensional data: it enables users to undertake exploratory data analysis for feature engineering, removes the need to compromise when loading data into a data cube, and improves the performance of queries with hierarchical dimensions. The insights from this work underline Sudokube's potential to foster advances in data science methodologies and to open new avenues in big data analysis.
