Abstract

There is increasing interest in document layout representation learning and understanding. The Transformer, with its strong modeling capacity, has become the mainstream architecture and achieved promising results in this area. Because elements in a document layout carry multi-modal and multi-dimensional features such as position, size, and text content, prior works represent each element by summing all feature embeddings into one unified vector in the input layer, which is then fed into self-attention for element-wise interaction. However, this simple summation can entangle correlations among heterogeneous features and introduce noise into representation learning. In this paper, we propose a novel two-step disentangled attention mechanism that allows more flexible feature interactions in self-attention. Furthermore, inspired by principles of document design (e.g., contrast, proximity), we propose an unsupervised learning objective to constrain the layout representations. We verify our approach on two layout understanding tasks, namely element role labeling and image captioning. Experimental results show that our approach achieves state-of-the-art performance. Moreover, we conduct extensive studies and observe better interpretability with our approach.
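To make the contrast concrete, the sketch below is an illustrative assumption rather than the authors' released code: module names, the choice of features (content, position, size), and all dimensions are invented for demonstration. It compares the common summed-embedding input described above with a disentangled formulation in which each feature type keeps its own query/key projections and the attention logits are accumulated per feature pair instead of mixing everything in the input layer.

```python
# Illustrative sketch only (not the paper's implementation).
import torch
import torch.nn as nn


class SummedInput(nn.Module):
    """Baseline: fuse heterogeneous features by summation in the input layer."""

    def __init__(self, vocab_size=1000, num_pos=64, num_size=64, d=128):
        super().__init__()
        self.content = nn.Embedding(vocab_size, d)
        self.position = nn.Embedding(num_pos, d)
        self.size = nn.Embedding(num_size, d)

    def forward(self, content_ids, pos_ids, size_ids):
        # One unified vector per element; feature identities are mixed together.
        return self.content(content_ids) + self.position(pos_ids) + self.size(size_ids)


class DisentangledScores(nn.Module):
    """Sketch of a disentangled attention score: separate projections per
    feature type, with attention logits summed over feature-pair terms."""

    def __init__(self, d=128):
        super().__init__()
        self.q_content, self.k_content = nn.Linear(d, d), nn.Linear(d, d)
        self.q_geom, self.k_geom = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, content_emb, geom_emb):
        # content_emb, geom_emb: (batch, num_elements, d).
        # Content-to-content and geometry-to-geometry terms stay separate,
        # so heterogeneous features interact only where intended.
        logits = (
            self.q_content(content_emb) @ self.k_content(content_emb).transpose(-2, -1)
            + self.q_geom(geom_emb) @ self.k_geom(geom_emb).transpose(-2, -1)
        ) / content_emb.size(-1) ** 0.5
        return torch.softmax(logits, dim=-1)
```

The design point the sketch illustrates is that keeping per-feature projections lets the model control which feature pairs contribute to each attention score, whereas the summed input forces all features through a single mixed interaction term.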
