Mitigating Class Imbalance in Tabular Data through Neural Network-based Synthetic Data Generation: A Comprehensive Survey and Library

Overview of our approach to creating new synthetic tabular datasets using neural network-based generation.

Abstract

Imbalanced datasets often bias downstream models toward majority classes, a critical challenge in deep learning, where large amounts of data are pivotal for optimal performance. Traditional solutions, such as classical data augmentation, often struggle with nuanced data traits and lack adaptability. The emergence of deep learning techniques like Autoencoders (AEs), Generative Adversarial Networks (GANs), Diffusion Models (DMs), and Large Language Models (LLMs) opens promising avenues for addressing class imbalance through synthetic data generation. This paper presents a comprehensive survey of generative AI techniques for mitigating class imbalance in tabular datasets. These methods have the potential to improve the performance and efficiency of data-driven models across multiple domains. We evaluate their effectiveness in real-world applications such as handball play classification, income level prediction, and used car evaluation, and we also introduce computational efficiency tests, an often-overlooked aspect in this field. In addition to the survey, we present ‘GenTab’, a synthetic tabular data generation library that facilitates the implementation and evaluation of the discussed approaches.

Publication
TechRxiv
Omar A. Mures
Instructor

My research interests include Deep Learning, Computer Vision and Computer Graphics.