Preprocessing for Data-Driven Modeling with Probability Density Estimation

Peter, Timm J.

doi:10.25819/ubsi/10875

Citation Link: https://doi.org/10.25819/ubsi/10875

Translated Title

Datenvorverarbeitung für datengetriebene Modellierung mittels Dichteschätzung

Source Type

Doctoral Thesis

Series

Schriftenreihe der Arbeitsgruppe Mess- und Regelungstechnik - Mechatronik, Department Maschinenbau

Author

Peter, Timm J.

Institute

Institut für Mechanik und Regelungstechnik - Mechatronik

Subjects

Machine learning

Dataset selection

Data-driven modeling

DDC

620 Engineering and allied operations

GHBS-Classes

WFN

WGA

Issue Date

2026

Abstract

In engineering, the modeling of complex systems plays a central role. Increasing computing power and storage capacity, together with the trend towards deep neural networks, are leading to ever more data being stored. This dissertation addresses two main challenges that arise when handling large amounts of data for data-driven modeling: first, selecting a subset that is representative of the dataset from which it is drawn; second, handling imbalanced datasets, i.e., datasets with regions of higher and lower point density.
The first challenge is addressed by developing a novel subset selection algorithm based on kernel density estimation. The method ensures that the selected subset is representative of the original dataset or of any desired distribution. A sophisticated yet simple approach to evaluating the estimated density saves computing time. The second challenge is addressed by introducing a data weighting method that extends the standard loss function. The weights of the individual data points are adjusted so that points from sparser regions are weighted higher and points from denser regions are weighted lower, in order to achieve a more balanced model performance. This approach is independent of the model architecture and suitable for any training algorithm.
The effectiveness of the developed methods is demonstrated using benchmark datasets and real-world application examples, including the thermal modeling of a permanent magnet synchronous motor and a cold forming process. The results show that the presented subset selection method effectively selects representative datasets and is on par with state-of-the-art approaches to subset selection. Additionally, the method can select a subset that represents an arbitrary desired probability density function (PDF), which gives the user considerable design freedom. The data weighting method typically yields significant performance improvements for dynamic models, especially for imbalanced training datasets. Overall, these contributions advance data-driven modeling methods and offer practicable solutions for real-world challenges.
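
The density-based weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual weighting formula used in the thesis, so the code below assumes a simple inverse-density rule: each weight is proportional to one over a Gaussian kernel density estimate at the data point, normalized to a mean of one. The function names, the one-dimensional setting, the bandwidth value, and the toy data are all hypothetical and chosen only for illustration.

```python
import math


def gaussian_kde_1d(points, bandwidth):
    """Return a function that estimates the density of `points` (1-D, Gaussian kernel)."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def inverse_density_weights(points, bandwidth):
    """Assumed rule (not taken from the thesis): w_i ~ 1 / p_hat(x_i), mean-normalized."""
    density = gaussian_kde_1d(points, bandwidth)
    # The density at a data point is always positive because the point's own
    # kernel contributes, so the division is safe.
    raw = [1.0 / density(x) for x in points]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error extended by per-sample weights."""
    num = sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred))
    return num / sum(weights)


# Toy data: a dense cluster near 0 and one isolated point at 5.
x = [0.0, 0.1, 0.2, 0.3, 5.0]
w = inverse_density_weights(x, bandwidth=0.5)

# Hypothetical targets and predictions; the model is wrong only at the sparse point.
y_true = [0.0, 0.1, 0.2, 0.3, 5.0]
y_pred = [0.0, 0.1, 0.2, 0.3, 4.0]
plain_loss = weighted_mse(y_true, y_pred, [1.0] * len(x))
weighted_loss = weighted_mse(y_true, y_pred, w)
```

In this toy case the isolated point receives the largest weight, so an error there contributes more to the weighted loss than to the unweighted one. Because the weights enter only through the loss, such a scheme does not depend on the model architecture, which matches the property claimed in the abstract.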

DOI

10.25819/ubsi/10875

URN

nbn:de:hbz:467-87776

URI

https://dspace.ub.uni-siegen.de/handle/ubsi/8777
