Citation Link: https://doi.org/10.25819/ubsi/10875
Preprocessing for Data-Driven Modeling with Probability Density Estimation
Translated Title
Datenvorverarbeitung für datengetriebene Modellierung mittels Dichteschätzung
Source Type
Doctoral Thesis
Author
Issue Date
2026
Abstract
In engineering, the modeling of complex systems plays a central role. Increasing computing power and storage capacity, together with the trend towards deep neural networks, mean that ever more data is being stored. This dissertation addresses two main challenges that arise when handling large amounts of data for data-driven modeling: first, choosing a subset that is representative of the dataset from which it is selected; second, handling unbalanced datasets, i.e., datasets with regimes of higher and lower point density.
The first challenge is addressed by developing a novel subset selection algorithm based on kernel density estimation. The method ensures that the selected subset is representative of the original dataset or of any desired distribution. A sophisticated yet simple approach to evaluating the estimated density saves computing time. The second challenge is addressed by introducing a data weighting method that extends the standard loss function. The weights of the individual data points are adjusted so that points from regions of lower point density are weighted higher and points from regions of higher point density are weighted lower, in order to ensure a more balanced model performance. This approach is independent of the model architecture and suitable for any training algorithm.
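The weighting idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes one-dimensional inputs, a Gaussian kernel with a hand-picked bandwidth, weights proportional to the inverse of the estimated density (normalized to mean 1), and a weighted mean squared error as the extended loss. The function names `gaussian_kde_1d`, `inverse_density_weights`, and `weighted_mse` are hypothetical.

```python
import math


def gaussian_kde_1d(points, bandwidth):
    """Return a function that estimates the density of `points` with Gaussian kernels."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def inverse_density_weights(points, bandwidth):
    """Assumed scheme: weight each point by 1/density, then normalize to mean 1."""
    density = gaussian_kde_1d(points, bandwidth)
    raw = [1.0 / density(p) for p in points]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error: sum(w * e^2) / sum(w)."""
    total = sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred))
    return total / sum(weights)


# Toy data: a dense cluster near 0 and a single sparse point at 5.
x = [0.0, 0.1, 0.2, 0.3, 5.0]
w = inverse_density_weights(x, bandwidth=0.5)
```

In this toy example the isolated point at 5.0 receives a larger weight than the points in the cluster, so its error contributes more to `weighted_mse`. Because the weights only enter the loss, the same idea carries over to any model or optimizer that accepts per-sample weights.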
The effectiveness of the developed methods is demonstrated on benchmark datasets and real-world application examples, including the thermal modeling of a permanent magnet synchronous motor and a cold forming process. The results show that the presented subset selection method effectively selects representative datasets and is on par with state-of-the-art approaches to subset selection. Additionally, the method can select a subset that represents an arbitrary desired probability density function (PDF), which gives the user considerable design freedom. The introduced data weighting method typically yields significant performance improvements for dynamic models, especially for imbalanced training datasets. Overall, this work contributes to the further development of data-driven modeling methods and offers practicable solutions for real-world challenges.
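To make the notion of selecting a subset that represents a desired density more concrete, the following sketch shows one simple greedy heuristic. It is an assumption made for illustration, not the algorithm developed in the thesis: at each step it adds the candidate point at which the kernel density estimate of the current subset falls furthest below the target density. The names `kde` and `select_subset`, the one-dimensional data, and the fixed bandwidth are all hypothetical.

```python
import math


def kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(points, bandwidth, x):
    """Kernel density estimate of `points` evaluated at x (0.0 for an empty set)."""
    if not points:
        return 0.0
    return sum(kernel((x - p) / bandwidth) for p in points) / (len(points) * bandwidth)


def select_subset(data, k, bandwidth, target=None):
    """Greedily pick k points whose density estimate tracks `target`.

    If no target density is given, the density estimate of the full
    dataset is used, so the subset aims to be representative of `data`.
    """
    if target is None:
        target = lambda x: kde(data, bandwidth, x)
    target_vals = [target(x) for x in data]
    remaining = list(range(len(data)))
    chosen = []
    for _ in range(k):
        subset = [data[i] for i in chosen]
        # Pick the point where the subset density falls furthest below the target.
        best = max(
            remaining,
            key=lambda i: target_vals[i] - kde(subset, bandwidth, data[i]),
        )
        chosen.append(best)
        remaining.remove(best)
    return [data[i] for i in chosen]


data = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1]
representative = select_subset(data, k=3, bandwidth=0.5)
uniform_like = select_subset(data, k=2, bandwidth=0.5, target=lambda x: 0.1)
```

Passing a constant `target` asks for a subset that spreads evenly over the covered range, whereas the default reproduces the density of the original data. This naive version re-evaluates the density estimate for every candidate in every step, so its cost grows quickly with the dataset size.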
File(s)
Name
Dissertation_Peter_Timm.pdf
Size
10.2 MB
Format
Adobe PDF
Checksum
(MD5):c5e3f040353cc6bc17d3cf1666ea6e98
Owning collection

