Citation Link: https://doi.org/10.25819/ubsi/10852
On the Complementarity of Video and Inertial Data for Human Activity Recognition
Alternate Title
Zur Komplementarität von Video- und Inertialdaten für die Erkennung menschlicher Aktivitäten
Source Type
Doctoral Thesis
Author
Institute
Subjects
Human Activity Recognition
DDC
004 Computer science
Issue Date
2025
Abstract
With research in fields such as psychiatry having shown strong links between activities and behavior, there has been a growing interest in the development of automatic activity recognition systems using machine learning methods, also known as Human Activity Recognition (HAR). Within the last decade, Deep Learning methods have surpassed classical machine learning models in performance and have become the de facto standard learning-based approach for sensor-based HAR. While Deep Learning has largely automated feature extraction from inertial data, reducing the dependence on expert-crafted features, it has inadvertently introduced new challenges to the activity recognition community. This dissertation, structured in two parts, addresses two of these core challenges associated with applying deep learning to inertial-based HAR by leveraging concepts and methodologies from the domain of computer vision.
The first part of the dissertation focuses on the so-called labeling bottleneck, which denotes the considerable manual effort and cost associated with annotating data from wearable inertial sensors. This issue has significantly limited the scale and complexity of publicly available HAR benchmark datasets, thereby negatively affecting methodological progress. In an effort to decrease annotator workload, a weak-annotation pipeline is proposed that, by leveraging the discriminative capabilities of vision foundation models, requires labels only for representative segments of a synchronously recorded video stream. The second part examines the sliding window problem, referring to the temporal modeling limitations caused by HAR approaches relying on fixed-length window classification. Showcasing a reformulated view of inertial-based HAR, this dissertation introduces vision-based Temporal Action Localization (TAL) into the inertial domain. Benchmark experiments demonstrate that both existing TAL models from the video domain and a newly proposed TAL-inspired architecture for inertial data significantly outperform classical inertial HAR models. By leveraging inter-segment temporal context, both approaches also exhibit reduced sensitivity to hyperparameters selected during segmentation.
The demonstrated use cases show how recent advancements in video-based activity recognition can help overcome limitations inherent to inertial sensing. While each approach exhibits certain constraints, these works offer novel perspectives on long-standing issues and introduce methodologies that, if adopted, could inspire further research and innovation within the inertial HAR community.
Description
Cumulative dissertation
File(s)
Name
Dissertation_Bock_Marius.pdf
Size
10.36 MB
Format
Adobe PDF
Checksum
(MD5):2c15fa9d48555d1179ddbd3cfb8ab88b