Calendar
7 February
Axel Berg's thesis defense
Axel will defend his thesis "Machine Learning for Perception and Localization: Efficient and Invariant Methods".
Abstract:
This thesis covers a set of methods related to machine perception and localization, which are two important building blocks of artificial intelligence. In Paper I, we explore the concept of regression via classification (RvC), which is often used for perception tasks where the target variable is ordinal or where the distance metric of the target space is not well suited as an objective function. However, it is not clear how the discretization of the target variable ought to be done. To this end, we introduce the concept of label diversity and propose a new loss function based on concepts from ensemble learning that can be used for both ordinal and continuous targets. Papers II and III deal with applying the concept of self-attention to different data domains. In Paper II, we focus on point clouds, which are modeled as unordered sets in 3D space. Although applying self-attention to sets is straightforward, we find that this mechanism in itself is not enough to improve feature learning. Instead, we propose a hierarchical approach inspired by graph neural networks, where self-attention is applied both to patches of points and to points within the patches. This results in improved predictive performance and reduced computational cost, while preserving invariance to permutations of points in the set. Auditory perception is an important part of machine intelligence, and in Paper III we explore the use of self-attention for speech recognition. Using a simple Transformer architecture, we achieve state-of-the-art performance for speech classification. However, deploying speech recognition models in real-world scenarios often involves making trade-offs between predictive performance and computational cost. In Paper IV, we therefore explore floating-point quantization of neural networks in the context of federated learning and propose a new method that allows training to be performed on low-precision hardware. More specifically, we propose a method for quantization-aware training and server-to-device communication in 8-bit floating point. This allows for a significant reduction in the amount of data that needs to be communicated during the training process. Building upon the results in Paper III, we also show that our Transformer-based model can be quantized and trained in a realistic federated speech recognition setup and still achieve good performance. Papers V, VI and VII also deal with auditory perception, but from the localization point of view. This involves processing signals from microphone arrays and extracting spatial cues that enable the system to infer the location of the sound source. One such cue is the time difference of arrival (TDOA), which is estimated by correlating signals from different pairs of microphones. However, measuring TDOA in adverse acoustic conditions is difficult, which motivates the use of machine learning for this task. In Paper V, we propose a learning-based extension of a classical method for TDOA estimation that improves prediction accuracy, while simultaneously preserving some of the properties of the classical method. This is achieved by using a model architecture that is equivariant to time shifts together with an RvC training objective. TDOA estimates are often used as input to sound source localization (SSL) systems.
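As an aside, the classical correlation-based TDOA estimation that Paper V extends can be illustrated with a short sketch. Below is a minimal GCC-PHAT (generalized cross-correlation with phase transform) baseline in Python; this is a generic textbook method, not the thesis code, and the function name, signal lengths, and sampling rate are illustrative assumptions.

import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    # Estimate the delay of `sig` relative to `ref`, in seconds.
    n = sig.size + ref.size
    # Cross-power spectrum of the two microphone signals.
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    # Phase transform: discard magnitude, keep phase, which sharpens
    # the correlation peak under reverberation.
    r /= np.abs(r) + 1e-12
    cc = np.fft.irfft(r, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center so that lag 0 sits in the middle of the array.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    # The correlation peak gives the delay in samples.
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy usage: one source, two microphones, 5-sample delay.
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(4096)
mic1, mic2 = src, np.roll(src, 5)
print(gcc_phat(mic2, mic1, fs))  # approximately 5 / 16000 seconds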
In Paper VI, we extend the method from Paper V to predict TDOAs from multiple overlapping sound sources and show that this is a good pretraining task for extracting correlation features for an SSL system, with improved localization performance compared to popular hand-crafted input features. In Paper VII, we instead focus on a single sound source, but with a variable number of microphones in the array. Most machine learning methods for SSL are trained using a specific microphone array setup and will not work if a microphone is turned off or moved to a different position. We solve this problem by modeling pairs of audio recordings and microphone coordinates as nodes in a multimodal graph. This enables the use of an attention-based autoencoder model that infers the location of the sound source using both microphone coordinates, i.e. a set of points in 3D space, and audio features, while preserving invariance to permutations of microphones. Furthermore, we address variants of the problem where data is partially missing, such as signals from a microphone at an unknown location.
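As a rough illustration of the permutation-invariance idea behind Paper VII, the sketch below applies self-attention to a set of microphone "nodes" (3D coordinates concatenated with audio feature vectors) and mean-pools the result into an order-invariant set representation. All dimensions, weight matrices, and the pooling readout are illustrative assumptions, not the thesis architecture.

import numpy as np

rng = np.random.default_rng(0)
n_mics, d_audio, d_model = 6, 16, 32

# Each node pairs a microphone position with an audio feature vector.
coords = rng.standard_normal((n_mics, 3))
audio = rng.standard_normal((n_mics, d_audio))
nodes = np.concatenate([coords, audio], axis=1)  # shape (n_mics, 3 + d_audio)

# Random projections standing in for learned parameters.
W_in = rng.standard_normal((3 + d_audio, d_model))
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    return softmax(q @ k.T / np.sqrt(d_model)) @ v

# Self-attention is permutation-equivariant; mean pooling over the nodes
# then yields a set representation invariant to microphone ordering.
h = self_attention(nodes @ W_in)
set_repr = h.mean(axis=0)

# Permuting the microphones leaves the pooled representation unchanged.
perm = rng.permutation(n_mics)
h_perm = self_attention((nodes @ W_in)[perm])
assert np.allclose(set_repr, h_perm.mean(axis=0))

A regression head on top of such a pooled representation could then predict the source location regardless of how many microphones are present or how they are ordered.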
Thesis advisors:
Assoc. Prof. Magnus Oskarsson, Dr. Chuteng Zhou, Prof. Kalle Åström
Faculty opponent:
Assoc. Prof. Jesper Rindom Jensen
Aalborg University
About the event
Time:
2025-02-07 13:15 to 17:00
Location:
MH:H (Hörmander)
Contact:
magnus.oskarsson@math.lth.se