
Learning multi


1 Introduction

Multi-task learning (MTL) is an approach that exploits and transfers relevant information among tasks to help individual tasks achieve better generalization. Over the last few years, it has proved effective in many machine learning fields, such as object detection [1], image segmentation [2], image classification [3], natural language processing [4], speech recognition [5], and drug discovery [6].
Currently, most existing MTL methods assume that the learning tasks have the same label sets and use the same model [7-9], because such label sets contain abundant common knowledge that can be transferred to each task to improve the learning performance of the MTL model [10]. However, more general situations arise in the real world: each task has only a small number of training samples, and when their label sets overlap partially or not at all, there is less shared information between tasks, so learning such tasks is more challenging. To meet this challenge, in [11], authors use a modulation and gating network to automatically adjust the shared characteristics among different tasks for recommendation systems; in [12], authors learn various tasks by sharing similar convolutional kernels among multi-task networks. These methods aim to mine and exploit as much common knowledge hidden in the current tasks as possible, but for the general scenarios mentioned above they still leave room for improvement. To achieve such improvement, we re-focus on the two major issues affecting MTL.

Firstly, how to extract suitable knowledge from different tasks for the current multiple tasks. Although abundant knowledge is hidden among the tasks, it is not directly usable; how to extract the available knowledge to improve the learning performance of MTL models while avoiding the notorious negative transfer is a key issue [13]. For this reason, a large number of methods have been developed, which can be broadly classified into non-deep and deep MTL methods: 1) non-deep MTL methods build on shallow models to learn the parameters involved, e.g., in [14], authors extract useful knowledge between tasks by regularizing a task-coupled kernel function (such as a support vector machine) to predict users' product selections, and in [15], authors obtain useful knowledge between tasks by learning the same covariance matrix to predict students' test scores. 2) Deep MTL methods learn a shared representation from the individual task networks to improve their performance. E.g., in [16], authors design a feature matching network (i.e., knowledge transfer) to capture shared features in different tasks; in [17], authors use a segmented attention head module to capture useful knowledge between tasks for depth estimation; in [18], authors use a two-level graph neural network to learn useful knowledge of different tasks to improve the performance of the MTL model.

Secondly, how to design an effective MTL sharing mechanism. An effective sharing mechanism can increase the predictive performance of the MTL model by using useful knowledge between related tasks [19]. Inspired by this motivation, many classic MTL sharing mechanisms have been designed. According to whether the feature/label spaces are consistent among tasks, we can divide these mechanisms into two types: homogeneous multi-task sharing and heterogeneous multi-task sharing, as shown in Tab.1. Homogeneous multi-task sharing learns a shared representation in the same feature and label spaces. According to the sharing format, it can be divided into 1) hard sharing based: methods of this type assume that all tasks share knowledge in the same hidden space. For example, in [20], authors perform semantic segmentation and depth prediction of images by aggregating features in specific layers between tasks.
2) Soft sharing based: methods of this type assume that all task models and parameters are independent, and the distance between model parameters is regularized to obtain similar parameters for joint learning. E.g., in [21], authors use an attention mechanism to share parameters in specific layers between different tasks to identify symptoms of depression. 3) Mixed sharing based: methods of this type use a task-specific strategy to select the layers of the multi-task network model in which shared learning can be performed. Typically, in [22], authors use a specific task strategy to mix common features with the current tasks for image semantic segmentation and normal estimation.

Heterogeneous multi-task sharing learns a shared representation over tasks with heterogeneous feature or label spaces, and can be divided into: 1) Sparse sharing based: methods of this type form a sub-network appropriate for each individual task from an over-parameterized base network, and extract common knowledge from the overlapping parts of the sub-networks through a sparsity strategy. For example, in [23], authors extract shared parameters as common knowledge to learn individual tasks by applying a mask to the overlapping part of the sub-networks. 2) Gradient sharing based: methods of this type use similarity measures to quantify the gradient differences between tasks and compute non-negative weights for these tasks, thereby constructing a shared gradient. For example, in [24], authors measure the gradient difference among individual tasks by cosine distance to predict hospital mortality. 3) Hierarchical sharing based: methods of this type perform hierarchical sharing for different overlapping areas between multiple tasks. For example, in [25], authors learn common knowledge from different levels of multiple task networks for natural language processing. Unfortunately, most of the above works are designed for the scenario where the label sets are the same among tasks, rather than for the scenario where the label sets partially overlap or do not overlap at all.
Tab.1 Comparison of various MTL sharing mechanisms
SM¹ Homo² Hete³ Super⁴ Methods⁵
Hard × [20,25]
Soft × [26,27]
Mixed × [28,29]
Sparse [30,31]
Gradient × [24,32]
Hierarchical × [33,34]

¹ Sharing mechanisms. ² Homogeneous tasks. ³ Heterogeneous tasks. ⁴ Supervised learning. ⁵ Algorithms.

Fortunately, some recent work has been developed to deal with the above problems, especially the latter. For example, in [35], authors propose to simultaneously improve online handwriting prediction and character classification by combining the cross-entropy loss with distance and similarity losses. In [36], authors design a dynamic routing protocol pattern to implement slot filling and intent detection. In [37], authors propose to use a restricted softmax instead of the standard softmax for data with non-IID label distributions. However, these methods still aim to mine common knowledge hidden in all prediction tasks. Since the label sets among these tasks only partially overlap, or even have no overlap, they have less knowledge available than traditional MTL problems, and designing various learning mechanisms alone is not enough to obtain more usable common knowledge from these tasks to improve MTL model performance. Interestingly, in a recent study on domain adaptation [38], the authors argue that the generalization and effectiveness of feature representations can be further improved by transferring sufficient information from multiple source domains to the target domain. Inspired by this, we propose a new auxiliary-task-based deep MTL framework (ADMTL), which leverages a big task with sufficient information to assist the above tasks in learning efficiently. This auxiliary task not only contains abundant class information but also covers the classes of all learning tasks. Specifically, as shown in Fig.1, we directly introduce a well-trained auxiliary network with the same structure as the MTL network. Next, we design a novel knowledge selection strategy for extracting the available information from the auxiliary task network to assist each task's learning. The key idea of this strategy is to use a set of soft mask matrices to adaptively prune the neurons in the hidden layers of the auxiliary network to extract available knowledge, and to construct the corresponding task-specific network for each task. Finally, we learn the ADMTL network in an end-to-end manner. In summary, the contributions of this paper are as follows:
Fig.1 Our proposed ADMTL networks. The network consists of three identical independent task networks. On the left is the auxiliary task network, and on the right is the multi-task learning network. The blue and red arrows indicate the directions in which the knowledge of the auxiliary task is transferred to the multiple tasks, and the long light-blue box indicates shared knowledge

● A new framework is proposed to address MTL scenarios with partially overlapping and non-overlapping label sets. It assists the efficient learning of such MTL scenarios by leveraging a learned large auxiliary task with sufficiently abundant class information, without adding any hyper-parameters.
● A novel knowledge selection strategy is designed to improve the generalization performance of each task. It adaptively prunes the hidden-layer neurons in the auxiliary task network by introducing a set of soft mask matrices to extract auxiliary knowledge, and constructs a corresponding task-specific network for each task.
● Extensive experiments on multiple datasets with different settings demonstrate the significant competitiveness of our model in comparison with state-of-the-art methods.
The rest of this paper is arranged as follows. In Section 2, we briefly review related work in multi-task learning. In Section 3, we introduce the architecture of ADMTL, give the problem definition, and provide some related theoretical analysis. In Section 4, we present image classification results on benchmark datasets. Finally, we conclude in Section 5. The code is available on GitHub.

2 Related work

MTL has good performance in many applications, especially in the field of computer vision, so it has attracted a lot of attention in recent years. In this section, we briefly review the related works of MTL based on shared task features and MTL based on shared model parameters.

2.1 MTL based on shared task features

The methods of this class usually assume that a common feature representation can be learned from the individual tasks. According to their implementation manner, they can roughly be divided into three sub-types:
1) Selective sharing of task features: for tasks in the same subspace, sharing is realized by specifically regularizing the features among tasks. Typically, in [39], authors use the $\ell_2$ norm to regularize the task weight matrices to extract shared features for predicting school students' test scores. In [40], authors use the $\ell_{1,2}$ norm to regularize the weight matrices to extract shared features between tasks for learning multiple tasks with different feature dimensions. In [41], authors use the $\ell_{2,1}$ norm to regularize the weight matrices of various modal tasks to jointly select common features for multi-modal classification of Alzheimer's disease.
2) Prior knowledge sharing of tasks: for tasks defined in the same subspace, sharing is realized by using the same prior knowledge among tasks. Typically, in [42], authors embed prior knowledge (i.e., pathological images with different magnifications belong to the same subclass) into the feature extraction process among different tasks to verify the relationship between tasks and pathological image categories for fine-grained classification and pathological image classification. In [43], authors use a kind of meta data (i.e., contextual attributes) as prior knowledge to capture the relationship between different tasks for multi-task clustering. In [44], authors use the same subclass of the gland area as prior information in a convolutional neural network to guide the network inference for pathological colon image analysis.
3) Transformation sharing of task features: for the tasks represented in the same subspace, they realize sharing by performing the nonlinear transformations of the original feature representation among tasks. Typically, in [45], authors use a set of non-linearly transformed feature sharing units for image semantic segmentation and normal estimation. In [46], authors use the feature adapter to learn the non-linear transformation of the task features to automatically evaluate the child's speech ability.

2.2 MTL based on shared model parameters

The methods of this class usually associate different tasks through part of their model parameters or weights to realize sharing. According to the learning manner used, they can roughly be divided into three sub-types:
1) Weighted sharing of weight matrices: for tasks represented in the same subspace, sharing is realized by a weighted combination of the weight matrices among tasks. Typically, in [47], authors weight the weight matrices among tasks for boundary classification of keywords. In [48], authors partition the weight matrices among tasks into common and private parts, then weight the common part for multi-label classification. In [49], authors weight the weight matrices at the same spatial positions in the images and transfer them to each task for image depth estimation, segmentation, and surface normal prediction.
2) Common factor sharing via decomposing individual weight matrices: the weight matrix of each task model is decomposed into private and common parts, where the common part is used for sharing. Typically, in [50], authors decompose the weight matrices of multiple task models into common and private parts, and further use the common part for visual target tracking. In [51], authors sparsely decompose the parameter tensor of the prediction model into multiple parameter matrices, and linearly combine the corresponding parameter matrices into a set of base matrices for sharing. In [52], authors decompose a collective matrix of drug-disease correlations to share the correlation matrix between them for drug discovery.
3) Low-rank structure sharing of model weight matrices: for tasks represented in the same subspace, the low-rank structure of the weight matrices among tasks is captured by specific regularization to realize sharing. Typically, in [53], authors flatten the feature tensors of different tasks (i.e., a convex combination of matrix trace norms) to capture their low-rank structure for multi-task learning. In [54], authors use a set of low-rank matrices to capture the potential relationships between multiple tasks for Parkinson's disease diagnosis. In [55], authors use a set of low-rank matrices constrained by the nuclear norm for target detection in hyper-spectral images.

Our work follows the first line of research in that it extracts available knowledge from a well-trained large task model to assist in improving the predictive performance of the MTL model. First, we directly introduce a trained auxiliary large-task network with the same structure as the MTL network. Then, we use a set of soft mask matrices to automatically extract available knowledge from the well-trained auxiliary task network and build a specific network corresponding to each task. Finally, end-to-end cross-task learning is performed on the multi-task networks.

3 Our method

In this section, we propose to leverage a big auxiliary task with rich label and class information to solve the MTL problem in which the label sets between tasks only partially overlap or even do not overlap. We first introduce the problem setting of MTL. Then, we describe our method in detail according to Fig.2. The entire ADMTL network is learned adaptively without adding any hyper-parameters.
Fig.2 Illustration of the ADMTL network. In the auxiliary network, the different colored cubes denote the knowledge extracted in each convolutional layer. In the multi-task learning network, the different colored filled circles denote neurons, while the dashed circles are the pruned neurons. $\odot$ denotes the Hadamard product

3.1 Problem formulation

Given a big auxiliary task $T_{aux}$ and a dataset $D_{aux}=\{x_i,y_i\}_{i=1}^{N}$ containing $N$ samples, with $x_i\in\mathbb{R}^d$ and its associated label $y_i\in\{1,\dots,c\}$, where $d$ and $c$ are the numbers of dimensions and classes in $D_{aux}$, respectively. Meanwhile, we are given $M$ individual tasks $\{T_j\}_{j=1}^{M}$ and corresponding training datasets $D_j=\{x_k^j,y_k^j\}_{k=1}^{N_j}$ with $N_j$ samples, $x_k^j\in\mathbb{R}^d$ and associated label $y_k^j\in\{1,\dots,c_j\}$, where $c_j$ is the number of classes in $D_j$. We assume that there are $L$ convolutional layers in the auxiliary network, where the feature maps in the $l$th layer are denoted as $F_{aux}^{l}=\{f_{aux,T_1}^{l},f_{aux,T_2}^{l},\dots,f_{aux,T_M}^{l}\}$, and the feature maps and convolutional kernels of the corresponding $l$th layer in the multi-task network are $F_{multi}^{l}=\{f_{T_1}^{l},f_{T_2}^{l},\dots,f_{T_M}^{l}\}$ and $W_{multi}^{l}=\{w_{T_1}^{l},w_{T_2}^{l},\dots,w_{T_M}^{l}\}$, where $f_{aux,T_1}^{l},f_{T_1}^{l}\in\mathbb{R}^{W\times H}$, and $W$, $H$ denote the width and height of the feature maps, respectively. Furthermore, we assume that the class set $C_{T_{aux}}$ of the auxiliary task contains the classes of all individual tasks, namely, $C_{T_{aux}}=C_{T_1}\cup\cdots\cup C_{T_M}$, where $C_{T_i}$ and $C_{T_j}$ ($i\neq j$) can partially overlap, or even not overlap at all. This makes ADMTL applicable under more general settings than most existing MTL methods.
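To make the label-set assumption concrete, the following minimal Python sketch (not taken from the paper's code; the class names are hypothetical) checks that an auxiliary class set covers several tasks whose own label sets overlap only partially or not at all:

```python
# Minimal sketch of the Sec. 3.1 assumption; class names are hypothetical.
C_T1 = {"backpack", "bike", "keyboard", "monitor"}         # task 1 label set
C_T2 = {"keyboard", "monitor", "mug", "projector"}         # partially overlaps task 1
C_T3 = {"airplane", "butterfly"}                           # overlaps neither task
C_aux = C_T1 | C_T2 | C_T3 | {"many", "other", "classes"}  # auxiliary class set C_Taux

# the auxiliary task must cover every individual task's classes
assert all(C <= C_aux for C in (C_T1, C_T2, C_T3))
print("T1 ∩ T2:", C_T1 & C_T2)  # partial overlap
print("T1 ∩ T3:", C_T1 & C_T3)  # empty set: non-overlapping label sets
```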

3.2 Selective transfer of knowledge

The key to MTL is to improve the predictive performance of each task by using common knowledge among correlated tasks [56]. When the label sets in MTL overlap only partially or not at all, less correlated knowledge can be shared among tasks, which hinders the effective learning of MTL models. To address this issue, we assist the learning of these tasks by using the abundant class information in a learned big task. Specifically, on the one hand, when the label sets in MTL only partially overlap, we enable better shared learning by selectively transferring more suitable knowledge from the big task. On the other hand, when the label sets in MTL do not overlap, we also enhance the generalization of the feature representations for cross-task learning by selectively transferring more available knowledge from the big task. As shown in Fig.3, we design an auxiliary knowledge selection strategy that aims to extract auxiliary knowledge for each task and construct corresponding task-specific networks. To this end, we directly introduce a set of soft mask matrices
$$M^{l}=\begin{bmatrix} m_{11}^{l} & \cdots & m_{1W}^{l}\\ \vdots & m_{hw}^{l} & \vdots\\ m_{H1}^{l} & \cdots & m_{HW}^{l} \end{bmatrix}$$
in the $l$th layer of the trained auxiliary network and initialize them randomly following [57]. Then, we normalize $M^{l}$ to ensure its values lie between 0 and 1, as follows:
Fig.3 Illustration of auxiliary knowledge extraction. The green square in the middle indicates the feature map in the auxiliary network in the lth convolutional layer. The squares in the upper left and right corners indicate the soft mask matrices corresponding to the specific task, while the squares in the lower left and right corners indicate the extracted auxiliary feature maps

$$\hat{m}_{hw}^{l}=\frac{m_{hw}^{l}-\min(M^{l})}{\max(M^{l})-\min(M^{l})},$$
where $\hat{m}_{hw}^{l}$ denotes the value of the element in row $h$ and column $w$ of the normalized matrix $\hat{M}^{l}$.
The purpose of introducing the large auxiliary task is to transfer its rich knowledge to the ADMTL network and assist its learning. However, due to the overwhelming amount of information in this task, transferring it directly and indiscriminately to the smaller individual tasks often brings unnecessary redundant information, which degrades the performance of the individual task networks. Thus, we adopt a selective way to extract the auxiliary knowledge from the auxiliary network. Specifically, we sparsify $\hat{M}^{l}$ probabilistically, as follows:
$$\tilde{m}_{hw}^{l}=\begin{cases} m_{hw}^{l}, & \mathrm{rand}<\hat{m}_{hw}^{l},\\ 0, & \mathrm{rand}\geq\hat{m}_{hw}^{l}, \end{cases}$$
where $\mathrm{rand}$ denotes a randomly generated threshold in the range 0–1, and $\tilde{m}_{hw}^{l}$ denotes the element in row $h$ and column $w$ of the sparse matrix $\tilde{M}^{l}$. Among all elements of $\tilde{M}^{l}$, we retain the elements for which $\mathrm{rand}<\hat{m}_{hw}^{l}$ and set the elements with $\mathrm{rand}\geq\hat{m}_{hw}^{l}$ to 0. In this way, the higher an element value in $\tilde{M}^{l}$, the more important the extracted knowledge is for the current task and the more effective its assistance; conversely, the knowledge is unnecessary.
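A minimal PyTorch sketch of this normalization and probabilistic sparsification is given below; it is our own illustration rather than the authors' code, and the mask shape and the small epsilon added to the denominator are our assumptions.

```python
import torch

def sparsify_soft_mask(M: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a soft mask M^l and keep each original element with
    probability equal to its normalized value, zeroing it otherwise."""
    M_hat = (M - M.min()) / (M.max() - M.min() + 1e-12)  # normalized mask in [0, 1]
    rand = torch.rand_like(M_hat)                        # one random threshold per element
    return torch.where(rand < M_hat, M, torch.zeros_like(M))

# usage: a randomly initialized H x W mask for one feature map
M_l = torch.randn(7, 7)
M_tilde_l = sparsify_soft_mask(M_l)
```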
Next, we use the sparse matrix $\tilde{M}_{T_j}^{l}$ to extract the auxiliary knowledge for task-specific learning from the $l$th convolutional layer of the auxiliary network, as follows:
$$\tilde{F}_{T_j}^{l}=\tilde{M}_{T_j}^{l}\odot f_{aux,T_j}^{l},$$
where $f_{aux,T_j}^{l}$ denotes the feature maps used for task $T_j$ at layer $l$ in the auxiliary network, $\tilde{F}_{T_j}^{l}$ denotes the auxiliary features of task $T_j$ in the $l$th layer, and $\odot$ denotes the Hadamard product.
As discussed above, in order to alleviate the lack of correlated knowledge caused by partially overlapping or non-overlapping label sets in MTL, we transfer the extracted knowledge to the task-specific network to help train it, as follows:
$$\hat{F}_{T_j}^{l}=\tilde{F}_{T_j}^{l}\oplus f_{T_j}^{l},$$
where $f_{T_j}^{l}$ denotes the feature maps of task $T_j$ in layer $l$ of the ADMTL network, $\hat{F}_{T_j}^{l}$ denotes the features of task $T_j$ transferred in layer $l$ of that network, and $\oplus$ denotes the element-wise combination of the auxiliary and task feature maps.
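The sketch below illustrates the extraction and transfer steps, assuming (our assumption, since the text does not name the fusion operator explicitly) that the transferred features are fused with the task's own features by element-wise addition:

```python
import torch

def extract_and_transfer(M_tilde: torch.Tensor,
                         f_aux: torch.Tensor,
                         f_task: torch.Tensor) -> torch.Tensor:
    """Gate the auxiliary feature map with the sparse mask (Hadamard product),
    then fuse the result with the task's own feature map.
    The additive fusion below is an assumption, not the paper's stated operator."""
    F_tilde = M_tilde * f_aux   # auxiliary knowledge selected for task T_j
    F_hat = F_tilde + f_task    # features fed into the task-specific network
    return F_hat

# usage with illustrative H x W feature maps
H, W = 7, 7
F_hat = extract_and_transfer(torch.rand(H, W), torch.randn(H, W), torch.randn(H, W))
```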

3.3 Learning across-task

In this section, we improve the generalization performance of the ADMTL network through knowledge sharing. Inspired by [45], we adopt a cross-stitch-like unit to achieve cross-task learning, as follows:
$$\begin{bmatrix} \tilde{F}_{T_1}^{l}\\ \tilde{F}_{T_2}^{l}\\ \vdots\\ \tilde{F}_{T_M}^{l} \end{bmatrix}=\begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1M}\\ \vdots & \ddots & \vdots\\ \lambda_{M1} & \cdots & \lambda_{MM} \end{bmatrix}\begin{bmatrix} \hat{F}_{T_1}^{l}\\ \hat{F}_{T_2}^{l}\\ \vdots\\ \hat{F}_{T_M}^{l} \end{bmatrix},$$
where the left-hand side denotes the shared features in the $l$th layer of the ADMTL network, and the matrix of $\lambda$ values denotes the shared parameters of layer $l$, which are updated by the back-propagation algorithm of the ADMTL network.
Finally, we construct the corresponding task-specific network by multiplying the activations $\tilde{F}_{T_j}^{l}$ with $w_{T_j}^{l}$, as follows:
$$F_{T_j}^{l+1}=\sigma\left(w_{T_j}^{l}\otimes\tilde{F}_{T_j}^{l}+b_{T_j}^{l}\right),$$
where $F_{T_j}^{l+1}$ denotes the input to layer $l+1$ of the task $T_j$ network, $w_{T_j}^{l}$ is the convolution kernel of this layer, $\otimes$ denotes element-wise multiplication, $\sigma$ is the activation function, and $b_{T_j}^{l}$ is the bias vector.
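The following PyTorch module is a rough sketch of how such a cross-task sharing layer could look: a learnable $M \times M$ matrix of $\lambda$ values mixes the per-task features, and each mixed result is passed through that task's own layer. For simplicity the per-task operation is written as a standard convolution followed by ReLU, which is our simplification of the element-wise form above; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CrossTaskShare(nn.Module):
    """Cross-stitch-style sharing across M tasks at one layer (a sketch)."""
    def __init__(self, num_tasks: int, channels: int):
        super().__init__()
        # lambda matrix, initialized to the identity and updated by backprop
        self.lambdas = nn.Parameter(torch.eye(num_tasks))
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_tasks)]
        )

    def forward(self, F_hat):
        # F_hat: list of M tensors, each of shape (B, C, H, W)
        stacked = torch.stack(F_hat, dim=0)                  # (M, B, C, H, W)
        mixed = torch.einsum("ij,jbchw->ibchw", self.lambdas, stacked)
        # per-task transformation produces the input to layer l+1
        return [torch.relu(conv(mixed[i])) for i, conv in enumerate(self.convs)]

# usage: two tasks with 64-channel feature maps
share = CrossTaskShare(num_tasks=2, channels=64)
outs = share([torch.randn(4, 64, 28, 28), torch.randn(4, 64, 28, 28)])
```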

3.4 Objective function

In the ADMTL network, the objective function for task $T_j$ can be formulated as the cross-entropy loss
$$\mathcal{L}_{T_j}=-\sum_{k=1}^{N_j}y_{k}^{T_j}\log\left(\hat{y}_{k}^{T_j}\right),$$
where $\hat{y}_{k}^{T_j}$ is the predicted output.
Finally, we define the total objective function of the whole network as
$$\mathcal{L}_{total}=\sum_{j=1}^{M}\mathcal{L}_{T_j}+\left\|M\right\|_{2},$$
where $\|\cdot\|_{2}$ is the $\ell_2$-norm.
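As an illustration, the sketch below computes this objective in PyTorch. It reads the regularization term as the $\ell_2$-norm of the soft mask matrices and leaves it unweighted, since the text introduces no trade-off coefficient; both readings are our assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_per_task, targets_per_task, masks):
    """Sum of per-task cross-entropy losses plus an l2 penalty on the soft masks."""
    ce = sum(F.cross_entropy(logits, y)
             for logits, y in zip(logits_per_task, targets_per_task))
    reg = torch.norm(torch.cat([m.flatten() for m in masks]), p=2)
    return ce + reg

# usage: two tasks (10 and 5 classes), batch size 4, two 7x7 masks
loss = total_loss(
    [torch.randn(4, 10), torch.randn(4, 5)],
    [torch.randint(0, 10, (4,)), torch.randint(0, 5, (4,))],
    [torch.rand(7, 7), torch.rand(7, 7)],
)
```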
The whole process of the proposed method to solve ADMTL is summarized in Algorithm 1.

4 Experiments

In this section, we report the results on multiple datasets to validate the effectiveness of the proposed method.

4.1 Experimental settings

Datasets We conduct experiments on 5 datasets, including:
ImageNet (see the image-net.org website) is currently the largest computer vision dataset. It contains 14,197,122 images and 21,841 synset indexes, mainly covering: amphibian, animal, application, bird, covering, device, fabric, fish, flower, food, fruit, fungus, furniture, geological formation, invertebrate, mammal, musical instrument, plant, reptile, sport, structure, tool, tree, utensil, vector, vehicle and person. Each synset provides an average of 1,000 images, and each image is quality-controlled and human-annotated.
Office-Caltech (see the people.eecs.berkeley.edu website) contains the Office-Caltech 10 and Office-Caltech 31 datasets, each of which has a total of 2,533 samples, and consists of images from three different databases: Caltech, Amazon and Webcam; the smallest image is 200 × 150 and the largest is 900 × 557.
Office-Home (see the hemanthdv.org website) consists of four image subsets from different domains: Art, Clipart, Product and Real-World. Each subset has 65 categories, the dataset contains 15,500 images, and image sizes range from 117 × 85 to 4,384 × 2,686.
Caltech-256 (see the vision.caltech.edu website) is a very challenging dataset collected from Google Images, with all images that do not match their category manually filtered out. The dataset has 256 object categories and contains a total of 30,607 images. The minimum number of images for any category is increased from 31 to 80; the dataset also avoids artifacts due to image rotation and introduces a new, larger clutter class to test background suppression.
Tiny-ImageNet (see the kaggle.com/c/tiny-imagenet website) is a balanced and regular image classification dataset provided by Stanford University. It has 200 balanced categories, each with 500 training images, 50 validation images and 50 test images, and all images have the same size. In addition, the dataset provides label categories and bounding boxes. The information on the above datasets is shown in Tab.2. Furthermore, we split 70% of the data into the training set and the remaining 30% into the test set.
Tab.2 Characteristics of the experimental datasets
Data set Train Test Classes
ImageNet 9,800,000 4,200,000 1,000
Office-Caltech 22,752 372 31
Office-Home 15,700 845 65
Caltech-256 1,512 648 256
Tiny-ImageNet 100,000 20,000 200
In order to verify the generalization and effectiveness of the proposed method, we conducted two sets of multi-task learning experiments with the following setup:
Auxiliary task construction Due to the rich characteristics of the ImageNet dataset, we select it as the dataset for the auxiliary task in our experiments and make it contain the category information of all learning tasks. In addition, we pre-train the auxiliary network with the same structure as the MTL network. In the experiments, we extract the corresponding auxiliary knowledge for each task directly from this auxiliary network to assist the joint learning of multiple tasks with partially overlapping or non-overlapping label sets.
Exp1 We construct an MTL task with partially overlapping label sets. The experiment consists of three groups of MTL classification tasks, each with two tasks, including Office-Caltech and Caltech-256, Amazon and Webcam, Dlsr and Product. There are only 10 classes overlapping in the label sets between each group of tasks. Details of this experiment are shown in Tab.3.
Tab.3 Summary statistics of the datasets with partially overlapping label sets
Data set Features Overlapping classes
Office-Caltech 30,000–501,300 10
Office-Home 9,945–11,775,424 10
Amazon 15,596 10
Webcam 30,000 10
Dlsr 30,000 10
Product 9,945 10
Exp2 We also construct an MTL task with non-overlapping label sets. The experiment consists of three groups of MTL classification tasks, each with two tasks, including Art and Real World, Caltech-101 and Webcam, Amazon and Tiny-ImageNet. The label sets between each group of tasks do not overlap. Details of this experiment are shown in Tab.4.
Tab.4 Summary statistics of the datasets with non-overlapping label sets
Data set Features Overlapping classes
Art 9,945
Real World 9,945
Office-Caltech 30,000–501,300
Webcam 30,000
Amazon 15,596
Tiny-ImageNet 784

4.2 Comparison methods

We use common single-task [58] and multi-task [59] network architectures to train each task separately/jointly, and their experimental results serve as our single-task and multi-task baselines. Meanwhile, we compare our proposed method with other MTL methods including Cross-Stitch [45], NDDR-CNN [60], MTAL [12], LSSA [23], MCN [61] and MLwSGSU [36].

4.3 Implementation

For the compared deep neural network methods, we adjust the hidden units, learning rate, and number of training steps in each layer according to the parameter settings of the corresponding reference. In ADMTL, we adjust the hyper-parameters in the same way. Specifically, we use VGG16 as the base network and set the input to 112 × 112 × 3 and the batch size to 16. In our experiments, to better train the ADMTL network, we choose Adam as the optimizer and the rectified linear unit (ReLU) as the activation function, with an initial learning rate of 0.001 decayed by 50% every 30 iterations. All the deep learning models are implemented in PyTorch.
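A minimal sketch of this training configuration in PyTorch is shown below; how the VGG16 trunk is split into per-task heads is omitted, and the 65-class output head is only an example.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Sketch of the training setup described above (our own simplification;
# the per-task heads and the auxiliary network are omitted).
backbone = vgg16(weights=None)
backbone.classifier[-1] = nn.Linear(4096, 65)   # e.g., a 65-class task head

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)
# halve the learning rate every 30 iterations, as stated in Sec. 4.3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

x = torch.randn(16, 3, 112, 112)                # batch size 16, 112 x 112 x 3 inputs
logits = backbone(x)
```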

4.4 Comparison results

As shown in Tab.5, we report the test accuracy of each comparison method on three groups of MTL datasets whose label sets partially overlap between tasks. From the results, it can be observed that: 1) The MTL methods rank better than the single-task learning method on average across the different multi-task groups. This shows that using the relationships between tasks to capture useful interaction information can promote the effectiveness of multi-task joint learning. 2) Different MTL methods show large gaps in test accuracy due to differences in the correlation between tasks. E.g., Cross-Stitch and NDDR-CNN have the lowest accuracy rankings on the Office-Caltech dataset in Group 1, while MLwSGSU and NDDR-CNN rank lowest on the Caltech-256 dataset. 3) Since the label sets between tasks only partially overlap, the non-overlapping content may lead to large differences in the test results of different methods. E.g., MTAL ranks best in test accuracy on Group 3's Dlsr dataset, while it is lower on the Product dataset; LSSA has the second highest test accuracy on the Product dataset and the lowest on the Dlsr dataset. 4) Our proposed method significantly outperforms the other methods on all datasets and achieves the best average ranking. This result shows that extracting and transferring features from a large auxiliary task can improve the performance of the MTL model.
Tab.5 Testing accuracy of each comparing method on Exp 1, where the optimal performances are bolded. The ranking and average ranking are reported in the corresponding bracket and the last column, respectively
Methods Group 1 Group 2 Group 3 Avg Rank
Office-Caltech Caltech-256 Amazon Webcam Dlsr Product
Single-task 0.76±0.037(6) 0.45±0.037(9) 0.74±0.036(5) 0.63±0.038(9) 0.77±0.031(5) 0.63±0.035(9) 7.16
Multi-task 0.76±0.050(7) 0.53±0.055(5) 0.80±0.046(3) 0.69±0.049(3) 0.80±0.046(3) 0.68±0.049(5) 4.33
Cross-Stitch 0.75±0.050(8) 0.53±0.059(6) 0.80±0.046(3) 0.67±0.054(8) 0.79±0.042(4) 0.65±0.056(8) 6.16
NDDR-CNN 0.75±0.054(9) 0.51±0.055(8) 0.79±0.042(4) 0.67±0.044(7) 0.75±0.049(6) 0.67±0.050(6) 6.66
MTAL 0.78±0.054(5) 0.53±0.051(4) 0.81±0.042_(2) 0.69±0.048_(2) 0.80±0.038_(2) 0.66±0.055(7) 3.66
LSSA 0.80±0.058_(2) 0.54±0.373(3) 0.74±0.180(6) 0.69±0.155(5) 0.61±0.428(9) 0.73±0.337_(2) 4.50
MCN 0.79±0.341(4) 0.59±0.277_(2) 0.73±0.450(8) 0.68±0.138(6) 0.63±0.265(7) 0.70±0.272(4) 5.16
MLwSGSU 0.79±0.216(3) 0.53±0.421(7) 0.74±0.285(7) 0.69±0.103(4) 0.63±0.346(8) 0.71±0.050(3) 5.33
ADMTL 0.91±0.019(1) 0.82±0.197(1) 0.90±0.022(1) 0.81±0.133(1) 0.88±0.346(1) 0.92±0.097(1) 1.00
Tab.6 reports the test performance of each comparison method on the MTL tasks whose label sets do not overlap between tasks. From the results, it can be observed that: 1) On the Group 1, Group 2 and Group 3 datasets, most of the MTL methods are superior to the single-task learning method. 2) The test accuracy of the MTL methods varies widely between datasets within each group of MTL tasks. In particular, MLwSGSU in Group 2 ranks best on the Tiny-ImageNet dataset, but relatively low on the Webcam and Amazon datasets. This indicates that the variability of the label sets between tasks has a greater impact on model performance when they do not overlap at all. 3) Our proposed method achieves the best performance in most cases, with consistent performance rankings on each dataset, and achieves the best average ranking in the Group 1, Group 2 and Group 3 tasks. These experimental results again show that learning multiple tasks with non-overlapping label sets by introducing an auxiliary task can effectively improve the performance of the MTL model.
Tab.6 Testing accuracy of each comparing method on Exp 2, where the optimal performances are bolded. The ranking and average ranking are reported in the corresponding bracket and the last column, respectively
Methods Group 1 Group 2 Group 3 Avg Rank
Art Real World Office-Caltech Webcam Tiny-ImagNet Amazon
Single-task 0.60±0.039(7) 0.51±0.043(7) 0.76±0.037(4) 0.63±0.038(6) 0.51±0.048(7) 0.77±0.031(6) 6.16
Multi-task 0.62±0.055(5) 0.53±0.054(4) 0.78±0.050(3) 0.68±0.050(3) 0.53±0.058(5) 0.78±0.048(5) 4.16
Cross-Stitch 0.63±0.057(4) 0.51±0.056(8) 0.78±0.048_(2) 0.67±0.050(4) 0.52±0.062(6) 0.79±0.046(4) 4.66
NDDR-CNN 0.59±0.056(8) 0.52±0.056(5) 0.76±0.051(5) 0.66±0.050(5) 0.46±0.461(9) 0.79±0.044(3) 5.83
MTAL 0.61±0.056(6) 0.52±0.058(6) 0.73±0.046(6) 0.69±0.044_(2) 0.54±0.067(4) 0.80±0.042_(2) 4.33
LSSA 0.69±0.348(3) 0.57±0.712(3) 0.46±0.103(8) 0.50±0.836(8) 0.57±0.564(3) 0.68±0.125(8) 5.50
MCN 0.59±0.056(8) 0.52±0.056(5) 0.76±0.051(5) 0.66±0.050(5) 0.46±0.451(8) 0.79±0.044(3) 5.66
MLwSGSU 0.70±0.149_(2) 0.58±0.521_(2) 0.58±0.028(7) 0.63±0.493(7) 0.57±0.169_(2) 0.75±0.651(7) 4.50
ADMTL 0.88±0.026(1) 0.74±0.414(1) 0.93±0.043(1) 0.83±0.389(1) 0.91±0.029(1) 0.90±0.065(1) 1.00
Furthermore, combining Tab.5 and Tab.6, we can observe: 1) MLwSGSU achieves a better average ranking in Exp2 but a lower one in Exp1; similarly, LSSA ranks higher in Exp1 and lower in Exp2. This indicates that these two MTL methods are only applicable to one scenario. 2) Most of the test results of our proposed method on the different datasets are better than those of the other methods; in particular, it is significantly competitive on the Office-Caltech, Webcam and Product datasets in Exp1 and on the Art and Office-Caltech datasets in Exp2. 3) Our method has the best average ranking on all MTL tasks. Fig.4 shows a comparison of the mean and standard deviation of the classification accuracy of the various methods in Exp1 and Exp2. We observe that the overall performance of the ADMTL method outperforms the other methods. The above experimental results are consistent with our theoretical analysis.
Fig.4 (a) and (b) show the performance comparison of the mean and mean squared error of various methods on Exp 1 and Exp 2, respectively

4.5 Time cost comparison

As shown in Fig.5, we observe that 1) the single-task method takes the shortest time in Exp1 and Exp2, but it does not use shared information and thus has the lowest accuracy. 2) Among the MTL methods, the average time consumption of MTAL and LSSA is less than that of ADMTL, because these methods use structured pruning techniques. 3) The average time consumption and test accuracy of ADMTL are better than those of MCN and MLwSGSU, especially in Exp1. It is worth noting that the average test accuracy of ADMTL is significantly better than that of the other methods. The above results demonstrate the effectiveness of our proposed method.
Fig.5 Compare the time cost (in minutes) of MTL on Exp 1 (a) and Exp 2 (b)

4.6 Visualization of knowledge extraction

To further verify the effectiveness of knowledge extraction by the mask, we visualize it in Exp 1 and Exp 2, respectively. From Fig.6 we can observe that 1) for similar tasks (all images above the red line), there are fewer masked positions inside the object area of the sample but many outside it, especially in the feature maps in the 1st and 3rd blocks above the red line. This indicates that the mask matrix extracts suitable knowledge, which can then be better used for shared learning. 2) Similarly, for different tasks (all images below the red line), in the feature maps in the penultimate row we can clearly see that there are significantly fewer masked positions inside the aircraft area than outside it. These experimental results are consistent with our hypothesis.
Fig.6 Visual illustration of knowledge extraction. The four rows above the red line visualize the knowledge extracted by the mask in Exp 1, while the four rows below visualize the knowledge extracted by the mask in Exp 2. In each 4 × 3 block, rows 1 and 2 show the masked auxiliary tasks and row 3 shows the learning tasks

4.7 Ablation study

In this section, to further verify the effectiveness of the ADMTL network, we conduct experimental comparisons between ordinary MTL (denoted by O-MTL), ADMTL framework-based MTL without feature alignment (denoted by F-MTL), and ADMTL on Exp1 and Exp2, respectively. Tab.7 and Tab.8 report the test accuracy of these three methods. From the tables, it can be observed that: 1) The performance of F-MTL is lower than that of ADMTL and even inferior to O-MTL. The reason may be that the MTL model absorbs more negative information unfavorable to task-specific learning during training, especially in the presence of negative transfer. 2) The performance of ADMTL is significantly better than that of O-MTL and F-MTL, especially in Exp2, where its test accuracy on all tasks is optimal. These results convincingly validate that feature alignment is necessary and effective for transferring knowledge from the large auxiliary task in MTL networks.
Tab.7 Comparison of learning performance on Exp 1 among ordinary MTL (O-MTL), ADMTL framework-based MTL no feature alignment (F-MTL), and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Office-Caltech Caltech-256 Amazon Webcam Dlsr Product
O-MTL 0.76±0.050 0.53±0.05 0.80±0.04 0.69±0.04 0.80±0.04 0.68±0.04
F-MTL 0.80±0.050 0.54±0.05 0.80±0.04 0.70±0.04 0.84±0.04 0.74±0.05
ADMTL 0.91±0.01 0.82±0.19 0.90±0.22 0.81±0.02 0.88±0.34 0.92±0.09
Tab.8 Comparison of learning performance on Exp 2 among ordinary MTL (O-MTL), ADMTL framework-based MTL no feature alignment (F-MTL), and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Art Real World Office-Caltech Webcam Amazon Tiny-ImagNet
O-MTL 0.62±0.05 0.53±0.05 0.78±0.05 0.68±0.05 0.78±0.04 0.53±0.05
F-MTL 0.67±0.05 0.60±0.04 0.80±0.05 0.72±0.04 0.81±0.04 0.54±0.06
ADMTL 0.88±0.02 0.74±0.41 0.93±0.04 0.83±0.38 0.90±0.06 0.91±0.02
Tab.9 and Tab.10 report the experimental results of the MTL methods using auxiliary tasks on Exp1 and Exp2. Specifically, we compare two schemes: transferring auxiliary knowledge based on the ADMTL framework (denoted by T-ADMTL) and using ADMTL directly to transfer knowledge. From the results, we can observe that: 1) In Exp1, T-ADMTL has slightly better test accuracy than ADMTL on the Office-Caltech, Caltech-256 and Amazon datasets, while ADMTL outperforms T-ADMTL on the Webcam, Dlsr and Product datasets, especially on Product. This suggests that the use of auxiliary tasks in MTL can further improve learning performance. 2) In Exp2, ADMTL performs slightly better than T-ADMTL on the Art, Real World and Office-Caltech datasets, especially on Art, while the difference between T-ADMTL and ADMTL on the Webcam, Amazon and Tiny-ImageNet datasets is small, especially on Webcam. This shows that transferring the selected auxiliary knowledge can also improve the generalization performance of MTL. In summary, there is no significant difference between the test accuracies of T-ADMTL and ADMTL, and both can improve the learning performance of MTL.
Tab.9 Comparison of learning performance on Exp 1 between ADMTL framework-based auxiliary knowledge transfer (T-ADMTL) and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Office-Caltech Caltech-256 Amazon Webcam Dlsr Product
T-ADMTL 0.93±0.25 0.82±0.05 0.91±0.06 0.81±0.30 0.87±0.09 0.88±0.04
ADMTL 0.91±0.01 0.82±0.19 0.90±0.02 0.81±0.02 0.88±0.34 0.92±0.09
Tab.10 Comparison of learning performance on Exp 2 between ADMTL framework-based auxiliary knowledge transfer (T-ADMTL) and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Art Real World Office-Caltech Webcam Amazon Tiny-ImagNet
T-ADMTL 0.85±0.16 0.74±0.30 0.93±0.30 0.83±0.19 0.93±0.02 0.91±0.20
ADMTL 0.88±0.02 0.74±0.30 0.93±0.03 0.83±0.38 0.91±0.02 0.90±0.06
However, by comparing the number of network parameters and FLOPs in Tab.11, we find that: 1) With the identical MTL network structure, the number of parameters in all convolutional layers of ADMTL is smaller than that of T-ADMTL. 2) ADMTL also has a significantly lower number of floating-point operations in each convolutional layer than T-ADMTL. Thus, ADMTL is superior to T-ADMTL in effectiveness and practicality.
Tab.11 Comparison of the number of parameters and floating-point operations (FLOPs) between T-ADMTL and ADMTL
Layers Params (10³) FLOPs (10⁶)
T-ADMTL ADMTL T-ADMTL ADMTL
conv_1 1.72 1.71 43.35 42.92
conv_2 73.72 72.06 462.42 451.99
conv3_1 294.91 288.88 462.42 452.97
conv3_2 589.82 575.83 924.84 902.24
conv4_1 1179.64 1145.46 462.42 448.48
conv4_2 2359.29 2299.50 924.84 900.03
conv5_1 2359.29 2285.66 231.21 224.67
conv5_2 2359.29 2303.18 231.21 225.36

4.8 Model convergence analysis

In this section, as shown in Fig.7, we report the convergence of the ADMTL model on Exp 1 and Exp 2. On Exp 1, as shown in Fig.7(a), the loss curves on the Caltech-256 and Caltech-101 datasets tend to converge after about 33 and 29 iterations, respectively; Fig.7(b) shows that the loss curves on the Amazon and Webcam datasets tend to converge after about 60 iterations; Fig.7(c) shows that the loss curves on the Dlsr and Product datasets tend to converge after about 58 iterations. On Exp 2, Fig.7(d) shows that the loss curves on the Art and Real World datasets tend to converge after about 60 iterations; Fig.7(e) shows that the loss curves on the Office-Caltech and Webcam datasets tend to converge after about 60 iterations; and Fig.7(f) shows that the loss curves on the Tiny-ImageNet and Amazon datasets stabilize after about 30 and 24 iterations, respectively.
Fig.7 Illustration of the convergence of the model. Convergence curves of ADMTL on Exp 1 and Exp 2, respectively. (a) Loss curves on Caltech-256 and Office-Caltech; (b) loss curves on Amazon and Webcam; (c) loss curves on Dlsr and Product; (d) loss curves on Art and Real-world; (e) loss curves on Office-Caltech and Webcam; (f) loss curves on Amazon and Tiny-Imagnet

5 Conclusion

In this work, we provide a deep multi-task learning framework, ADMTL, for dealing with multiple tasks whose label sets partially overlap or do not overlap. Compared with previous MTL methods, ADMTL leverages a big auxiliary task to jointly learn multiple tasks with partially overlapping or non-overlapping label sets. In addition, the auxiliary strategy in ADMTL can be flexibly embedded in other deep multi-task learning or transfer learning frameworks. To evaluate the performance of ADMTL, we conduct experiments on multiple public datasets and compare it with state-of-the-art MTL methods. Experimental results show that the ADMTL framework has significant advantages. In summary, our work enriches MTL research in two aspects: 1) a novel adaptive MTL mechanism is used to deal with multiple tasks whose label sets are partially overlapping or even non-overlapping; 2) a new knowledge extraction strategy uses a set of soft mask matrices to adaptively prune the hidden neurons in the auxiliary task network, extracting specific knowledge that assists the current tasks in learning and forming a corresponding network for each task. However, in this work we have not addressed the interpretability problem in MTL; future work will focus on this issue.
