
Learning multi


1 Introduction

Multi-task learning (MTL) is an approach that exploits and transfers relevant information among tasks to help individual tasks achieve better generalization. Over the last few years, it has proved effective in many machine learning fields, such as object detection [1], image segmentation [2], image classification [3], natural language processing [4], speech recognition [5], and drug discovery [6].
Currently, most existing MTL methods assume that the learning tasks have the same label sets and use the same model [7-9], because such label sets contain abundant common knowledge that can be transferred to each task to improve the learning performance of the MTL model [10]. However, more general situations arise in the real world: each task has only a small number of training samples, and when their label sets overlap partially or not at all, there is less shared information between tasks, so learning such tasks is more challenging. To meet this challenge, in [11], authors use a modulation and gating network to automatically adjust the shared characteristics among different tasks for recommendation systems; in [12], authors learn various tasks by sharing similar convolutional kernels among multi-task networks. These methods aim to mine and exploit as much common knowledge hidden in the current tasks as possible, but for the general scenarios mentioned above they still leave room for improvement. To achieve such improvement, we re-focus on the two major issues affecting MTL.

Firstly, how to extract suitable knowledge from different tasks for the current multiple tasks. Although abundant knowledge is hidden among the tasks, it is not directly usable; how to extract the available knowledge to improve the learning performance of MTL models while avoiding the notorious negative transfer is a key issue [13]. For this reason, a large number of methods have been developed, which can be broadly classified into non-deep and deep MTL methods: 1) non-deep MTL methods build on shallow models to learn the parameters involved, e.g., in [14], authors extract useful knowledge between tasks by regularizing a task-coupled kernel function (such as a support vector machine) to predict users' product selections, and in [15], authors obtain useful knowledge between tasks by learning the same covariance matrix to predict students' test scores. 2) Deep MTL methods learn a shared representation from the individual task networks to improve their performance. E.g., in [16], authors design a feature matching network (i.e., knowledge transfer) to capture shared features in different tasks; in [17], authors use a segmented attention head module to capture useful knowledge between tasks for depth estimation; in [18], authors use a two-level graph neural network to learn useful knowledge of different tasks to improve the performance of the MTL model.

Secondly, how to design an effective MTL sharing mechanism. An effective sharing mechanism can increase the predictive performance of the MTL model by using useful knowledge between related tasks [19]. Inspired by this motivation, many classic MTL sharing mechanisms have been designed. According to whether the feature/label spaces are consistent among tasks, we can divide these mechanisms into two types: homogeneous multi-task sharing and heterogeneous multi-task sharing, as shown in Tab.1. Homogeneous multi-task sharing learns a shared representation in the same feature and label spaces. According to the sharing format, it can be divided into 1) hard sharing based: methods of this type assume that all tasks share knowledge in the same hidden space. For example, in [20], authors perform semantic segmentation and depth prediction of images by aggregating features in specific layers between tasks.
2) Soft sharing based: methods of this type assume that all task models and parameters are independent, and the distance between model parameters is regularized to obtain similar parameters for joint learning. E.g., in [21], authors use an attention mechanism to share parameters in specific layers between different tasks to identify symptoms of depression. 3) Mixed sharing based: methods of this type use a task-specific strategy to select the layers of the multi-task network model in which shared learning can be performed. Typically, in [22], authors use a specific task strategy to mix common features with the current tasks for image semantic segmentation and normal estimation.

Heterogeneous multi-task sharing learns a shared representation over tasks with heterogeneous feature or label spaces, and can be divided into: 1) Sparse sharing based: methods of this type form a sub-network appropriate for each individual task from an over-parameterized base network, and extract common knowledge from the overlapping parts of the sub-networks through a sparsity strategy. For example, in [23], authors extract shared parameters as common knowledge to learn individual tasks by applying a mask to the overlapping part of the sub-networks. 2) Gradient sharing based: methods of this type use similarity measures to quantify the gradient differences between tasks and compute non-negative weights for these tasks, thereby constructing a shared gradient. For example, in [24], authors measure the gradient difference among individual tasks by cosine distance to predict hospital mortality. 3) Hierarchical sharing based: methods of this type perform hierarchical sharing for different overlapping areas between multiple tasks. For example, in [25], authors learn common knowledge from different levels of multiple task networks for natural language processing. Unfortunately, most of the above works are designed for the scenario where the label sets are the same among tasks, rather than for the scenario where the label sets partially overlap or do not overlap at all.
Tab.1 Comparison of various MTL sharing mechanisms
SM¹ Homo² Hete³ Super⁴ Methods⁵
Hard × [20,25]
Soft × [26,27]
Mixed × [28,29]
Sparse [30,31]
Gradient × [24,32]
Hierarchical × [33,34]

¹ Sharing mechanisms. ² Homogeneous tasks. ³ Heterogeneous tasks. ⁴ Supervised learning. ⁵ Algorithms.

Fortunately, some recent work has been developed to deal with the above problems, especially the latter. For example, in [35], authors propose to simultaneously improve online handwriting prediction and character classification by combining the cross-entropy loss with distance and similarity losses. In [36], authors design a dynamic routing protocol pattern to implement slot filling and intent detection. In [37], authors propose to use a restricted softmax instead of the standard softmax for data with non-IID label distributions. However, these methods still aim to mine common knowledge hidden in all prediction tasks. Since the label sets among these tasks only partially overlap, or even have no overlap, they have less knowledge available than traditional MTL problems, and designing various learning mechanisms alone is not enough to obtain more usable common knowledge from these tasks to improve MTL model performance. Interestingly, in a recent study on domain adaptation [38], the authors argue that the generalization and effectiveness of feature representations can be further improved by transferring sufficient information from multiple source domains to the target domain. Inspired by this, we propose a new auxiliary-task-based deep MTL framework (ADMTL), which leverages a big task with sufficient information to assist the above tasks in learning efficiently. This auxiliary task not only contains abundant class information but also covers the classes of all learning tasks. Specifically, as shown in Fig.1, we directly introduce a well-trained auxiliary network with the same structure as the MTL network. Next, we design a novel knowledge selection strategy for extracting the available information from the auxiliary task network to assist each task's learning. The key idea of this strategy is to use a set of soft mask matrices to adaptively prune the neurons in the hidden layers of the auxiliary network to extract available knowledge, and to construct the corresponding task-specific network for each task. Finally, we learn the ADMTL network in an end-to-end manner. In summary, the contributions of this paper are as follows:
Fig.1 Our proposed ADMTL networks. The network consists of three identical independent task networks. On the left is the auxiliary task network, and on the right is the multi-task learning network. The blue and red arrows indicate the directions in which the knowledge of the auxiliary task is transferred to the multiple tasks, and the long light-blue box indicates shared knowledge

● A new framework is proposed to address MTL scenarios with partially overlapping and non-overlapping label sets. It assists the efficient learning of such MTL scenarios by leveraging a learned large auxiliary task with sufficiently abundant class information, without adding any hyper-parameters.
● A novel knowledge selection strategy is designed to improve the generalization performance of each task. It adaptively prunes the hidden-layer neurons in the auxiliary task network by introducing a set of soft mask matrices to extract auxiliary knowledge, and constructs a corresponding task-specific network for each task.
● Extensive experiments on multiple datasets with different settings demonstrate the significant competitiveness of our model in comparison with state-of-the-art methods.
The rest of this paper is arranged as follows. In Section 2, we briefly review related work in multi-task learning. In Section 3, we introduce the architecture of ADMTL, give the problem definition, and provide some related theoretical analysis. In Section 4, we present image classification results on benchmark datasets. Finally, we conclude in Section 5. The code is available on GitHub.

2 Related work

MTL has good performance in many applications, especially in the field of computer vision, so it has attracted a lot of attention in recent years. In this section, we briefly review the related works of MTL based on shared task features and MTL based on shared model parameters.

2.1 MTL based on shared task features

The methods of this class usually assume that a common feature representation can be learned from the individual tasks. According to their implementation manner, they can roughly be divided into three sub-types:
1) Selective sharing of task features: for tasks in the same subspace, sharing is realized by specifically regularizing the features among tasks. Typically, in [39], authors use the $\ell_2$ norm to regularize the task weight matrices to extract shared features for predicting school students' test scores. In [40], authors use the $\ell_{1,2}$ norm to regularize the weight matrices to extract shared features between tasks for learning multiple tasks with different feature dimensions. In [41], authors use the $\ell_{2,1}$ norm to regularize the weight matrices of various modal tasks to jointly select common features for multi-modal classification of Alzheimer's disease.
2) Prior knowledge sharing of tasks: for tasks defined in the same subspace, sharing is realized by using the same prior knowledge among tasks. Typically, in [42], authors embed prior knowledge (i.e., pathological images with different magnifications belong to the same subclass) into the feature extraction process among different tasks to verify the relationship between tasks and pathological image categories for fine-grained classification and pathological image classification. In [43], authors use a kind of meta data (i.e., contextual attributes) as prior knowledge to capture the relationship between different tasks for multi-task clustering. In [44], authors use the same subclass of the gland area as prior information in a convolutional neural network to guide the network inference for pathological colon image analysis.
3) Transformation sharing of task features: for the tasks represented in the same subspace, they realize sharing by performing the nonlinear transformations of the original feature representation among tasks. Typically, in [45], authors use a set of non-linearly transformed feature sharing units for image semantic segmentation and normal estimation. In [46], authors use the feature adapter to learn the non-linear transformation of the task features to automatically evaluate the child's speech ability.

2.2 MTL based on shared model parameters

The methods of this class usually associate different tasks through part of their model parameters or weights to realize sharing. According to the learning manner used, they can roughly be divided into three sub-types:
1) Weighted sharing of weight matrices: for tasks represented in the same subspace, sharing is realized by a weighted combination of the weight matrices among tasks. Typically, in [47], authors weight the weight matrices among tasks for boundary classification of keywords. In [48], authors partition the weight matrices among tasks into common and private parts, then weight the common part for multi-label classification. In [49], authors weight the weight matrices at the same spatial positions in the images and transfer them to each task for image depth estimation, segmentation, and surface normal prediction.
2) Common factor sharing via decomposing individual weight matrices: the weight matrix of each task model is decomposed into private and common parts, where the common part is used for sharing. Typically, in [50], authors decompose the weight matrices of multiple task models into common and private parts, and further use the common part for visual target tracking. In [51], authors sparsely decompose the parameter tensor of the prediction model into multiple parameter matrices, and linearly combine the corresponding parameter matrices into a set of base matrices for sharing. In [52], authors decompose a collective matrix of drug-disease correlations to share the correlation matrix between them for drug discovery.
3) Low-rank structure sharing of model weight matrices: for tasks represented in the same subspace, the low-rank structure of the weight matrices among tasks is captured by specific regularization to realize sharing. Typically, in [53], authors flatten the feature tensors of different tasks (i.e., a convex combination of matrix trace norms) to capture their low-rank structure for multi-task learning. In [54], authors use a set of low-rank matrices to capture the potential relationships between multiple tasks for Parkinson's disease diagnosis. In [55], authors use a set of low-rank matrices constrained by the nuclear norm for target detection in hyper-spectral images.

Our work follows the first line of research in that it extracts available knowledge from a well-trained large task model to assist in improving the predictive performance of the MTL model. First, we directly introduce a trained auxiliary large-task network with the same structure as the MTL network. Then, we use a set of soft mask matrices to automatically extract available knowledge from the well-trained auxiliary task network and build a specific network corresponding to each task. Finally, end-to-end cross-task learning is performed on the multi-task networks.

3 Our method

In this section, we propose to leverage a big auxiliary task with rich label and class information to solve the MTL problem in which the label sets between tasks only partially overlap or even do not overlap. We first introduce the problem setting of MTL. Then, we describe our method in detail according to Fig.2. The entire ADMTL network is learned adaptively without adding any hyper-parameters.
Fig.2 Illustration of the ADMTL network. In the auxiliary network, the different colored cubes denote the knowledge extracted in each convolutional layer. In the multi-task learning network, the different colored filled circles denote neurons, while the dashed circles are the pruned neurons. $\odot$ denotes the Hadamard product

3.1 Problem formulation

Given a big auxiliary task $T_{aux}$ and a dataset $D_{aux}=\{x_i,y_i\}_{i=1}^{N}$ containing $N$ samples, with $x_i\in\mathbb{R}^d$ and its associated label $y_i\in\{1,\dots,c\}$, where $d$ and $c$ are the numbers of dimensions and classes in $D_{aux}$, respectively. Meanwhile, we are given $M$ individual tasks $\{T_j\}_{j=1}^{M}$ and corresponding training datasets $D_j=\{x_k^j,y_k^j\}_{k=1}^{N_j}$ with $N_j$ samples, $x_k^j\in\mathbb{R}^d$ and associated label $y_k^j\in\{1,\dots,c_j\}$, where $c_j$ is the number of classes in $D_j$. We assume that there are $L$ convolutional layers in the auxiliary network, where the feature maps in the $l$th layer are denoted as $F_{aux}^{l}=\{f_{aux,T_1}^{l},f_{aux,T_2}^{l},\dots,f_{aux,T_M}^{l}\}$, and the feature maps and convolutional kernels of the corresponding $l$th layer in the multi-task network are $F_{multi}^{l}=\{f_{T_1}^{l},f_{T_2}^{l},\dots,f_{T_M}^{l}\}$ and $W_{multi}^{l}=\{w_{T_1}^{l},w_{T_2}^{l},\dots,w_{T_M}^{l}\}$, where $f_{aux,T_1}^{l},f_{T_1}^{l}\in\mathbb{R}^{W\times H}$, and $W$, $H$ denote the width and height of the feature maps, respectively. Furthermore, we assume that the class set $C_{T_{aux}}$ of the auxiliary task contains the classes of all individual tasks, namely, $C_{T_{aux}}=C_{T_1}\cup\cdots\cup C_{T_M}$, where $C_{T_i}$ and $C_{T_j}$ ($i\neq j$) can partially overlap, or even not overlap at all. This makes ADMTL applicable under more general settings than most existing MTL methods.
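To make the label-set assumption concrete, the following minimal Python sketch (not taken from the paper's code; the class names are hypothetical) checks that an auxiliary class set covers several tasks whose own label sets overlap only partially or not at all:

```python
# Minimal sketch of the Sec. 3.1 assumption; class names are hypothetical.
C_T1 = {"backpack", "bike", "keyboard", "monitor"}         # task 1 label set
C_T2 = {"keyboard", "monitor", "mug", "projector"}         # partially overlaps task 1
C_T3 = {"airplane", "butterfly"}                           # overlaps neither task
C_aux = C_T1 | C_T2 | C_T3 | {"many", "other", "classes"}  # auxiliary class set C_Taux

# the auxiliary task must cover every individual task's classes
assert all(C <= C_aux for C in (C_T1, C_T2, C_T3))
print("T1 ∩ T2:", C_T1 & C_T2)  # partial overlap
print("T1 ∩ T3:", C_T1 & C_T3)  # empty set: non-overlapping label sets
```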

3.2 Selective transfer of knowledge

The key to MTL is to improve the predictive performance of each task by using common knowledge among correlated tasks [56]. When the label sets in MTL overlap only partially or not at all, less correlated knowledge can be shared among tasks, which hinders the effective learning of MTL models. To address this issue, we assist the learning of these tasks by using the abundant class information in a learned big task. Specifically, on the one hand, when the label sets in MTL only partially overlap, we enable better shared learning by selectively transferring more suitable knowledge from the big task. On the other hand, when the label sets in MTL do not overlap, we also enhance the generalization of the feature representations for cross-task learning by selectively transferring more available knowledge from the big task. As shown in Fig.3, we design an auxiliary knowledge selection strategy that aims to extract auxiliary knowledge for each task and construct corresponding task-specific networks. To this end, we directly introduce a set of soft mask matrices
$$M^{l}=\begin{bmatrix} m_{11}^{l} & \cdots & m_{1W}^{l}\\ \vdots & m_{hw}^{l} & \vdots\\ m_{H1}^{l} & \cdots & m_{HW}^{l} \end{bmatrix}$$
in the $l$th layer of the trained auxiliary network and initialize them randomly following [57]. Then, we normalize $M^{l}$ to ensure its values lie between 0 and 1, as follows:
Fig.3 Illustration of auxiliary knowledge extraction. The green square in the middle indicates the feature map in the auxiliary network in the lth convolutional layer. The squares in the upper left and right corners indicate the soft mask matrices corresponding to the specific task, while the squares in the lower left and right corners indicate the extracted auxiliary feature maps

$$\hat{m}_{hw}^{l}=\frac{m_{hw}^{l}-\min(M^{l})}{\max(M^{l})-\min(M^{l})},$$
where $\hat{m}_{hw}^{l}$ denotes the value of the element in row $h$ and column $w$ of the normalized matrix $\hat{M}^{l}$.
The purpose of introducing the large auxiliary task is to transfer its rich knowledge to the ADMTL network and assist its learning. However, due to the overwhelming amount of information in this task, transferring it directly and indiscriminately to the smaller individual tasks often brings unnecessary redundant information, which degrades the performance of the individual task networks. Thus, we adopt a selective way to extract the auxiliary knowledge from the auxiliary network. Specifically, we sparsify $\hat{M}^{l}$ probabilistically, as follows:
$$\tilde{m}_{hw}^{l}=\begin{cases} m_{hw}^{l}, & \mathrm{rand}<\hat{m}_{hw}^{l},\\ 0, & \mathrm{rand}\geq\hat{m}_{hw}^{l}, \end{cases}$$
where $\mathrm{rand}$ denotes a randomly generated threshold in the range 0–1, and $\tilde{m}_{hw}^{l}$ denotes the element in row $h$ and column $w$ of the sparse matrix $\tilde{M}^{l}$. Among all elements of $\tilde{M}^{l}$, we retain the elements for which $\mathrm{rand}<\hat{m}_{hw}^{l}$ and set the elements with $\mathrm{rand}\geq\hat{m}_{hw}^{l}$ to 0. In this way, the higher an element value in $\tilde{M}^{l}$, the more important the extracted knowledge is for the current task and the more effective its assistance; conversely, the knowledge is unnecessary.
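A minimal PyTorch sketch of this normalization and probabilistic sparsification is given below; it is our own illustration rather than the authors' code, and the mask shape and the small epsilon added to the denominator are our assumptions.

```python
import torch

def sparsify_soft_mask(M: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a soft mask M^l and keep each original element with
    probability equal to its normalized value, zeroing it otherwise."""
    M_hat = (M - M.min()) / (M.max() - M.min() + 1e-12)  # normalized mask in [0, 1]
    rand = torch.rand_like(M_hat)                        # one random threshold per element
    return torch.where(rand < M_hat, M, torch.zeros_like(M))

# usage: a randomly initialized H x W mask for one feature map
M_l = torch.randn(7, 7)
M_tilde_l = sparsify_soft_mask(M_l)
```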
Next, we use the sparse matrix $\tilde{M}_{T_j}^{l}$ to extract the auxiliary knowledge for task-specific learning from the $l$th convolutional layer of the auxiliary network, as follows:
$$\tilde{F}_{T_j}^{l}=\tilde{M}_{T_j}^{l}\odot f_{aux,T_j}^{l},$$
where $f_{aux,T_j}^{l}$ denotes the feature maps used for task $T_j$ at layer $l$ in the auxiliary network, $\tilde{F}_{T_j}^{l}$ denotes the auxiliary features of task $T_j$ in the $l$th layer, and $\odot$ denotes the Hadamard product.
As discussed above, in order to alleviate the lack of correlated knowledge caused by partially overlapping or non-overlapping label sets in MTL, we transfer the extracted knowledge to the task-specific network to help train it, as follows:
$$\hat{F}_{T_j}^{l}=\tilde{F}_{T_j}^{l}\oplus f_{T_j}^{l},$$
where $f_{T_j}^{l}$ denotes the feature maps of task $T_j$ in layer $l$ of the ADMTL network, $\hat{F}_{T_j}^{l}$ denotes the features of task $T_j$ transferred in layer $l$ of that network, and $\oplus$ denotes the element-wise combination of the auxiliary and task feature maps.
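The sketch below illustrates the extraction and transfer steps, assuming (our assumption, since the text does not name the fusion operator explicitly) that the transferred features are fused with the task's own features by element-wise addition:

```python
import torch

def extract_and_transfer(M_tilde: torch.Tensor,
                         f_aux: torch.Tensor,
                         f_task: torch.Tensor) -> torch.Tensor:
    """Gate the auxiliary feature map with the sparse mask (Hadamard product),
    then fuse the result with the task's own feature map.
    The additive fusion below is an assumption, not the paper's stated operator."""
    F_tilde = M_tilde * f_aux   # auxiliary knowledge selected for task T_j
    F_hat = F_tilde + f_task    # features fed into the task-specific network
    return F_hat

# usage with illustrative H x W feature maps
H, W = 7, 7
F_hat = extract_and_transfer(torch.rand(H, W), torch.randn(H, W), torch.randn(H, W))
```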

3.3 Learning across-task

In this section, we improve the generalization performance of the ADMTL network through knowledge sharing. Inspired by [45], we adopt a cross-stitch-like unit to achieve cross-task learning, as follows:
$$\begin{bmatrix} \tilde{F}_{T_1}^{l}\\ \tilde{F}_{T_2}^{l}\\ \vdots\\ \tilde{F}_{T_M}^{l} \end{bmatrix}=\begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1M}\\ \vdots & \ddots & \vdots\\ \lambda_{M1} & \cdots & \lambda_{MM} \end{bmatrix}\begin{bmatrix} \hat{F}_{T_1}^{l}\\ \hat{F}_{T_2}^{l}\\ \vdots\\ \hat{F}_{T_M}^{l} \end{bmatrix},$$
where the left-hand side denotes the shared features in the $l$th layer of the ADMTL network, and the matrix of $\lambda$ values denotes the shared parameters of layer $l$, which are updated by the back-propagation algorithm of the ADMTL network.
Finally, we construct the corresponding task-specific network by multiplying the activations $\tilde{F}_{T_j}^{l}$ with $w_{T_j}^{l}$, as follows:
$$F_{T_j}^{l+1}=\sigma\left(w_{T_j}^{l}\otimes\tilde{F}_{T_j}^{l}+b_{T_j}^{l}\right),$$
where $F_{T_j}^{l+1}$ denotes the input to layer $l+1$ of the task $T_j$ network, $w_{T_j}^{l}$ is the convolution kernel of this layer, $\otimes$ denotes element-wise multiplication, $\sigma$ is the activation function, and $b_{T_j}^{l}$ is the bias vector.
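The following PyTorch module is a rough sketch of how such a cross-task sharing layer could look: a learnable $M \times M$ matrix of $\lambda$ values mixes the per-task features, and each mixed result is passed through that task's own layer. For simplicity the per-task operation is written as a standard convolution followed by ReLU, which is our simplification of the element-wise form above; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CrossTaskShare(nn.Module):
    """Cross-stitch-style sharing across M tasks at one layer (a sketch)."""
    def __init__(self, num_tasks: int, channels: int):
        super().__init__()
        # lambda matrix, initialized to the identity and updated by backprop
        self.lambdas = nn.Parameter(torch.eye(num_tasks))
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_tasks)]
        )

    def forward(self, F_hat):
        # F_hat: list of M tensors, each of shape (B, C, H, W)
        stacked = torch.stack(F_hat, dim=0)                  # (M, B, C, H, W)
        mixed = torch.einsum("ij,jbchw->ibchw", self.lambdas, stacked)
        # per-task transformation produces the input to layer l+1
        return [torch.relu(conv(mixed[i])) for i, conv in enumerate(self.convs)]

# usage: two tasks with 64-channel feature maps
share = CrossTaskShare(num_tasks=2, channels=64)
outs = share([torch.randn(4, 64, 28, 28), torch.randn(4, 64, 28, 28)])
```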

3.4 Objective function

In the ADMTL network, the objective function for task $T_j$ can be formulated as the cross-entropy loss
$$\mathcal{L}_{T_j}=-\sum_{k=1}^{N_j}y_{k}^{T_j}\log\left(\hat{y}_{k}^{T_j}\right),$$
where $\hat{y}_{k}^{T_j}$ is the predicted output.
Finally, we define the total objective function of the whole network as
$$\mathcal{L}_{total}=\sum_{j=1}^{M}\mathcal{L}_{T_j}+\left\|M\right\|_{2},$$
where $\|\cdot\|_{2}$ is the $\ell_2$-norm.
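As an illustration, the sketch below computes this objective in PyTorch. It reads the regularization term as the $\ell_2$-norm of the soft mask matrices and leaves it unweighted, since the text introduces no trade-off coefficient; both readings are our assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_per_task, targets_per_task, masks):
    """Sum of per-task cross-entropy losses plus an l2 penalty on the soft masks."""
    ce = sum(F.cross_entropy(logits, y)
             for logits, y in zip(logits_per_task, targets_per_task))
    reg = torch.norm(torch.cat([m.flatten() for m in masks]), p=2)
    return ce + reg

# usage: two tasks (10 and 5 classes), batch size 4, two 7x7 masks
loss = total_loss(
    [torch.randn(4, 10), torch.randn(4, 5)],
    [torch.randint(0, 10, (4,)), torch.randint(0, 5, (4,))],
    [torch.rand(7, 7), torch.rand(7, 7)],
)
```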
The whole process of the proposed method to solve ADMTL is summarized in Algorithm 1.

4 Experiments

In this section, we report the results on multiple datasets to validate the effectiveness of the proposed method.

4.1 Experimental settings

Datasets We conduct experiments on 5 datasets, including:
ImageNet (see the image-net.org website) is currently the largest computer vision dataset. It contains 14,197,122 images and 21,841 synset indexes, mainly covering: amphibian, animal, application, bird, covering, device, fabric, fish, flower, food, fruit, fungus, furniture, geological formation, invertebrate, mammal, musical instrument, plant, reptile, sport, structure, tool, tree, utensil, vector, vehicle and person. Each synset provides an average of 1,000 images, and each image is quality-controlled and human-annotated.
Office-Caltech (see the people.eecs.berkeley.edu website) contains the Office-Caltech 10 and Office-Caltech 31 datasets, each of which has a total of 2,533 samples, and consists of images from three different databases: Caltech, Amazon and Webcam; the smallest image is 200 × 150 and the largest is 900 × 557.
Office-Home (see the hemanthdv.org website) consists of four image subsets from different domains: Art, Clipart, Product and Real-World. Each subset has 65 categories, the dataset contains 15,500 images, and image sizes range from 117 × 85 to 4,384 × 2,686.
Caltech-256 (see the vision.caltech.edu website) is a very challenging dataset collected from Google Images, with all images that do not match their category manually filtered out. The dataset has 256 object categories and contains a total of 30,607 images. The minimum number of images for any category is increased from 31 to 80; the dataset also avoids artifacts due to image rotation and introduces a new, larger clutter class to test background suppression.
Tiny-ImageNet (see the kaggle.com/c/tiny-imagenet website) is a balanced and regular image classification dataset provided by Stanford University. It has 200 balanced categories, each with 500 training images, 50 validation images and 50 test images, and all images have the same size. In addition, the dataset provides label categories and bounding boxes. The information on the above datasets is shown in Tab.2. Furthermore, we split 70% of the data into the training set and the remaining 30% into the test set.
Tab.2 Characteristics of the experimental datasets
Data set Train Test Classes
ImageNet 9,800,000 4,200,000 1,000
Office-Caltech 22,752 372 31
Office-Home 15,700 845 65
Caltech-256 1,512 648 256
Tiny-ImageNet 100,000 20,000 200
In order to verify the generalization and effectiveness of the proposed method, we conducted two sets of multi-task learning experiments with the following setup:
Auxiliary task construction Due to the rich characteristics of the ImageNet dataset, we select it as the dataset for the auxiliary task in our experiments and make it contain the category information of all learning tasks. In addition, we pre-train the auxiliary network with the same structure as the MTL network. In the experiments, we extract the corresponding auxiliary knowledge for each task directly from this auxiliary network to assist the joint learning of multiple tasks with partially overlapping or non-overlapping label sets.
Exp1 We construct an MTL task with partially overlapping label sets. The experiment consists of three groups of MTL classification tasks, each with two tasks, including Office-Caltech and Caltech-256, Amazon and Webcam, Dlsr and Product. There are only 10 classes overlapping in the label sets between each group of tasks. Details of this experiment are shown in Tab.3.
Tab.3 Summary statistics of the datasets with partially overlapping label sets
Data set Features Overlapping classes
Office-Caltech 30,000–501,300 10
Office-Home 9,945–11,775,424 10
Amazon 15,596 10
Webcam 30,000 10
Dlsr 30,000 10
Product 9,945 10
Exp2 We also construct an MTL task with non-overlapping label sets. The experiment consists of three groups of MTL classification tasks, each with two tasks, including Art and Real World, Caltech-101 and Webcam, Amazon and Tiny-ImageNet. The label sets between each group of tasks do not overlap. Details of this experiment are shown in Tab.4.
Tab.4 Summary statistics of the datasets with non-overlapping label sets
Data set Features Overlapping classes
Art 9,945
Real World 9,945
Office-Caltech 30,000–501,300
Webcam 30,000
Amazon 15,596
Tiny-ImageNet 784

4.2 Comparison methods

We use common single-task [58] and multi-task [59] network architectures to train each task separately/jointly, and their experimental results serve as our single-task and multi-task baselines. Meanwhile, we compare our proposed method with other MTL methods including Cross-Stitch [45], NDDR-CNN [60], MTAL [12], LSSA [23], MCN [61] and MLwSGSU [36].

4.3 Implementation

For the compared deep neural network methods, we adjust the hidden units, learning rate, and number of training steps in each layer according to the parameter settings of the corresponding reference. In ADMTL, we adjust the hyper-parameters in the same way. Specifically, we use VGG16 as the base network and set the input to 112 × 112 × 3 and the batch size to 16. In our experiments, to better train the ADMTL network, we choose Adam as the optimizer and the rectified linear unit (ReLU) as the activation function, with an initial learning rate of 0.001 decayed by 50% every 30 iterations. All the deep learning models are implemented in PyTorch.
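A minimal sketch of this training configuration in PyTorch is shown below; how the VGG16 trunk is split into per-task heads is omitted, and the 65-class output head is only an example.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Sketch of the training setup described above (our own simplification;
# the per-task heads and the auxiliary network are omitted).
backbone = vgg16(weights=None)
backbone.classifier[-1] = nn.Linear(4096, 65)   # e.g., a 65-class task head

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)
# halve the learning rate every 30 iterations, as stated in Sec. 4.3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

x = torch.randn(16, 3, 112, 112)                # batch size 16, 112 x 112 x 3 inputs
logits = backbone(x)
```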

4.4 Comparison results

As shown in Tab.5, we report the test accuracy of each comparison method on three groups of MTL datasets whose label sets partially overlap between tasks. From the results, it can be observed that: 1) The MTL methods rank better than the single-task learning method on average across the different multi-task groups. This shows that using the relationships between tasks to capture useful interaction information can promote the effectiveness of multi-task joint learning. 2) Different MTL methods show large gaps in test accuracy due to differences in the correlation between tasks. E.g., Cross-Stitch and NDDR-CNN have the lowest accuracy rankings on the Office-Caltech dataset in Group 1, while MLwSGSU and NDDR-CNN rank lowest on the Caltech-256 dataset. 3) Since the label sets between tasks only partially overlap, the non-overlapping content may lead to large differences in the test results of different methods. E.g., MTAL ranks best in test accuracy on Group 3's Dlsr dataset, while it is lower on the Product dataset; LSSA has the second highest test accuracy on the Product dataset and the lowest on the Dlsr dataset. 4) Our proposed method significantly outperforms the other methods on all datasets and achieves the best average ranking. This result shows that extracting and transferring features from a large auxiliary task can improve the performance of the MTL model.
Tab.5 Testing accuracy of each comparing method on Exp 1, where the optimal performances are bolded. The ranking and average ranking are reported in the corresponding bracket and the last column, respectively
Methods Group 1 Group 2 Group 3 Avg Rank
Office-Caltech Caltech-256 Amazon Webcam Dlsr Product
Single-task 0.76±0.037(6) 0.45±0.037(9) 0.74±0.036(5) 0.63±0.038(9) 0.77±0.031(5) 0.63±0.035(9) 7.16
Multi-task 0.76±0.050(7) 0.53±0.055(5) 0.80±0.046(3) 0.69±0.049(3) 0.80±0.046(3) 0.68±0.049(5) 4.33
Cross-Stitch 0.75±0.050(8) 0.53±0.059(6) 0.80±0.046(3) 0.67±0.054(8) 0.79±0.042(4) 0.65±0.056(8) 6.16
NDDR-CNN 0.75±0.054(9) 0.51±0.055(8) 0.79±0.042(4) 0.67±0.044(7) 0.75±0.049(6) 0.67±0.050(6) 6.66
MTAL 0.78±0.054(5) 0.53±0.051(4) 0.81±0.042_(2) 0.69±0.048_(2) 0.80±0.038_(2) 0.66±0.055(7) 3.66
LSSA 0.80±0.058_(2) 0.54±0.373(3) 0.74±0.180(6) 0.69±0.155(5) 0.61±0.428(9) 0.73±0.337_(2) 4.50
MCN 0.79±0.341(4) 0.59±0.277_(2) 0.73±0.450(8) 0.68±0.138(6) 0.63±0.265(7) 0.70±0.272(4) 5.16
MLwSGSU 0.79±0.216(3) 0.53±0.421(7) 0.74±0.285(7) 0.69±0.103(4) 0.63±0.346(8) 0.71±0.050(3) 5.33
ADMTL 0.91±0.019(1) 0.82±0.197(1) 0.90±0.022(1) 0.81±0.133(1) 0.88±0.346(1) 0.92±0.097(1) 1.00
Tab.6 reports the test performance of each comparison method on the MTL tasks whose label sets do not overlap between tasks. From the results, it can be observed that: 1) On the Group 1, Group 2 and Group 3 datasets, most of the MTL methods are superior to the single-task learning method. 2) The test accuracy of the MTL methods varies widely between datasets within each group of MTL tasks. In particular, MLwSGSU in Group 2 ranks best on the Tiny-ImageNet dataset, but relatively low on the Webcam and Amazon datasets. This indicates that the variability of the label sets between tasks has a greater impact on model performance when they do not overlap at all. 3) Our proposed method achieves the best performance in most cases, with consistent performance rankings on each dataset, and achieves the best average ranking in the Group 1, Group 2 and Group 3 tasks. These experimental results again show that learning multiple tasks with non-overlapping label sets by introducing an auxiliary task can effectively improve the performance of the MTL model.
Tab.6 Testing accuracy of each comparing method on Exp 2, where the optimal performances are bolded. The ranking and average ranking are reported in the corresponding bracket and the last column, respectively
Methods Group 1 Group 2 Group 3 Avg Rank
Art Real World Office-Caltech Webcam Tiny-ImagNet Amazon
Single-task 0.60±0.039(7) 0.51±0.043(7) 0.76±0.037(4) 0.63±0.038(6) 0.51±0.048(7) 0.77±0.031(6) 6.16
Multi-task 0.62±0.055(5) 0.53±0.054(4) 0.78±0.050(3) 0.68±0.050(3) 0.53±0.058(5) 0.78±0.048(5) 4.16
Cross-Stitch 0.63±0.057(4) 0.51±0.056(8) 0.78±0.048_(2) 0.67±0.050(4) 0.52±0.062(6) 0.79±0.046(4) 4.66
NDDR-CNN 0.59±0.056(8) 0.52±0.056(5) 0.76±0.051(5) 0.66±0.050(5) 0.46±0.461(9) 0.79±0.044(3) 5.83
MTAL 0.61±0.056(6) 0.52±0.058(6) 0.73±0.046(6) 0.69±0.044_(2) 0.54±0.067(4) 0.80±0.042_(2) 4.33
LSSA 0.69±0.348(3) 0.57±0.712(3) 0.46±0.103(8) 0.50±0.836(8) 0.57±0.564(3) 0.68±0.125(8) 5.50
MCN 0.59±0.056(8) 0.52±0.056(5) 0.76±0.051(5) 0.66±0.050(5) 0.46±0.451(8) 0.79±0.044(3) 5.66
MLwSGSU 0.70±0.149_(2) 0.58±0.521_(2) 0.58±0.028(7) 0.63±0.493(7) 0.57±0.169_(2) 0.75±0.651(7) 4.50
ADMTL 0.88±0.026(1) 0.74±0.414(1) 0.93±0.043(1) 0.83±0.389(1) 0.91±0.029(1) 0.90±0.065(1) 1.00
Furthermore, combining Tab.5 and Tab.6, we can observe: 1) MLwSGSU achieves a better average ranking in Exp2 but a lower one in Exp1; similarly, LSSA ranks higher in Exp1 and lower in Exp2. This indicates that these two MTL methods are only applicable to one scenario. 2) Most of the test results of our proposed method on the different datasets are better than those of the other methods; in particular, it is significantly competitive on the Office-Caltech, Webcam and Product datasets in Exp1 and on the Art and Office-Caltech datasets in Exp2. 3) Our method has the best average ranking on all MTL tasks. Fig.4 shows a comparison of the mean and standard deviation of the classification accuracy of the various methods in Exp1 and Exp2. We observe that the overall performance of the ADMTL method outperforms the other methods. The above experimental results are consistent with our theoretical analysis.
Fig.4 (a) and (b) show the performance comparison of the mean and mean squared error of various methods on Exp 1 and Exp 2, respectively

4.5 Time cost comparison

As shown in Fig.5, we observe that 1) the single-task method takes the shortest time in Exp1 and Exp2, but it does not use shared information and thus has the lowest accuracy. 2) Among the MTL methods, the average time consumption of MTAL and LSSA is less than that of ADMTL, because these methods use structured pruning techniques. 3) The average time consumption and test accuracy of ADMTL are better than those of MCN and MLwSGSU, especially in Exp1. It is worth noting that the average test accuracy of ADMTL is significantly better than that of the other methods. The above results demonstrate the effectiveness of our proposed method.
Fig.5 Compare the time cost (in minutes) of MTL on Exp 1 (a) and Exp 2 (b)

4.6 Visualization of knowledge extraction

To further verify the effectiveness of knowledge extraction by the mask, we visualize it in Exp 1 and Exp 2, respectively. From Fig.6 we can observe that 1) for similar tasks (all images above the red line), there are fewer masked positions inside the object area of the sample but many outside it, especially in the feature maps in the 1st and 3rd blocks above the red line. This indicates that the mask matrix extracts suitable knowledge, which can then be better used for shared learning. 2) Similarly, for different tasks (all images below the red line), in the feature maps in the penultimate row we can clearly see that there are significantly fewer masked positions inside the aircraft area than outside it. These experimental results are consistent with our hypothesis.
Fig.6 Visual illustration of knowledge extraction. The four rows above the red line visualize the knowledge extracted by the mask in Exp 1, while the four rows below visualize the knowledge extracted by the mask in Exp 2. In each 4 × 3 block, rows 1 and 2 show the masked auxiliary tasks and row 3 shows the learning tasks

4.7 Ablation study

In this section, to further verify the effectiveness of the ADMTL network, we conduct experimental comparisons between ordinary MTL (denoted by O-MTL), ADMTL framework-based MTL without feature alignment (denoted by F-MTL), and ADMTL on Exp1 and Exp2, respectively. Tab.7 and Tab.8 report the test accuracy of these three methods. From the tables, it can be observed that: 1) The performance of F-MTL is lower than that of ADMTL and even inferior to O-MTL. The reason may be that the MTL model absorbs more negative information unfavorable to task-specific learning during training, especially in the presence of negative transfer. 2) The performance of ADMTL is significantly better than that of O-MTL and F-MTL, especially in Exp2, where its test accuracy on all tasks is optimal. These results convincingly validate that feature alignment is necessary and effective for transferring knowledge from the large auxiliary task in MTL networks.
Tab.7 Comparison of learning performance on Exp 1 among ordinary MTL (O-MTL), ADMTL framework-based MTL no feature alignment (F-MTL), and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Office-Caltech Caltech-256 Amazon Webcam Dlsr Product
O-MTL 0.76±0.050 0.53±0.05 0.80±0.04 0.69±0.04 0.80±0.04 0.68±0.04
F-MTL 0.80±0.050 0.54±0.05 0.80±0.04 0.70±0.04 0.84±0.04 0.74±0.05
ADMTL 0.91±0.01 0.82±0.19 0.90±0.22 0.81±0.02 0.88±0.34 0.92±0.09
Tab.8 Comparison of learning performance on Exp 2 among ordinary MTL (O-MTL), ADMTL framework-based MTL no feature alignment (F-MTL), and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Art Real World Office-Caltech Webcam Amazon Tiny-ImagNet
O-MTL 0.62±0.05 0.53±0.05 0.78±0.05 0.68±0.05 0.78±0.04 0.53±0.05
F-MTL 0.67±0.05 0.60±0.04 0.80±0.05 0.72±0.04 0.81±0.04 0.54±0.06
ADMTL 0.88±0.02 0.74±0.41 0.93±0.04 0.83±0.38 0.90±0.06 0.91±0.02
Tab.9 and Tab.10 report the experimental results of the MTL methods using auxiliary tasks on Exp1 and Exp2. Specifically, we compare two schemes: transferring auxiliary knowledge based on the ADMTL framework (denoted by T-ADMTL) and using ADMTL directly to transfer knowledge. From the results, we can observe that: 1) In Exp1, T-ADMTL has slightly better test accuracy than ADMTL on the Office-Caltech, Caltech-256 and Amazon datasets, while ADMTL outperforms T-ADMTL on the Webcam, Dlsr and Product datasets, especially on Product. This suggests that the use of auxiliary tasks in MTL can further improve learning performance. 2) In Exp2, ADMTL performs slightly better than T-ADMTL on the Art, Real World and Office-Caltech datasets, especially on Art, while the difference between T-ADMTL and ADMTL on the Webcam, Amazon and Tiny-ImageNet datasets is small, especially on Webcam. This shows that transferring the selected auxiliary knowledge can also improve the generalization performance of MTL. In summary, there is no significant difference between the test accuracies of T-ADMTL and ADMTL, and both can improve the learning performance of MTL.
Tab.9 Comparison of learning performance on Exp 1 between ADMTL framework-based auxiliary knowledge transfer (T-ADMTL) and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Office-Caltech Caltech-256 Amazon Webcam Dlsr Product
T-ADMTL 0.93±0.25 0.82±0.05 0.91±0.06 0.81±0.30 0.87±0.09 0.88±0.04
ADMTL 0.91±0.01 0.82±0.19 0.90±0.02 0.81±0.02 0.88±0.34 0.92±0.09
Tab.10 Comparison of learning performance on Exp 2 between ADMTL framework-based auxiliary knowledge transfer (T-ADMTL) and ADMTL. The optimal performances are bolded
Methods Group 1 Group 2 Group 3
Art Real World Office-Caltech Webcam Amazon Tiny-ImagNet
T-ADMTL 0.85±0.16 0.74±0.30 0.93±0.30 0.83±0.19 0.93±0.02 0.91±0.20
ADMTL 0.88±0.02 0.74±0.30 0.93±0.03 0.83±0.38 0.91±0.02 0.90±0.06
However, by comparing the number of network parameters and FLOPs in Tab.11, we find that: 1) With the identical MTL network structure, the number of parameters in all convolutional layers of ADMTL is smaller than that of T-ADMTL. 2) ADMTL also has a significantly lower number of floating-point operations in each convolutional layer than T-ADMTL. Thus, ADMTL is superior to T-ADMTL in effectiveness and practicality.
Tab.11 Comparison of the number of parameters and floating-point operations (FLOPs) between T-ADMTL and ADMTL
Layers Params (10³) FLOPs (10⁶)
T-ADMTL ADMTL T-ADMTL ADMTL
conv_1 1.72 1.71 43.35 42.92
conv_2 73.72 72.06 462.42 451.99
conv3_1 294.91 288.88 462.42 452.97
conv3_2 589.82 575.83 924.84 902.24
conv4_1 1179.64 1145.46 462.42 448.48
conv4_2 2359.29 2299.50 924.84 900.03
conv5_1 2359.29 2285.66 231.21 224.67
conv5_2 2359.29 2303.18 231.21 225.36

4.8 Model convergence analysis

In this section, as shown in Fig.7, we report the convergence of the ADMTL model on Exp 1 and Exp 2. On Exp 1, as shown in Fig.7(a), the loss curves on the Caltech-256 and Caltech-101 datasets tend to converge after about 33 and 29 iterations, respectively; Fig.7(b) shows that the loss curves on the Amazon and Webcam datasets tend to converge after about 60 iterations; Fig.7(c) shows that the loss curves on the Dlsr and Product datasets tend to converge after about 58 iterations. On Exp 2, Fig.7(d) shows that the loss curves on the Art and Real World datasets tend to converge after about 60 iterations; Fig.7(e) shows that the loss curves on the Office-Caltech and Webcam datasets tend to converge after about 60 iterations; and Fig.7(f) shows that the loss curves on the Tiny-ImageNet and Amazon datasets stabilize after about 30 and 24 iterations, respectively.
Fig.7 Illustration of the convergence of the model. Convergence curves of ADMTL on Exp 1 and Exp 2, respectively. (a) Loss curves on Caltech-256 and Office-Caltech; (b) loss curves on Amazon and Webcam; (c) loss curves on Dlsr and Product; (d) loss curves on Art and Real-world; (e) loss curves on Office-Caltech and Webcam; (f) loss curves on Amazon and Tiny-Imagnet

5 Conclusion

In this work, we provide a deep multi-task learning framework, ADMTL, for dealing with multiple tasks whose label sets partially overlap or do not overlap. Compared with previous MTL methods, ADMTL leverages a big auxiliary task to jointly learn multiple tasks with partially overlapping or non-overlapping label sets. In addition, the auxiliary strategy in ADMTL can be flexibly embedded in other deep multi-task learning or transfer learning frameworks. To evaluate the performance of ADMTL, we conduct experiments on multiple public datasets and compare it with state-of-the-art MTL methods. Experimental results show that the ADMTL framework has significant advantages. In summary, our work enriches MTL research in two aspects: 1) a novel adaptive MTL mechanism is used to deal with multiple tasks whose label sets are partially overlapping or even non-overlapping; 2) a new knowledge extraction strategy uses a set of soft mask matrices to adaptively prune the hidden neurons in the auxiliary task network, extracting specific knowledge that assists the current tasks in learning and forming a corresponding network for each task. However, in this work we have not addressed the interpretability problem in MTL; future work will focus on this issue.
