Material: REFUGE Challenge Dataset
In 2018, the Retinal Fundus Glaucoma Challenge (REFUGE) was launched as a satellite event of the MICCAI 2018 conference. For this event, 1200 retinal fundus images (400 for training, 400 for validation, 400 for testing) from different cameras and medical centers were collected and annotated by human experts. Annotations were provided for four different tasks: glaucoma diagnosis, optic disc segmentation, optic cup segmentation, and fovea localization. For the diagnosis task, the ground truth is provided as binary labels attesting to the presence of glaucoma. For the segmentation tasks, the regions defined by the OD (the optic nerve head) and the OC (the white elliptical region inside the optic disc) are provided as binary segmentation masks. For the fovea localization task, the ground truth is given as the (X, Y) pixel location of the fovea. All developed methods and experiments were performed in accordance with the relevant guidelines and regulations associated with this publicly available dataset.
Methods
In the following, we describe the overall MTL deep learning architecture adopted, the loss functions used for each task, and finally the Independent Optimizers (IO) optimization strategy adopted.
Multitasking Deep Learning Architecture
We use a UNet^{33}, an encoder-decoder convolutional network, with a VGG16^{34} structure and skip connections between encoder and decoder stages of equal depth, which allow the decoder to recover fine details from multiple scales. This network is well known to solve biomedical segmentation tasks efficiently^{35}. Although many variations of the UNet architecture have been proposed for different applications^{36}, we choose its main version with a VGG16 backbone, because it is the most widely used and is a default choice for most applications^{36,37,38,39}. Our MTL approach uses this architecture for two segmentation tasks (OD and OC), a regression task (fovea coordinates), and a classification task (glaucoma diagnosis). The MTL architecture design is shown in Fig. 5 and detailed below.
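As a purely illustrative sketch (the input size and the number of depths are assumed for the example, not taken from the paper), the following bookkeeping shows why skip connections between encoder and decoder stages of equal depth line up: each pooling step halves the spatial size, and the decoder mirrors the encoder.

```python
# Illustrative sketch (not the authors' code): feature-map sizes in a
# UNet-style encoder-decoder, and the matching required by skip connections.

def unet_shapes(input_size=224, depths=5):
    """Spatial size of the feature maps at each encoder depth."""
    sizes = [input_size]
    for _ in range(depths - 1):
        sizes.append(sizes[-1] // 2)  # each 2x2 max-pool halves the size
    return sizes

encoder = unet_shapes(224, 5)      # [224, 112, 56, 28, 14]
decoder = list(reversed(encoder))  # upsampling mirrors the encoder
# A skip connection concatenates encoder level d with decoder level d,
# which requires identical spatial sizes:
for d, size in enumerate(encoder):
    assert decoder[len(decoder) - 1 - d] == size
```

This matching is what lets the decoder reuse fine-grained encoder features at every scale.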
Optic disc and cup segmentation tasks
The OD and OC segmentation masks are obtained through a task-specific convolutional layer after the shared decoder. Similar to existing works^{9}, the OD and OC segmentations are refined by a post-processing step that keeps only the largest connected component of the prediction map, to remove possible prediction noise around these elliptical regions.
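The authors' exact post-processing code is not given; a minimal sketch of the idea, keeping only the largest 4-connected component of a binary prediction map, could look as follows (pure Python, breadth-first search):

```python
# Hedged sketch of the post-processing step: keep only the largest
# connected component of a binary prediction map (4-connectivity assumed).
from collections import deque

def largest_component(mask):
    """mask: list of lists of 0/1. Returns a same-shaped mask that keeps
    only the largest 4-connected component of 1-valued pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                comp, q = [], deque([(i, j)])
                seen[i][j] = True
                while q:  # flood-fill the component containing (i, j)
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = 1
    return out
```

In practice a library routine such as `scipy.ndimage.label` would do the same job; the hand-written version above only illustrates the operation.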
Location of the fovea
The fovea localization task is treated as a segmentation task: from the ground-truth coordinates of the fovea, a saliency map is created whose center represents the fovea location. The map is a multivariate normal distribution centered on the ground-truth coordinates (equal variances and zero covariances). An example is shown in Fig. 6 (right). The network is trained to fit these maps with a task-specific convolutional layer on the shared decoder. The fovea coordinates are then predicted as the center of mass of the predicted saliency map. In this case, no refinement or post-processing is done, because it could shift the center of mass.
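As an illustration (map size and standard deviation are assumed for the example, not taken from the paper), the saliency-map construction and the center-of-mass decoding can be sketched in NumPy:

```python
# Sketch: build an isotropic Gaussian target map centered on the
# ground-truth fovea coordinates, then decode a location from a map
# as its center of mass. Parameters (64x64, sigma=5) are assumptions.
import numpy as np

def fovea_map(h, w, cy, cx, sigma=5.0):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def center_of_mass(saliency):
    ys, xs = np.mgrid[0:saliency.shape[0], 0:saliency.shape[1]]
    total = saliency.sum()
    return (ys * saliency).sum() / total, (xs * saliency).sum() / total

m = fovea_map(64, 64, 30, 20)
cy, cx = center_of_mass(m)  # recovers approximately (30, 20)
```

The center-of-mass decoding is what makes post-processing undesirable here: removing or reshaping part of the map would bias the estimate.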
Glaucoma detection task
The glaucoma detection (classification) task consists of two steps:

1. A prediction is obtained from a fully connected layer branched after the UNet encoder (FC classifier).

2. As in previous work^{9}, a second prediction is obtained from a logistic regression classifier (linear classifier) taking as input the vertical Cup-to-Disc Ratio (vCDR) obtained from the OD and OC segmentation tasks. The vCDR is calculated as follows:
$$\begin{aligned} vCDR = \frac{OC_{height}}{OD_{height}} \end{aligned}$$
with \(OC_{height}\) and \(OD_{height}\) the heights of the OC and OD, obtained from the segmentation branches.
The pre-threshold outputs of the two classifiers are averaged. The final classification is obtained by applying a threshold of 0.5 to this average.
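The two-step scheme can be sketched as follows (the mask layout and function names are assumptions for illustration; the learned logistic regression on the vCDR is replaced here by an already-computed probability):

```python
# Sketch of the classification ensemble: the vCDR is measured from the
# binary OD/OC masks, and the FC-head probability is averaged with the
# vCDR-classifier probability before thresholding at 0.5.
import numpy as np

def vcdr(od_mask, oc_mask):
    """Vertical cup-to-disc ratio from binary masks (rows x cols)."""
    od_rows = np.where(od_mask.any(axis=1))[0]  # rows touched by the OD
    oc_rows = np.where(oc_mask.any(axis=1))[0]  # rows touched by the OC
    od_h = od_rows.max() - od_rows.min() + 1
    oc_h = oc_rows.max() - oc_rows.min() + 1
    return oc_h / od_h

def glaucoma_decision(p_fc, p_vcdr, threshold=0.5):
    """Average the two classifier probabilities, then threshold."""
    return (p_fc + p_vcdr) / 2 >= threshold
```

Averaging the two probabilities lets the segmentation-derived clinical marker (vCDR) and the image-level FC classifier compensate for each other's errors.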
Loss functions
Here, we present the loss functions used for the optimization of the different tasks.
OD and OC segmentation
Both the OD and OC segmentation tasks use the binary cross-entropy (BCE) loss, averaged over the pixels of the segmentation maps:
$$\begin{aligned} \mathscr{L}_{BCE}(p, y) = -\frac{1}{N_{pix}} \sum _{i=1}^{N_{pix}} y_i \log (p_i) + (1-y_i) \log (1-p_i) \end{aligned}$$
with \(p\), \(y\) and \(N_{pix}\) respectively the prediction, the ground truth and the number of pixels.
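The pixel-averaged BCE above can be written directly in NumPy (the epsilon clipping is a standard numerical-safety addition, not stated in the text):

```python
# Sketch of the pixel-averaged binary cross-entropy loss.
import numpy as np

def bce_loss(p, y, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```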
Location of the fovea
For the fovea localization task, the network is trained to fit the precomputed saliency maps with an L1 loss, because the map values are not binary:
$$\begin{aligned} \mathscr{L}_{L1}(p, y) = \sum _i |y_i - p_i| \end{aligned}$$
Then, the predicted location of the fovea is calculated as the center of mass of the predicted saliency map.
Classification of glaucoma
For the glaucoma classification task, a focal loss^{40} is used to better handle the imbalance between positive and negative samples (only \(10\%\) positive samples):
$$\begin{aligned} \mathscr{L}_{Focal}(p, y) = -(1-p_t)^\gamma \log(p_t) \end{aligned}$$
with
$$\begin{aligned} p_t = {\left\{ \begin{array}{ll} p &{} \text {if} \quad y=1 \\ (1-p) &{} \text {else} \end{array}\right. } \end{aligned}$$
Concretely, this loss multiplies the usual binary cross-entropy term by a classification uncertainty term \((1-p_t)^\gamma\), giving more importance to uncertain classifications, i.e. typically those of the underrepresented class. We fix the hyperparameter \(\gamma\) to 2 in our experiments.
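A direct NumPy transcription of the focal loss above, with \(\gamma = 2\) (epsilon clipping added for numerical safety), illustrates the down-weighting of confident classifications:

```python
# Sketch of the focal loss with gamma = 2: the (1 - p_t)^gamma factor
# shrinks the contribution of well-classified samples.
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)  # probability of the true class
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))
```

For a confidently correct prediction (e.g. \(p = 0.9\), \(y = 1\)), the factor \((1-0.9)^2 = 0.01\) makes the loss two orders of magnitude smaller than plain BCE, so training focuses on the hard, usually minority-class, samples.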
MTL Optimization Strategy with Independent Optimizers
In the following, we present the IO optimization strategy used in this work. It is based on an alternated optimization scheme, alternating independent gradient descent steps on the different task-specific objective functions, as proposed by Pascal et al.^{41}. We detail the main steps leading to this optimization scheme, and refer the interested reader to Pascal et al.^{41} for more details.
The standard MTL optimization setup with an aggregated loss^{14} can be expressed as follows:
$$\begin{aligned} \mathscr{L}(w_t,\xi _t)= \sum _{k=1}^N c^{(k)} \cdot \mathscr{L} ^{(k)} (w_{t}, \xi _{t}) \end{aligned}$$
where \(\mathscr{L}^{(k)}\) is the loss function associated with the \(k^{th}\) of the \(N\) tasks, \(w_t\) the shared parameters, and \(\xi_{t}\) the data sample, at iteration \(t\). The \(c^{(k)}\) are task-specific weights, for which we assume uniform weighting, i.e. \(c^{(k)} = 1\). Letting \(g^{(k)}\) denote the gradient of \(\mathscr{L}^{(k)}\) with respect to the shared parameters \(w\), the update rule for \(w\) at step \(t+1\) using stochastic gradient descent is:
$$\begin{aligned} w_{t+1} = w_{t} - \eta _t \cdot \sum _{k=1}^N g^{(k)}(w_{t}, \xi _{t}) \end{aligned}$$
(1)
where \(\eta_t\) is the learning rate.
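In code, one aggregated step (Eq. 1) with uniform task weights reduces to a single SGD update on the sum of the task gradients (a toy sketch with stand-in gradients, not the actual training loop):

```python
# Toy sketch of the aggregated update rule (Eq. 1): one gradient step on
# the sum of per-task gradients, with uniform task weights c^(k) = 1.
import numpy as np

def aggregated_step(w, task_grads, lr=0.1):
    return w - lr * np.sum(task_grads, axis=0)

w = np.array([1.0, -2.0])
grads = [np.array([0.2, 0.0]), np.array([0.0, -0.4])]  # two stand-in tasks
w_new = aggregated_step(w, grads)  # -> [0.98, -1.96]
```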
Recent works^{15,42,43} propose a variant of the update rule in Eq. 1, in which alternated independent update steps with respect to the different task-specific loss functions are performed, instead of aggregating all terms at once. This strategy aims to minimize task interference and thus improve generalization. The alternated update rule can be expressed as follows:
$$\begin{aligned} w_{t+1}^{(k)} = {\left\{ \begin{array}{ll} w_{t}^{(N)} - \eta _t \cdot g^{(k)} ( w_{t}^{(N)},\xi _t), &{} k=1 \\ w_{t}^{(k-1)} - \eta _t \cdot g^{(k)} ( w_{t}^{(k-1)},\xi _t), &{} \forall k > 1 \end{array}\right. } \end{aligned}$$
(2)
In this work, we adopt the approach of Pascal et al.^{41}. It uses a modified alternated update rule (Eq. 2) that introduces independent optimizers (IO), in the form of an individual exponential moving average for each task, to prevent stateful optimizers (e.g. Adam) from accumulating and mixing the previous gradient descent directions of all the different tasks. The modified update rule can be expressed as follows:
$$\begin{aligned} w_{t+1}^{(k)} = {\left\{ \begin{array}{ll} w_{t}^{(N)} - \eta _t \cdot \hat{m}^{(k)} \left( g^{(k)} ( w_{t}^{(N)},\xi _t) \right) , &{} k=1 \\ w_{t}^{(k-1)} - \eta _t \cdot \hat{m}^{(k)}\left( g^{(k)} ( w_{t}^{(k-1)} ,\xi _t) \right) , &{} \forall k > 1 \end{array}\right. } \end{aligned}$$
(3)
where \(\hat{m}^{(k)}\) is a task-specific exponential moving average mechanism. Here, the memory introduced by \(\hat{m}^{(k)}\) involves only the previous updates of task \(k\). Such a formulation is equivalent to using one independent optimizer per task, and is therefore denoted MTL-IO. In this article, we use MTL-IO to refer to the complete pipeline.
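A toy sketch of this alternated scheme (Eq. 3) on two scalar quadratic tasks can make the mechanics concrete. Everything here is a stand-in for illustration: the per-task `TaskEMA` replaces a full optimizer instance, and the smoothing factor and learning rate are assumed values, not the paper's settings.

```python
# Hedged sketch of the alternated MTL-IO scheme: one independent update
# step per task, each with its own exponential moving average of gradients
# (stand-in for one optimizer instance per task; beta is an assumption).

class TaskEMA:
    def __init__(self, beta=0.9):
        self.beta, self.m = beta, None
    def __call__(self, g):
        # Memory involves only this task's previous gradients.
        self.m = g if self.m is None else self.beta * self.m + (1 - self.beta) * g
        return self.m

def mtl_io_round(w, grad_fns, emas, lr=0.1):
    """One pass over all N tasks: each step starts from the parameters
    left by the previous task's step, as in Eq. 3."""
    for grad_fn, ema in zip(grad_fns, emas):
        w = w - lr * ema(grad_fn(w))
    return w

# Two toy quadratic tasks with optima at +1 and -1:
grad_fns = [lambda w: 2 * (w - 1.0), lambda w: 2 * (w + 1.0)]
emas = [TaskEMA(), TaskEMA()]
w = 5.0
for _ in range(200):
    w = mtl_io_round(w, grad_fns, emas)
# w settles near 0, a compromise between the two task optima
```

The point of the per-task averages is visible in the class: each `TaskEMA` only ever sees its own task's gradients, so the smoothed direction for one task is never contaminated by the descent directions of the others.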
Implementation details
All methods were implemented in PyTorch 1.2 and run on NVIDIA Titan Xp graphics cards. Kaiming uniform initialization^{44} was used for all baselines, except for the network parts initialized with transfer learning. For the five-fold cross-validation, the training and validation splits were obtained by merging and shuffling the official train and validation splits, while the testing split remained unchanged.