# Multi-task deep learning for glaucoma detection from color fundus images

### Material: REFUGE Challenge Dataset

In 2018, the Retinal Fundus Glaucoma Challenge (REFUGE) was launched as a satellite event of the MICCAI 2018 conference. For this event, 1200 retinal fundus images (400 for training, 400 for validation, 400 for testing) from different cameras and medical centers were collected and annotated by human experts. Annotations were provided for four tasks: glaucoma diagnosis, optic disc (OD) segmentation, optic cup (OC) segmentation, and fovea localization. For the diagnosis task, the ground truth is a binary label indicating the presence of glaucoma. For the segmentation tasks, the regions defined by the OD (optic nerve head) and the OC (the white elliptical region inside the optic disc) are provided as binary masks. For fovea localization, the ground truth is the (x, y) pixel location of the fovea. All developed methods and experiments were performed in accordance with the relevant guidelines and regulations associated with this publicly available dataset.

### Methods

In the following, we describe the overall multi-task learning (MTL) architecture, the loss function used for each task, and finally the Independent Optimization (IO) strategy.

We use a U-Net33, an encoder-decoder convolutional network with a VGG-16 structure34 and added skip connections between equivalent encoder and decoder depths, which allow the decoder to recover fine details from the multiple scales. This network is well known to solve biomedical segmentation tasks efficiently35. Although many variations of the U-Net architecture have been proposed for different applications36, we use its main version with a VGG-16 encoder, because it is the most widely used and is a default choice for most applications36,37,38,39. Our MTL approach uses this architecture for two segmentation tasks (OD and OC), a regression task (fovea coordinates) and a classification task (glaucoma diagnosis). The MTL architecture design is shown in Fig. 5 and detailed below.

##### Optic disc and cup segmentation tasks

The OD and OC segmentation masks are each obtained from a task-specific convolutional layer placed after the shared decoder. As in existing works9, the OD and OC segmentations are refined by a post-processing step that keeps only the largest connected component of the prediction map, which removes possible prediction noise around these elliptical regions.
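The post-processing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `keep_largest_component`, the 0.5 binarization threshold, and the use of `scipy.ndimage.label` with its default connectivity are all assumptions.

```python
import numpy as np
from scipy import ndimage


def keep_largest_component(prob_map, threshold=0.5):
    """Binarize a probability map and keep only its largest connected region.

    The threshold value is an assumption; the source does not state it.
    """
    mask = prob_map >= threshold
    labels, n_components = ndimage.label(mask)
    if n_components == 0:
        # Nothing was predicted as foreground.
        return np.zeros_like(mask, dtype=bool)
    # Pixel count per component; index 0 is the background, so skip it.
    sizes = np.bincount(labels.ravel())[1:]
    largest = int(np.argmax(sizes)) + 1
    return labels == largest


# Toy prediction: one 3x3 blob plus one isolated noisy pixel.
demo = np.zeros((6, 6))
demo[1:4, 1:4] = 0.9
demo[5, 5] = 0.8
cleaned = keep_largest_component(demo)
```

In this toy example the isolated pixel is discarded and only the 3x3 blob remains in `cleaned`.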

##### Location of the fovea

The fovea localization task is treated as a segmentation task: from the ground truth coordinates of the fovea, a saliency map is created whose center marks the location of the fovea. The map is a multivariate normal distribution centered on these coordinates (equal variances and zero covariances). An example is shown in Fig. 6 (right). The network is trained to fit these maps with a task-specific convolutional layer on top of the shared decoder. The coordinates of the fovea are then predicted as the center of mass of the predicted saliency map. In this case, no refinement or post-processing is applied, because it could shift the center of mass.
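To make the encoding and decoding of the fovea location concrete, here is a small sketch under stated assumptions: the map is an unnormalized isotropic Gaussian (peak value 1), and the names `gaussian_target`, `center_of_mass` and the value of `sigma` are illustrative, since the source does not give the variance used.

```python
import numpy as np


def gaussian_target(shape, center_xy, sigma=10.0):
    """Build an isotropic Gaussian map centered on the fovea (x, y) location."""
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def center_of_mass(saliency):
    """Return the (x, y) center of mass of a non-negative saliency map."""
    ys, xs = np.mgrid[0:saliency.shape[0], 0:saliency.shape[1]]
    total = saliency.sum()
    return (xs * saliency).sum() / total, (ys * saliency).sum() / total


# Encode a hypothetical fovea at (x=40, y=25), then decode it again.
target = gaussian_target((64, 64), center_xy=(40, 25), sigma=4.0)
x_hat, y_hat = center_of_mass(target)
```

Decoding the ground-truth map recovers the original coordinates up to a negligible truncation error at the image border.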

The glaucoma detection (classification) task consists of two steps:

1. A prediction is obtained from a fully connected layer branched after the U-Net encoder (FC classifier).
2. As in previous work9, a second prediction is obtained from a logistic regression classifier (linear classifier), taking as input the vertical cup-to-disc ratio (vCDR) computed from the outputs of the OD and OC segmentation tasks. The vCDR is calculated as follows:

$$\begin{aligned} vCDR = \frac{OC_{height}}{OD_{height}} \end{aligned}$$

with \(OC_{height}\) and \(OD_{height}\) the heights of the OC and OD, obtained from the segmentation branches.

The outputs of the two classifiers, taken before binarization, are averaged. The final classification is obtained by applying a threshold of 0.5 to this average.
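The vCDR computation and the averaging of the two classifier outputs can be sketched as below. The helper names (`vertical_extent`, `vcdr`, `ensemble_decision`) and the example probabilities are hypothetical; the logistic regression that maps the vCDR to a probability is not reproduced here, so its output is passed in directly as `p_linear`.

```python
import numpy as np


def vertical_extent(mask):
    """Height in pixels of the region covered by a binary mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    return 0 if rows.size == 0 else int(rows[-1] - rows[0] + 1)


def vcdr(cup_mask, disc_mask):
    """Vertical cup-to-disc ratio from binary OC and OD masks."""
    disc_height = vertical_extent(disc_mask)
    if disc_height == 0:
        return 0.0  # assumption: no disc found means no measurable ratio
    return vertical_extent(cup_mask) / disc_height


def ensemble_decision(p_fc, p_linear, threshold=0.5):
    """Average the two classifier outputs and threshold the mean."""
    p_mean = 0.5 * (p_fc + p_linear)
    return p_mean, p_mean >= threshold


disc = np.zeros((20, 20), dtype=bool)
disc[5:15, 5:15] = True   # disc height: 10 pixels
cup = np.zeros((20, 20), dtype=bool)
cup[8:12, 8:12] = True    # cup height: 4 pixels
ratio = vcdr(cup, disc)

# Hypothetical outputs of the FC classifier and of the linear classifier.
p_mean, is_glaucoma = ensemble_decision(p_fc=0.7, p_linear=0.4)
```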

#### Loss functions

Here, we present the loss functions used for the optimization of the different tasks.

##### OD and OC segmentation

Both OD and OC segmentation tasks use the binary cross-entropy (BCE) loss, averaged over all pixels i of the segmentation maps:

$$\begin{aligned} \mathscr{L}_{BCE}(p, y) = -\frac{1}{N_{pix}} \sum _{i=1}^{N_{pix}} \left[ y_i \log (p_i) + (1-y_i) \log (1-p_i) \right] \end{aligned}$$

with p, y and \(N_{pix}\) respectively the prediction, the ground truth and the number of pixels.
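As a quick numerical check of the pixel-averaged BCE, the following sketch evaluates it with NumPy. The clipping constant `eps` is an assumption added for numerical stability and is not part of the formula in the text.

```python
import numpy as np


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy averaged over all pixels of a prediction map."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0); eps is an assumption
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
uninformed = bce_loss(np.full((2, 2), 0.5), y_true)
confident = bce_loss(np.array([[0.9, 0.1], [0.1, 0.9]]), y_true)
```

A prediction of 0.5 everywhere gives a loss of log 2, and a confident correct prediction gives a smaller value.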

##### Location of the fovea

For the fovea localization task, the network is trained to fit the preprocessed saliency maps with an L1 loss, because the map values are not binary:

$$\begin{aligned} \mathscr{L}_{L1}(p, y) = \sum _i |y_i - p_i| \end{aligned}$$

Then, the predicted location of the fovea is calculated as the center of mass of the predicted saliency map.

##### Classification of glaucoma

For the glaucoma classification task, a focal loss40 is used to better handle the imbalance between positive and negative samples (only 10% of the samples are positive):

$$\begin{aligned} \mathscr{L}_{Focal}(p, y) = -(1-p_t)^\gamma \log (p_t) \end{aligned}$$

with

$$\begin{aligned} p_t = {\left\{ \begin{array}{ll} p &{} \quad \text {if} \quad y=1 \\ 1-p &{} \quad \text {else} \end{array}\right. } \end{aligned}$$

Concretely, this loss multiplies the usual binary cross-entropy term by a classification uncertainty term \((1-p_t)^\gamma\) to give more importance to uncertain classifications, which typically come from the under-represented class. We fix the hyperparameter \(\gamma\) to 2 in our experiments.
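The effect of the uncertainty term can be checked numerically with a short sketch. The mean reduction over samples and the clipping constant `eps` are assumptions; the text only defines the per-sample loss.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss for binary classification, averaged over samples."""
    p = np.clip(p, eps, 1.0 - eps)        # avoid log(0); eps is an assumption
    p_t = np.where(y == 1, p, 1.0 - p)    # probability given to the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


labels = np.array([1, 1])
easy = focal_loss(np.array([0.9, 0.9]), labels)   # confident and correct
hard = focal_loss(np.array([0.3, 0.3]), labels)   # uncertain
plain = focal_loss(np.array([0.9, 0.9]), labels, gamma=0.0)
```

With `gamma=0` the expression reduces to the usual cross-entropy, while with `gamma=2` a confidently classified sample is down-weighted far more strongly than an uncertain one.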

#### MTL Independent Optimization strategy

In the following, we present the IO optimization strategy used in this work. It is based on an alternating optimization scheme, which alternates independent gradient descent steps on the different task-specific objective functions, as proposed by Pascal et al.41. We outline the main steps leading to this optimization scheme below, and refer the interested reader to Pascal et al.41 for more details.

The standard MTL optimization setup with aggregate loss14 can be expressed as follows:

$$\begin{aligned} \mathscr{L}(w_t,\xi _t)= \sum _{k=1}^N c^{(k)} \cdot \mathscr{L}^{(k)} (w_{t}, \xi _{t}) \end{aligned}$$

where \(\mathscr{L}^{(k)}\) is the loss function associated with the \(k^{th}\) of the N tasks, \(w_t\) the shared parameters, and \(\xi _{t}\) the data sample at iteration t. The \(c^{(k)}\) are task-specific weights, for which we assume uniform weighting, i.e. \(c^{(k)} = 1\). If \(g^{(k)}\) denotes the derivative of \(\mathscr{L}^{(k)}\) with respect to the shared parameters w, the update rule for w at step \(t + 1\) using stochastic gradient descent is:

$$\begin{aligned} w_{t+1} = w_{t} - \eta _t \cdot \sum _{k=1}^N g^{(k)}(w_{t}, \xi _{t}) \end{aligned} \tag{1}$$

where \(\eta _t\) is the learning rate.

Recent works15,42,43 propose a variant of the update rule in Equation 1, in which alternating independent update steps are performed with respect to the different task-specific loss functions, instead of aggregating all terms at once. This strategy aims to minimize task interference and thus improve generalization. The alternating update rule can be expressed as follows:

$$\begin{aligned} w_{t+1}^{(k)} = {\left\{ \begin{array}{ll} w_{t}^{(N)} - \eta _t \cdot g^{(k)} ( w_{t}^{(N)},\xi _t), &{} k=1 \\ w_{t}^{(k-1)} - \eta _t \cdot g^{(k)} ( w_{t}^{(k-1)},\xi _t), &{} \forall k > 1 \end{array}\right. } \end{aligned} \tag{2}$$
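The difference between the aggregated update and the alternating update can be illustrated on a toy problem. The two quadratic task losses, their targets and the learning rate below are hypothetical and chosen only so that the result can be verified by hand.

```python
import numpy as np

# Two hypothetical task losses on shared parameters w:
# L_k(w) = 0.5 * ||w - target_k||^2, so that g_k(w) = w - target_k.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]


def grad(k, w):
    """Gradient of the k-th toy task loss with respect to w."""
    return w - targets[k]


def aggregated_step(w, lr):
    """One SGD step on the sum of all task gradients, all evaluated at w."""
    return w - lr * sum(grad(k, w) for k in range(len(targets)))


def alternating_step(w, lr):
    """One independent SGD step per task, applied one after the other."""
    for k in range(len(targets)):
        w = w - lr * grad(k, w)  # each task starts from the previous task's result
    return w


w0 = np.zeros(2)
w_agg = aggregated_step(w0, lr=0.1)   # [0.1, 0.1]
w_alt = alternating_step(w0, lr=0.1)  # [0.09, 0.1]
```

The two results differ because, in the alternating scheme, the second task's gradient is evaluated after the first task has already moved the shared parameters.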

In this work, we adopt the approach of Pascal et al.41. It modifies the alternating update rule (Eq. 2) to allow the use of individual optimizers (IO), in the form of individual exponential moving averages for each task, which prevents state-of-the-art optimizers (e.g. Adam) from accumulating and mixing the previous gradient descent directions of all the different tasks. The modified update rule can be expressed as follows:

$$\begin{aligned} w_{t+1}^{(k)} = {\left\{ \begin{array}{ll} w_{t}^{(N)} - \eta _t \cdot \hat{m}^{(k)} \left( g^{(k)} ( w_{t}^{(N)},\xi _t) \right) , &{} k=1 \\ w_{t}^{(k-1)} - \eta _t \cdot \hat{m}^{(k)}\left( g^{(k)} ( w_{t}^{(k-1)},\xi _t) \right) , &{} \forall k > 1 \end{array}\right. } \end{aligned} \tag{3}$$

where \(\hat{m}^{(k)}\) is a task-specific exponential moving average mechanism. Here, the memory term introduced by \(\hat{m}^{(k)}\) involves only previous updates of task k. Such a formulation is equivalent to using one independent optimizer per task, and is therefore denoted MTL-IO. In this article, we use MTL-IO to refer to the complete pipeline.
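A minimal sketch of the idea of one optimizer state per task is given below, assuming Adam-style moment estimates and the same kind of toy quadratic task losses as an illustration. The class `AdamState`, the function `io_step`, the targets and all hyperparameter values are hypothetical and are not taken from the authors' implementation. In a deep learning framework, the same idea would roughly correspond to creating one optimizer instance per task over the shared parameters.

```python
import numpy as np


class AdamState:
    """Adam moment estimates kept separately for a single task."""

    def __init__(self, shape, beta1=0.9, beta2=0.999, eps=1e-8):
        self.m = np.zeros(shape)  # first-moment moving average
        self.v = np.zeros(shape)  # second-moment moving average
        self.step = 0
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def direction(self, g):
        """Update this task's moving averages and return its update direction."""
        self.step += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g ** 2
        m_hat = self.m / (1 - self.beta1 ** self.step)
        v_hat = self.v / (1 - self.beta2 ** self.step)
        return m_hat / (np.sqrt(v_hat) + self.eps)


# Hypothetical toy tasks: g_k(w) = w - target_k.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
states = [AdamState(2) for _ in targets]  # one independent state per task


def io_step(w, lr):
    """Alternate over tasks; each task only updates its own moving averages."""
    for k, state in enumerate(states):
        g = w - targets[k]
        w = w - lr * state.direction(g)
    return w


w = np.zeros(2)
for _ in range(200):
    w = io_step(w, lr=0.01)
```

Because each `AdamState` only ever sees the gradients of its own task, the accumulated directions of the two tasks are never mixed, which is the property the modified update rule is designed to provide.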

### Implementation details

All methods were implemented in PyTorch 1.2 and run on NVIDIA Titan Xp graphics cards. Kaiming uniform initialization44 was used for all baselines, except for network parts initialized with transfer learning. For the 5-fold cross-validation, the training and validation splits were drawn from the official training and validation sets, merged and shuffled, while the test split remained unchanged.
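One plausible way to build such folds is sketched below. The exact protocol is not specified in the text, so the random seed, the equal fold sizes and the helper name `fold_indices` are assumptions.

```python
import numpy as np

n_train, n_val = 400, 400             # official REFUGE train and validation sizes
rng = np.random.default_rng(seed=0)   # hypothetical seed
pool = rng.permutation(n_train + n_val)  # merged and shuffled image indices
folds = np.array_split(pool, 5)


def fold_indices(i):
    """Return (fitting, held-out) indices for the i-th of the 5 folds."""
    held_out = folds[i]
    fitting = np.concatenate([f for j, f in enumerate(folds) if j != i])
    return fitting, held_out


fit_idx, val_idx = fold_indices(0)
```

The 400 official test images are left out of this pool, so every fold is evaluated on the same unchanged test split.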