In FAIR's line of work on weakly supervised tasks, almost all of the papers use a multi-label softmax to produce the prediction outputs. From the text of the paper by Mahajan et al.:

Our model computes probabilities over all hashtags in the vocabulary using a softmax activation and is trained to minimize the cross-entropy between the predicted softmax distribution and the target distribution of each image. The target is a vector with k non-zero entries each set to 1/k corresponding to the k ≥ 1 hashtags for the image.

I am looking for ways to implement this loss as a function.
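For reference, here is a minimal sketch of how I read the quoted description, written in plain Python so that it does not depend on any particular framework. The function names (`soft_target`, `log_softmax`, `softmax_cross_entropy`) and the idea of passing the hashtags as a list of integer class indices are my own assumptions, not anything taken from the paper.

```python
import math


def soft_target(label_ids, num_classes):
    """Build the target vector: 1/k at each of the k distinct label positions."""
    unique = set(label_ids)
    if not unique:
        raise ValueError("each example needs at least one label (k >= 1)")
    k = len(unique)
    target = [0.0] * num_classes
    for i in unique:
        target[i] = 1.0 / k
    return target


def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def softmax_cross_entropy(logits, label_ids):
    """Cross-entropy between softmax(logits) and the 1/k target distribution."""
    target = soft_target(label_ids, len(logits))
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Example: 4 hashtags in the vocabulary, the image carries hashtags 0 and 2.
loss = softmax_cross_entropy([2.0, 0.5, 1.0, -1.0], [0, 2])
```

With k = 1 this reduces to the usual single-label softmax cross-entropy. The same three steps (build the 1/k target, take a log-softmax over the logits, sum the negative products) should carry over to a batched tensor implementation, but I have not verified that against the authors' code.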