Earlier entries: Part 1, Part 2
I added some more optional controls to the “contrastive loss toy model” from Part 1 that I want to demonstrate interactively, so this Part 3 post is in the form of a video:
Addenda:
(I’m new to this video-making thing and TBH not sure I want to get good at it. I realize I’m over-animated…in a way that doesn’t happen when I’m in front of a class, on stage or on TV!)
What I left out:
- Various ways of choosing negative examples, and of doing so in a computationally efficient way (e.g. via Locality-Sensitive Hashing). The paper by Wu et al. gives a good survey of these issues – at least as of ca. 2017 – and newer treatments have appeared since then; I encourage you to seek them out, and I’ll be joining you. (A minimal sketch of one simple sampling idea appears just after this list.) Furthermore, if you follow the “Attract Only” methodology of SimCLR, then you wouldn’t have any negative examples anyway. ;-)
- Messing around with the toy model – giving it “crazy” inputs like huge margins, whatever – is not only fun (to me) but can also be quite informative as a process of “discovery”: I’m sure I learned things through that hands-on experience that I might otherwise only have read about (and maybe missed) in a paper.
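To make the above concrete, here is a minimal sketch – my own illustration in plain NumPy, not the toy model’s actual code – of the pairwise contrastive loss with a margin from Hadsell et al. (see References), along with one naive “hard negative” picker in the spirit of the sampling strategies surveyed by Wu et al. The names (`contrastive_loss`, `closest_negative`, etc.) are made up for this example:

```python
import numpy as np

def contrastive_loss(za, zb, same, margin=1.0):
    """Pairwise contrastive loss (Hadsell, Chopra & LeCun 2006).

    za, zb : (N, d) arrays of embeddings for the two items in each pair.
    same   : (N,) array, 1 for a positive pair, 0 for a negative pair.
    margin : negative pairs are pushed apart only until they are `margin` apart.
    """
    d = np.linalg.norm(za - zb, axis=1)                   # Euclidean distance per pair
    attract = same * d**2                                 # pull positives together
    repel = (1 - same) * np.maximum(margin - d, 0.0)**2   # push negatives out to the margin
    return 0.5 * np.mean(attract + repel)

def closest_negative(anchors, candidates):
    """Naive 'hard negative' picker: for each anchor, return the candidate
    closest to it in embedding space.  O(N*M) brute force -- real systems use
    cheaper schemes (distance-weighted sampling, LSH buckets, etc.)."""
    dists = np.linalg.norm(anchors[:, None, :] - candidates[None, :, :], axis=-1)
    return candidates[np.argmin(dists, axis=1)]

# Tiny usage example with random 2-D embeddings
rng = np.random.default_rng(0)
za, zb = rng.normal(size=(8, 2)), rng.normal(size=(8, 2))
same = rng.integers(0, 2, size=8)
print("loss:", contrastive_loss(za, zb, same, margin=2.0))
```

Cranking `margin` way up – as in the “crazy inputs” play mentioned above – makes the repulsive term keep acting until negative pairs sit at least that far apart; and in practice the brute-force `closest_negative` scan is exactly the kind of thing the efficient sampling schemes above are meant to replace.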
What I said that came off as maybe misleading:
- I didn’t mean to imply that CLIP used the “attract-only” scheme I attribute to SimCLR. CLIP has a “contrastive loss”, which means negative examples – a rough sketch of such a loss follows below. I was just talking about how the CLIP result on ImageNet demonstrated the utility of metric-based, semi-supervised learning.
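For readers who want to see what “contrastive loss means negative examples” looks like in code, here is a rough, simplified sketch of a CLIP-style symmetric loss, where each image embedding is scored against every text embedding in the batch and all the non-matching ones act as negatives. This is my paraphrase of the published description, not CLIP’s actual implementation; the names and the fixed temperature are just for illustration:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss in the spirit of CLIP (Radford et al. 2021).

    img_emb, txt_emb : (N, d) embeddings where row i of each is a matched pair.
    Every non-matching row in the batch serves as a negative example.
    (CLIP learns its temperature; here it is just a fixed constant.)
    """
    # Normalize so the dot product is a cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) similarity matrix

    def xent_diag(l):
        """Cross-entropy with the diagonal (matched pairs) as the targets."""
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logprob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprob))

    # Average the image->text and text->image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

# Tiny usage example with random embeddings
rng = np.random.default_rng(0)
print("loss:", clip_style_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```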
References
- Jordan, J. Personal communications. cf. Twitter Page: https://twitter.com/jeremyjordan.
- Wu, C.-Y.; Manmatha, R.; Smola, A.J.; Krähenbühl, P. Sampling Matters in Deep Embedding Learning. arXiv:1706.07567 [cs], 2018.
- “CLIP”: Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs], 2021.
- Fonseca, E.; Ortego, D.; McGuinness, K.; O’Connor, N.E.; Serra, X. Unsupervised Contrastive Learning of Sound Event Representations. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2021.
- Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06); 2006; Vol. 2, pp. 1735–1742.
- “SimCLR”: Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709 [cs, stat], 2020.
- Alammar, J. “The Illustrated Word2vec”, https://jalammar.github.io/illustrated-word2vec/
- AK / akhaliq / ak92501, “Gradio Demo for VQGAN+CLIP”, HuggingFace Spaces, https://huggingface.co/spaces/akhaliq/VQGAN_CLIP