This self-supervised contrastive learning pipeline offers a unified framework for
learning powerful image representations without labels. It supports five leading
model families, each leveraging a distinct strategy to build invariance into the
learned feature space:
SimCLR: Learns by pulling together the embeddings of different random augmentations
of the same image and pushing apart those of other images, using a temperature-scaled
contrastive loss to shape the embedding geometry.
DINO: A teacher–student approach that distills knowledge from
a momentum-updated network into a student across multi-crop views. Particularly
effective with Vision Transformers, it yields semantically rich feature maps.
SimSiam: Eliminates the need for negative pairs by predicting
one view’s representation from another through a siamese network with a stop-gradient.
BYOL: Bootstrap Your Own Latent uses two networks (online and target)
and an asymmetric predictor head to avoid collapse, learning by minimizing the distance
between online predictions and target projections.
MoCo: Maintains a dynamic memory bank (queue) of embeddings
and a momentum encoder, enabling large-scale contrastive learning with consistent
negatives even on small batches.
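The temperature-scaled contrastive objective behind SimCLR is the NT-Xent loss. The sketch below is a minimal reference implementation of that loss for two batches of paired views, not the pipeline's actual code:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D) unit vectors
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # mask self-similarity
    # The positive for row i is the other view of the same image, at index (i + N) mod 2N.
    targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)
    return F.cross_entropy(sim, targets)
```

Lowering the temperature sharpens the similarity distribution, penalizing hard negatives more strongly.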
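Both MoCo's momentum encoder and BYOL's target network are maintained as an exponential moving average (EMA) of the online network's weights rather than by gradient descent. A minimal sketch of that update (the function name and momentum coefficient are illustrative):

```python
import copy

import torch

@torch.no_grad()
def momentum_update(online: torch.nn.Module, target: torch.nn.Module, m: float = 0.99) -> None:
    """EMA update of the target/teacher weights: target <- m * target + (1 - m) * online."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)

# The target network starts as a frozen copy of the online network.
online = torch.nn.Linear(4, 4)
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)
momentum_update(online, target, m=0.99)
```

With m close to 1, the target evolves slowly, which gives MoCo consistent negatives and keeps BYOL's regression target stable.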
Pipeline Capabilities
Configurable Augmentations: Swap in SimCLR’s heavy random crops,
SimSiam’s minimal pipeline, DINO’s multi-crop strategy, or MoCo’s queue-based negatives
via simple YAML toggles.
Scalable Training: PyTorch Lightning handles mixed-precision training,
multi-GPU distribution, checkpointing, and logging to TensorBoard or Weights & Biases.
Flexible Backbones: Use ResNets or Vision Transformers; easily extend
to custom architectures.
Embedding Extraction: Export high-dimensional feature vectors for
downstream classification, retrieval, or clustering tasks.
Interactive Visualization: Generate 2D UMAP projections, nearest-neighbor
galleries, and hexbin maps of both observable and hidden properties.
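To give a sense of the YAML toggles, a hypothetical configuration fragment might look like the following; the key names are illustrative and will differ from the pipeline's actual schema:

```yaml
# Hypothetical config sketch; actual keys depend on the pipeline's schema.
model: simclr            # simclr | dino | simsiam | byol | moco
augmentations:
  random_resized_crop: true
  color_jitter: true
  gaussian_blur: true
  multi_crop:            # DINO-style multi-crop (ignored by other models)
    global_crops: 2
    local_crops: 6
moco:
  queue_size: 65536      # number of negative embeddings kept in the queue
  momentum: 0.999        # EMA coefficient for the momentum encoder
```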
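Embedding extraction amounts to a forward pass through the frozen backbone followed by flattening. The sketch below assumes the backbone ends in global pooling; the function name and toy backbone are illustrative, not the pipeline's API:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_embeddings(backbone: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Run a frozen backbone over a batch of images and return flat feature vectors."""
    backbone.eval()
    feats = backbone(images)   # e.g. (N, C, 1, 1) after global average pooling
    return feats.flatten(1)    # (N, D) embeddings for downstream tasks

# Toy stand-in for a ResNet/ViT backbone.
toy_backbone = nn.Sequential(nn.Conv2d(3, 16, 3), nn.AdaptiveAvgPool2d(1))
emb = extract_embeddings(toy_backbone, torch.randn(2, 3, 32, 32))  # shape (2, 16)
```

The resulting vectors can be fed directly to a linear classifier, a k-NN retrieval index, or a clustering algorithm.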
Examples of Learned Representations
Below we showcase how the contrastive models capture meaningful structure in image data.
Each example includes a UMAP projection and nearest-neighbor retrievals (or hexbin histograms)
to illustrate clustering by morphology or physical properties.
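The nearest-neighbor retrievals can be computed directly from the embeddings with cosine similarity. A minimal sketch, assuming embeddings are stacked row-wise in a NumPy array (the pipeline's own retrieval utilities may differ):

```python
import numpy as np

def top_k_neighbors(embeddings: np.ndarray, query_idx: int, k: int = 10) -> np.ndarray:
    """Return indices of the k embeddings most cosine-similar to the query, excluding itself."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z[query_idx]        # cosine similarity of every row to the query
    sims[query_idx] = -np.inf      # exclude the query from its own neighbors
    return np.argsort(sims)[::-1][:k]
```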
🪼 Jellyfish Galaxies
Data Source:
Zooniverse – Cosmological Jellyfish
Using galaxy cutouts from this project, we trained a SimCLR model to learn morphology-aware embeddings.
UMAP Projection
In this projection, galaxies with similar tail-like morphology cluster tightly, revealing the model’s ability to distinguish visual features purely from contrastive signals.
Nearest Neighbors Visualization
For each query image, the top-10 nearest neighbors are shown with model-inferred “jellyfish” probability scores. High visual similarity and consistent probability scores confirm robust clustering in the embedding space.
🌌 X-ray Galaxy Clusters (TNG-Cluster)
Data Source:
TNG-Cluster Simulations
We applied DINO to raw X-ray maps across multiple simulation snapshots to uncover morphological groupings.
UMAP Projection
Distinct regions correspond to different cluster morphologies—relaxed, merging, or cool-core systems—demonstrating DINO’s capacity to encode high-level astrophysical features.
Nearest Neighbors Visualization
Query cluster (left) and its top-9 nearest neighbors reveal strong morphological consistency, supporting the embedding’s semantic organization.