Principal ML Engineer | Nomic AI
Co-creator of GPT4All and lead author of Nomic Embed, building fully open and reproducible embedding models and local LLM infrastructure.
Biography
Zach Nussbaum is a Principal Machine Learning Engineer at Nomic AI, where he leads the development of the Nomic Embed family of open-source text and vision embedding models. He is a co-creator of GPT4All, the pioneering open-source local LLM chatbot that became the third-fastest-growing GitHub repository of all time with over 77,000 stars and 250,000 monthly active users. Before Nomic, he worked at Deep Genomics on machine learning for drug discovery, contributing to BigRNA, a foundation model for tissue-specific RNA expression prediction. A former Division I baseball player at Davidson College, he also participated in open-community research with ML Collective and OpenBioML.
Co-created the open-source local LLM chatbot that became the third-fastest-growing GitHub repository of all time (77k+ stars, 250k+ MAU). Demonstrated that capable assistant-style chatbots could run on consumer CPUs without API calls.
Lead author of the first fully reproducible, open-source, open-data text embedding model (v1) to outperform OpenAI Ada-002. The family evolved into v1.5 with Matryoshka embedding dimensions, v2 with a mixture-of-experts (MoE) architecture supporting roughly 100 languages, and a vision variant.
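Matryoshka embeddings are trained so that a prefix of the full vector is itself a usable (lower-fidelity) embedding, letting users trade dimensions for storage. A minimal numpy sketch of how a consumer would truncate such a vector; the dimensions and function name are illustrative, not Nomic's actual API:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates
    and re-normalize to unit length so cosine similarity still works.
    (A sketch of the idea, not Nomic's implementation.)"""
    sub = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(sub)
    return sub / norm if norm > 0 else sub

# Hypothetical 768-dim embedding truncated to 256 dims.
full = np.random.default_rng(0).normal(size=768)
short = truncate_embedding(full, 256)
print(short.shape)  # (256,)
```

The key point is that this only works well when the model was trained with a Matryoshka-style loss that supervises each nested prefix; truncating an ordinary embedding degrades quality much more sharply.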
Co-authored a multimodal vision embedding model that shares a latent space with Nomic Embed Text, enabling unified text-image retrieval with fully open weights and training code.
High-quality contrastive dataset for code retrieval across multiple programming languages, accepted at ICLR 2025. Powers nomic-embed-code and CodeRankEmbed models.
Open-source PyTorch library for training contrastive models, used as the training framework behind the Nomic Embed model family.
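The core objective such a contrastive training library optimizes is an in-batch InfoNCE loss: each query's positive document sits at the same batch index, and every other document in the batch serves as a negative. A numpy sketch of that objective for illustration; the real library trains PyTorch models, and the function name and temperature here are assumptions:

```python
import numpy as np

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch InfoNCE: cross-entropy over cosine similarities, where
    the correct "class" for query i is document i. A numpy sketch of the
    objective a contrastive training library optimizes, not its API."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_probs[idx, idx].mean()           # pick the diagonal (positives)

# Perfectly matched pairs yield a near-zero loss.
loss = info_nce_loss(np.eye(4), np.eye(4))
```

Because every other example in the batch doubles as a negative, larger batches give more negatives for free, which is why contrastive embedding training tends to benefit from very large batch sizes.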
Contributed to training BigRNA, an RNA foundation model predicting tissue-specific RNA expression, splicing, microRNA sites, and RNA binding protein specificity from DNA sequence.
We describe the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small.
Large language models have recently achieved human-level performance on a range of professional and academic benchmarks. The accessibility of these models has lagged behind their performance.
In contrast with other open-source models, we release the full curated training data and code to allow for full replication of nomic-embed-text-v1. This sets a new standard for transparency in the embedding model space.
GPT4All aims to democratize access to LLMs. Quantized 4-bit versions of the model allow virtually anyone to run it on a CPU.
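4-bit quantization stores each weight as a small integer plus a shared scale, shrinking the model roughly 8x versus float32 and making CPU-only inference practical. A simplified per-tensor symmetric scheme for illustration; real llama.cpp-style formats used by GPT4All quantize per block of weights, not per tensor:

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric 4-bit quantization sketch: map floats to integers in
    [-8, 7] with a single per-tensor scale. Production schemes (e.g.
    llama.cpp's block formats) use one scale per small block instead."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=1024).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step.
print(float(np.abs(w - w_hat).max()) <= s / 2 + 1e-6)  # True
```

Each weight now needs 4 bits instead of 32, at the cost of a bounded rounding error; per-block scales tighten that error further, which is why they are preferred in practice.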
Research generated March 19, 2026