Principal ML Engineer | Nomic AI
Co-creator of GPT4All and lead author of Nomic Embed, building fully open and reproducible embedding models and local LLM infrastructure.
Biography
Zach Nussbaum is a Principal Machine Learning Engineer at Nomic AI, where he leads the development of the Nomic Embed family of open-source text and vision embedding models. He is a co-creator of GPT4All, the pioneering open-source local LLM chatbot that became the third-fastest-growing GitHub repository of all time with over 77,000 stars and 250,000 monthly active users. Before Nomic, he worked at Deep Genomics on machine learning for drug discovery, contributing to BigRNA, a foundation model for tissue-specific RNA expression prediction. A former Division I baseball player at Davidson College, he also participated in open-community research with ML Collective and OpenBioML.
Co-created the open-source local LLM chatbot that became the third-fastest-growing GitHub repository of all time (77k+ stars, 250k+ MAU). Demonstrated that capable assistant-style chatbots could run on consumer CPUs without API calls.
Lead author of the first fully reproducible, open-source, open-data text embedding model (v1) to outperform OpenAI Ada-002. The family evolved into v1.5 with Matryoshka embedding dimensions, v2 with a mixture-of-experts (MoE) architecture supporting roughly 100 languages, and a vision variant.
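Matryoshka embeddings are trained so that a prefix of the full vector is itself a usable (lower-fidelity) embedding, letting users trade dimensions for storage. A minimal numpy sketch of how a consumer would truncate such a vector; the dimensions and function name are illustrative, not Nomic's actual API:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates
    and re-normalize to unit length so cosine similarity still works.
    (A sketch of the idea, not Nomic's implementation.)"""
    sub = np.asarray(vec, dtype=np.float64)[:dim]
    norm = np.linalg.norm(sub)
    return sub / norm if norm > 0 else sub

# Hypothetical 768-dim embedding truncated to 256 dims.
full = np.random.default_rng(0).normal(size=768)
short = truncate_embedding(full, 256)
print(short.shape)  # (256,)
```

The key point is that this only works well when the model was trained with a Matryoshka-style loss that supervises each nested prefix; truncating an ordinary embedding degrades quality much more sharply.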
Co-authored a multimodal vision embedding model that shares a latent space with Nomic Embed Text, enabling unified text-image retrieval with fully open weights and training code.
High-quality contrastive dataset for code retrieval across multiple programming languages, accepted at ICLR 2025. Powers nomic-embed-code and CodeRankEmbed models.
Open-source PyTorch library for training contrastive models, used as the training framework behind the Nomic Embed model family.
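The core objective such a contrastive training library optimizes is an in-batch InfoNCE loss: each query's positive document sits at the same batch index, and every other document in the batch serves as a negative. A numpy sketch of that objective for illustration; the real library trains PyTorch models, and the function name and temperature here are assumptions:

```python
import numpy as np

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch InfoNCE: cross-entropy over cosine similarities, where
    the correct "class" for query i is document i. A numpy sketch of the
    objective a contrastive training library optimizes, not its API."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_probs[idx, idx].mean()           # pick the diagonal (positives)

# Perfectly matched pairs yield a near-zero loss.
loss = info_nce_loss(np.eye(4), np.eye(4))
```

Because every other example in the batch doubles as a negative, larger batches give more negatives for free, which is why contrastive embedding training tends to benefit from very large batch sizes.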
Contributed to training BigRNA, an RNA foundation model predicting tissue-specific RNA expression, splicing, microRNA sites, and RNA binding protein specificity from DNA sequence.
We describe the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small.
Large language models have recently achieved human-level performance on a range of professional and academic benchmarks. The accessibility of these models has lagged behind their performance.
In contrast with other open-source models, we release the full curated training data and code to allow for full replication of nomic-embed-text-v1. This sets a new standard for transparency in the embedding model space.
GPT4All aims to democratize access to LLMs. Quantized 4-bit versions of the model allow virtually anyone to run it on a CPU.
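4-bit quantization stores each weight as a small integer plus a shared scale, shrinking the model roughly 8x versus float32 and making CPU-only inference practical. A simplified per-tensor symmetric scheme for illustration; real llama.cpp-style formats used by GPT4All quantize per block of weights, not per tensor:

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric 4-bit quantization sketch: map floats to integers in
    [-8, 7] with a single per-tensor scale. Production schemes (e.g.
    llama.cpp's block formats) use one scale per small block instead."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=1024).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step.
print(float(np.abs(w - w_hat).max()) <= s / 2 + 1e-6)  # True
```

Each weight now needs 4 bits instead of 32, at the cost of a bounded rounding error; per-block scales tighten that error further, which is why they are preferred in practice.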
Research generated March 19, 2026