WIRELOOP
The Lab · Field report

The Lab

We keep up with the frontier. We don't train models, we don't run a cluster, and we're not pretending to be one of the places that do. What we do is read the work, rebuild against it, and find out which parts survive contact with real projects.

The Lab exists because this field reinvents itself every few months and a stack that felt sharp last quarter is blunt today. When something shifts, we rebuild our harness around it and hand the working pieces to the Workshop.

  • Updated: 2026-04
  • Mode: Harness engineering
  • Status: Active
Timeline

The road so far.

  1. Copy-Paste Intelligence

    GPT-4 in a browser tab, prompts pasted into VS Code by hand, loops closed with human patience. Crude, but enough to see the shape of what was coming.

  2. API & RAG Era

    Assistants API, vector search, retrieval pipelines, first real integrations. The model stopped being a chatbot and started being a component.

  3. Reasoning & Local Models

    o1/o3-style reasoning on one side, Ollama and Llama/Mistral/Gemma on local hardware on the other. Privacy and latency became design choices, not compromises.

  4. Agent & Harness Era

    Claude Code, Cursor, OpenClaw. The unit of work is no longer a prompt — it's a session, with tools, memory, and a long horizon.

Each shift forced us to rebuild our harness from scratch. That rebuild is what the Lab is for.

Core thesis

Harness engineering.

What we mean by harness

The harness is everything around the model: the prompts, the memory, the tools it can call, how routing decides which model handles what, how sessions persist, how outputs get verified, and how safety and failure modes are handled. The weights are one component in a much larger system, and the rest of that system is where most real-world behavior is actually decided.
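That system can be sketched as a single loop. This is a minimal illustration, not our actual stack: `route`, `verify`, and the stubbed `call_model` are hypothetical names, and tool calling is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    history: list = field(default_factory=list)  # persisted turn memory


def route(task: str) -> str:
    # Routing decides which model handles what; a trivial length rule here.
    return "local-small" if len(task) < 40 else "frontier-large"


def call_model(model: str, prompt: str) -> str:
    # Stub standing in for a real inference call.
    return f"[{model}] answer to: {prompt}"


def verify(output: str) -> bool:
    # Verification gate; a real harness runs checks, tests, or evals here.
    return output.startswith("[")


def run_turn(session: Session, task: str) -> str:
    model = route(task)                     # model selection
    output = call_model(model, task)        # the weights are one component
    if not verify(output):
        raise ValueError("verification failed")
    session.history.append((task, output))  # session persistence
    return output
```

Everything except `call_model` is harness: swap the stub for a different model and the routing, verification, and memory keep working unchanged.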

Why it matters more than the model

  • Pan et al.'s work on natural-language agent harnesses shows smaller models with a stronger harness beating larger models on practical tasks — the scaffolding carries more signal than the parameter count.
  • LangChain's climb on TerminalBench didn't come from a new model; it came from harness changes — better tools, better loops, better verification around the same underlying weights.
  • Meta-Harness results suggest an optimized harness transfers across models: the work compounds, the model is swappable.

Where we focus

Memory and context management, tool and agent scaffolding, local inference and privacy-preserving deployments, and reliability — verification, evals, and the boring safety work that decides whether an agent is actually usable in production.

CLI as an agent-native interface

For agents, the command line is often a better surface than a GUI or bespoke protocol: commands have clear arguments, exit codes, and structured output, which makes failure detection and retries straightforward. Projects like CLI-Anything show how you can wrap any app or API in an agent-drivable CLI with predictable --json output and --help discovery, and plug it straight into frameworks like Claude Code or OpenClaw. We're moving our harnesses in that direction: CLI-first, with MCP and other protocols sitting on top, so the same contracts work locally, in CI, and inside autonomous agents.
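A minimal sketch of that contract, assuming a hypothetical `wrap-search` wrapper and a stubbed `run` helper; only Python's standard `argparse` and `json` are used.

```python
import argparse
import json
import sys


def run(query: str) -> dict:
    # Hypothetical wrapped call; a real wrapper would invoke the underlying
    # app or API here and normalize its output.
    return {"query": query, "hits": []}


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(
        prog="wrap-search",  # hypothetical wrapper name
        description="Agent-drivable CLI: --help for discovery, "
                    "--json for structured output.")
    parser.add_argument("query", help="term to pass to the wrapped tool")
    parser.add_argument("--json", action="store_true", dest="as_json",
                        help="emit machine-parseable JSON on stdout")
    args = parser.parse_args(argv)
    try:
        result = run(args.query)
    except Exception as exc:
        # A non-zero exit code makes failure detection and retries trivial.
        print(json.dumps({"error": str(exc)}), file=sys.stderr)
        return 1
    if args.as_json:
        print(json.dumps(result))
    else:
        print(f"{len(result['hits'])} hits for {result['query']!r}")
    return 0
```

An agent can then discover the tool through `--help` and parse the `--json` output with a plain JSON reader; the same contract works locally, in CI, or inside an agent loop.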

Inventory

What we actually use.

  • Context & memory: Obsidian + MCP for knowledge capture, Supabase for structured state, semantic search over both.
  • Agent scaffolding: OpenClaw for multi-channel agents, model-agnostic routing, custom session state.
  • Models: Claude, Gemini, GPT-class frontier models for hard tasks; Mistral, Llama, Gemma via Ollama when it needs to stay local.
  • Voice & multimodal: Chatterbox TTS, Google Live APIs, streaming pipelines for real-time interaction.
  • Dev tooling: Claude Code and Cursor for building, Perplexity for research, in-house harness patterns on top.
  • CLI harness layer: Unix-style commands as the primary contract for agents, with CLI-Anything-style tooling to turn existing software and APIs into agent-drivable CLIs.

Status board

What we're tracking.

  Track                 | Status     | Notes
  Local inference       | ACTIVE     | Gemma/Mistral on consumer GPUs; privacy-first deploys.
  Harness & scaffolding | ACTIVE     | OpenClaw iterations; verification and session memory.
  Memory & context      | ACTIVE     | MCP + Obsidian + Supabase; long-horizon recall.
  Voice & multimodal    | MONITORING | Chatterbox, Live APIs; latency is the blocker.
  Frontier models       | ONGOING    | Claude / Gemini / GPT updates; routing + evals.

Handoff

Into the Workshop.

The Lab figures out what works. The Workshop ships it. When a harness pattern survives a few projects — game narration, tutoring agents, multi-channel customer systems — it stops being research and becomes part of how we build.