Vol. 01 · 2026

No. 01 · For Hire

From Kitsap, WA

You write tests for your code. Your AI feature has none. We build the eval harness, so you can change the prompt, swap the model, refactor the tools, and know before your customers do.

Wolf Peach Labs is the studio of Shimin Zhang, a software engineer with a data scientist’s training. Any team with Claude Code can get a model to do the thing once. The hard part is knowing it still does the thing after you touch it. We build the eval that turns “seems fine” into a number. And the working code that passes it. Two to four engagements a year, no subcontracting, no “discovery call.”

Currently

01Booking eval & AI engagements for Q3 2026.
02Building one product. In stealth.
03Replies sent within a day or two.

shimin@wolfpeachlabs.com

§ II · Services

Three kinds of work, all eval-first. Pick the closest.

No. 01 Cycle 6–12 wk

Ship the AI feature, eval-first.

What it actually is

Six to twelve weeks. We build the eval first: the dataset, the graders, the pass/fail bar. Then ship the feature against it. You get working code in your repo and the eval that proves it, so the next engineer can change it without holding their breath.

What it isn’t

A strategy slide, a rebuilt platform, or a pilot you have to operationalize later. We hand you a feature your team can safely change and keep shipping.

[†] More specifically…

Worse: the working prototype on a laptop nobody can deploy. The eval that runs locally and the team can’t reproduce. The agent loop that demos well and burns through your token budget the first time a real user asks a follow-up question. No metric would have caught it.

We say “this isn’t ready” before you have to.

No. 02 Cycle 2–3 wk

Second opinion before the rebuild.

What it actually is

Two to three weeks. Architecture review, eval audit, cost-of-quality model. An honest read of where the AI feature is brittle and whether your evals measure what you think they do, or whether “better” has been a vibe this whole time. In writing. No slide deck.

What it isn’t

Generic best-practice slides. We won’t tell you to add observability without saying which spans, which evals, and what the on-call should do at 2am when one of them flips.

No. 03 Cycle 8–16 wk

Greenfield v1 for founders.

What it actually is

Eight to sixteen weeks. From sketch to deploy. Web app, agent system, internal tool, the cloud and data plumbing under it. The eval ships with it, so the junior team that inherits it on day one can keep changing it safely.

What it isn’t

A throwaway prototype that demos well and doesn’t deploy. We’re building the thing your customers will use, the version that has to survive real traffic.

Or, ongoing.

No. 04 Retainer Quarterly · ~1 day/wk

Senior eval cover for the team shipping AI features.

What it actually is

One day a week, in your repo and your PRs. Eval design, the “does this regression matter” call, the “is this safe to ship” read on a Friday afternoon. Billed by the quarter, 30-day notice either way.

What it isn’t

A fractional-CTO retainer. No strategy days, no quarterly slide decks. The deliverable looks the same as the project work: code in the repo, evals running, the team able to do tomorrow what we did together today.

Engagements · 2–4 / year Senior people · No subcontracting Working hours · PT, async tolerant

§ III · Imprint · A Catalog of Two Wolf Peach · 2026 Open + Stealth

§ III · Products

The studio also ships under its own name.

Imprint
Wolf Peach Labs

Series
Two titles: one open, one in stealth

Format
Software, shipping 2026

inhabited-design, the open title. A Claude Code skill for landing pages: a designer and an ICP subagent in a loop, until a seven-factor rubric clears.

The other is in stealth: the eval platform behind the practice, for teams who train, evaluate, and ship language-model features in production.

We’ll say more about the stealth title when there’s a working build to look at. Until then there’s nothing worth showing. If you want a peek before that, drop a line.

No. 01 Untitled, in stealth 2026

No. 02 inhabited-design Open 2026

End of catalog

Building now ·drop a line for a peek. Cat. WPL-2026-01

§ IV · About

The studio of Shimin Zhang.

Shimin has been shipping software since 2012, but the through-line is measurement: a master’s from Georgia Tech specializing in machine learning, peer-reviewed research on learning at scale and biologically inspired design. Most recently he led Variance AI Explanation at FloQast from prototype to production. Its whole job was to measure and explain deviation. He also wrote the MCP server at Charter that lets AI coding assistants use the Kite Design System the same way the engineers do, and co-hosts Artificial Developer Intelligence, a weekly podcast on AI-assisted development for engineers who have to ship AI features for a living.

Wolf Peach Labs is what came after. A small number of engagements per year, by design, so the work stays honest and the relationships stay direct. No account managers, no juniors on the ticket, no rebadged subcontractors. Why only two. One project at a time, in your repo, written by the same person who replies to your email. A week of close reading and building the eval set before any architecture decision; days of on-call when the eval flips at 2am. Calendar holds two of those comfortably and four with care; ten would mean junior backfill and the kind of "we'll get back to you" that this whole studio exists to avoid.

A feature you can’t measure is a feature you can’t change.

The name is a footnote. wolf peach was the old European word for the tomato, back when it was suspected of being poisonous. Ordinary now. It took someone to be the first to eat one. We like that. Lat. lycopersicon, “wolf-peach.” Linnaeus, 1753. The tomato was widely held in northern Europe to be poisonous; it was eaten in southern Europe a century earlier. Common about 1820.

Entity Wolf Peach Labs LLC

Engagements 2–4 per year

Working with Founders, product teams, research groups

Operating posture Senior, in-repo, eval-first

Press & podcast Artificial Developer Intelligence (host)

[‡] Why only two

One project at a time, in your repo, written by the same person who replies to your email. A week of close reading and building the eval set before any architecture decision; days of on-call when the eval flips at 2am. Calendar holds two comfortably, four with care. Ten would mean junior backfill, the kind of “we’ll get back to you” this studio exists to avoid.

§ V · Get in touch

Email is the medium.

Tell me what you’re building, what you’re scared to touch, and roughly when you’d like to start. I read everything; I reply to everyone, usually within a day or two. If we’re a fit we’ll get on a call. If we’re not I’ll say so and point you somewhere better when I can.

shimin@wolfpeachlabs.com

No forms. No funnel. No discovery call. Replies sent from the same address.