# candidate.md — Ming (Amory) Mu / 慕铭

**AI Product Manager** · sonikming@gmail.com · 1119532205@qq.com · github.com/AmoryMing · linkedin.com/in/amorymu · profile.mingmingzi.com

---

## What I do

I'm an AI Product Manager at Chinadaas. I build prototypes to validate ideas, then design the evaluation that says whether they're worth keeping.

What that looks like in practice:

- I argued, on the eval project, that the rubric was the bug, not the model. So I let the dimensions grow themselves on Langfuse instead of writing them up front. 11 came out of 10 traces. Three I would never have written.
- I extended qibook from "query companies" to "query people." That was a new business line, not a feature. Four directions failed before the 200-word brief shipped.
- I raised a Feishu-resident agent (OpenClaw) and a PM alter-ego, Amory. Skills came from mining my own Claude Code sessions, not from a list of "what a PM does." The 30-minute reporting flow is now one Feishu sentence.

I won't pretend everything works. The forum I designed has 8 users and 2 likes. Several of my own Skills aren't optimized to the last step. The eval methodology has a closed-loop step I haven't run yet.

---

## Experience

### Chinadaas (中数智汇) — AI Product Manager
**Jul 2025 – Present**

**Evaluation system for Qibaike's conversational product (self-initiated).**
The product launched without an acceptance bar. PMs said "the answers are bad" without evidence; engineers tweaked prompts with no way to tell whether things got better or worse. Rubrics had a short shelf life: moving from "reports" to "dialogue" invalidated half my dimensions in three weeks.

I split the annotation taxonomy by tool-call paths (19 simple + 19 complex query types, frozen after 6–9 iterations). For labels I used three-model consensus voting (Qwen + GLM + Kimi for verdicts; Qwen-3.6 + GLM-5.1 + DeepSeek-V3.2, majority 2/3, for coverage). Then I stopped writing rubrics and let Langfuse surface them: 11 evaluation dimensions emerged from 10 real traces, three of which I would never have anticipated.
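
A minimal sketch of the 2-of-3 consensus step, assuming one verdict string per model; the model keys, label names, and escalation rule are illustrative, not the production prompts or label set:

```python
from collections import Counter

def consensus_label(verdicts: dict[str, str]) -> tuple[str | None, bool]:
    """Majority-vote a label from per-model verdicts, e.g.
    {"qwen": "false_refusal", "glm": "false_refusal", "kimi": "refusal_ok"}.
    Returns (label, unanimous); a None label means no 2-of-3 majority,
    so the trace is escalated to human review."""
    counts = Counter(verdicts.values())
    label, votes = counts.most_common(1)[0]
    if votes < 2:                      # three-way split: no consensus
        return None, False
    return label, votes == len(verdicts)
```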

The number that mattered wasn't the score: 68% of the model's refusals were *false* refusals. The team had been optimizing the wrong layer. Potential lift from fixing that layer: +23%.

**OpenClaw + PM alter-ego Amory.**
A Feishu-resident agent. Heartbeat protocol decides what's worth doing autonomously; circuit breaker stops it from burning tokens on retries. The Amory side came from behavior archaeology. I read my own Claude Code sessions and Redmine logs, found the actions I actually do, and packaged the high-frequency ones into Skills (briefing, requirement-writing, data analysis). My old 30-minute reporting flow is now one Feishu @-mention.
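
The circuit breaker is roughly this shape, assuming a failure-count threshold and a cooldown; OpenClaw's actual thresholds and heartbeat scheduling aren't shown, and the names here are illustrative:

```python
import time

class CircuitBreaker:
    """Stops an agent action after repeated failures so retries
    don't keep burning tokens; stays open for a cooldown period."""
    def __init__(self, max_failures: int = 3, cooldown_s: float = 600):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0   # half-open: try once more
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
```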

**qibook.com — new "people lookup" business line. 0 to shipped.**
The company could only query companies, but 7.8% of real user queries already wanted people lookup. The first four attempts all missed: long reports nobody read, 22 visualizations that never converged, five bank-specific versions that all dead-ended at PMF 5.25, and a 20-dimension framework that contradicted the 10-second-read goal.

I reverse-distilled six identity classes by reading how bank relationship managers actually talk about people in deals, not by interviewing them. Then I walked the 19-API, 1,445-field data surface and matched it to 1,245 real user conversations to write the data-requirement spec.
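
The matching step amounts to counting which fields real conversations actually touch. A rough sketch, assuming conversations as plain text and a hand-built alias map per field; both inputs and the function name are mine, not the actual spec tooling:

```python
from collections import Counter

def field_coverage(conversations: list[str],
                   field_aliases: dict[str, list[str]]) -> Counter:
    """Count how many user conversations mention each data field.
    field_aliases maps a field name (from the 19-API surface) to the
    phrases users actually use for it; both inputs are illustrative."""
    hits: Counter[str] = Counter()
    for text in conversations:
        for field, aliases in field_aliases.items():
            if any(alias in text for alias in aliases):
                hits[field] += 1
    return hits   # tells you which of the 1,445 fields conversations ever reach
```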

The shipped version is 200 words. Three paragraphs. Six identity classes, four layout types, three control levels. A bank manager reads it in 10 seconds and moves on. The product has a data ceiling: public registry data shows stake and role, not intent. I'd rather say that than oversell.

**Qibaike GEO (AI search visibility, self-initiated).**
Searched our own product on Doubao. First page didn't have us. AI-readiness score: 0.67/10. Three root causes, not one: the SPA renders empty for AI crawlers, anti-scraping was switched on, and the site was deliberately closed to outside traffic because we're B2B.

The SPA semantic-injection spec was adopted by engineering. Then I went to the application-layer team with baseline data and made the case for opening the site to AI crawlers; that conversation was the harder part of the project. I reverse-ranked 246 AI search results to figure out where citations come from: 70%+ come from external platforms, not our own site. Optimizing your own site is necessary, but it's not where the ROI is.
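
The reverse-ranking boils down to tallying citation domains across the 246 results. A sketch under the assumption that each scraped result carries a list of cited URLs; the `citations` field name is mine, not any platform's actual schema:

```python
from collections import Counter
from urllib.parse import urlparse

def citation_sources(results: list[dict]) -> Counter:
    """Tally which domains AI search answers cite.  Each result is
    assumed to carry a 'citations' list of URLs (illustrative shape)."""
    domains: Counter[str] = Counter()
    for r in results:
        for url in r.get("citations", []):
            domains[urlparse(url).netloc] += 1
    return domains
```

Splitting own-site domains from everything else is what makes the 70%+ external-platform share visible.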

**Internal forum + bid proposal + tender knowledge base.**
Mapped Git verbs onto agent collaboration on Discourse: regular agents can push / comment / pull / star; admin agents add review / merge / rebase. Scaffolded a ¥1.76M, 9-chapter bid proposal in one Claude session. Designed a tender knowledge base where the AI updates the user profile after each interaction and the user confirms or rejects each change.
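
The mapping itself is small enough to show. A sketch of the permission table, with the role names and helper as illustrative placeholders rather than the actual Discourse configuration:

```python
# Git verbs mapped to agent roles; roles and helper are illustrative.
AGENT_VERBS: dict[str, set[str]] = {
    "regular": {"push", "comment", "pull", "star"},
    "admin":   {"push", "comment", "pull", "star", "review", "merge", "rebase"},
}

def can(role: str, verb: str) -> bool:
    """True if an agent with this role may perform this collaboration verb."""
    return verb in AGENT_VERBS.get(role, set())
```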

The forum has 8 users and 2 likes. I wrote the post-mortem anyway. The takeaway I now use as a rule: don't make UI decisions before you have user behavior data.

---

## Education

- **Middlebury Institute of International Studies (MIIS)** — M.A. in Translation & Localization Management (STEM) · 2023.09 – 2024.06 · GPA 3.9 / 4.0 (Sponsored exchange from BFSU, 2023–2024)
- **Beijing Foreign Studies University (BFSU)** — M.A. in Conference Interpreting · 2022.09 – 2025.06 · GPA 3.8 / 4.0
- **Xiamen University** — B.A. in English Literature (Top 5%) · 2018.09 – 2022.06 · GPA 3.9 / 4.0

---

## Independent projects

- **AI PMF Validator** — An AI virtual focus group on the open-source OASIS platform: five agents, five personas, one piece of marketing copy. They read it and react. A pre-launch mirror, not a substitute for a real focus group (a minimal sketch follows this list).
- **Life Simulator** — Conversational personality capture, 5-layer variable system, branching life simulation. The pitch isn't prediction. It's persuasion: show 100 parallel timelines and let the distribution do the talking.
- **deep-decode pipeline** — One article in, four formats out: long-form decode, infographic, Word doc, podcast audio. Three-source cross-verification.
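
For the AI PMF Validator, the core loop is a fan-out of one piece of copy to several persona agents. A minimal sketch, assuming any prompt-to-completion callable; the persona texts and prompt wording are illustrative, not the OASIS configuration:

```python
from typing import Callable

def focus_group(copy: str, personas: list[str],
                llm: Callable[[str], str]) -> dict[str, str]:
    """Have each persona agent read the marketing copy and react.
    `llm` is any prompt -> completion callable; personas are short
    character descriptions (five of them in the real setup)."""
    reactions = {}
    for persona in personas:
        prompt = (f"You are {persona}.\n"
                  f"Read this marketing copy and react in a few sentences, "
                  f"including whether you would click through:\n\n{copy}")
        reactions[persona] = llm(prompt)
    return reactions
```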

---

## Skills

**AI & Agents** — Claude Code, Claude.ai, Claude App, MCP, multi-agent orchestration, prompt engineering, eval methodology, Langfuse, Coze, LangGraph

**Product** — PRDs, JTBD, A/B testing, annotation-taxonomy design, intent classification, scenario simulation, user research, cross-team influence, data-requirements methodology

**Languages** — CATTI Level 1 (translation), CATTI Level 2 (interpreting), native-level Chinese and English

---

## Chinese résumé

慕铭 (Ming Mu) / Amory Mu · AI Product Manager · sonikming@gmail.com · linkedin.com/in/amorymu · github.com/AmoryMing

**What I do**: I build prototypes myself to validate ideas, then design the evaluation that tells me whether they hold up.

**Education**

- Middlebury Institute of International Studies (MIIS), M.A. in Translation & Localization Management (STEM), 2023.09 – 2024.06, GPA 3.9 / 4.0 (sponsored exchange from BFSU)
- Beijing Foreign Studies University (BFSU), M.A. in Conference Interpreting, 2022.09 – 2025.06, GPA 3.8 / 4.0
- Xiamen University, B.A. in English Literature (Top 5%), 2018.09 – 2022.06, GPA 3.9 / 4.0

**Experience (Chinadaas 中数智汇 · 2025.07 – Present)**

- Self-initiated the evaluation system for Qibaike's conversational product. Split the annotation taxonomy by tool-call paths, used three-model consensus voting, and let dimensions emerge on Langfuse: 10 traces yielded 11 dimensions, 3 of which a pre-written rubric would never have produced. The most useful finding: 68% of refusals were false refusals, and the team had been optimizing the wrong layer.
- OpenClaw + PM alter-ego Amory. A heartbeat protocol decides what's worth doing autonomously; a circuit breaker keeps retries from burning through tokens. Skills were mined from my own Claude Code sessions and Redmine logs. The 30-minute reporting flow is now one Feishu sentence.
- qibook.com people-lookup business line. Hit four walls: long reports nobody read, 22 visualization versions that never converged, 5 bank-specific custom versions that all dead-ended on PMF, and a 20-dimension framework that contradicted the 10-second-read goal. What shipped was a 200-word, three-paragraph brief.
- Self-initiated the GEO project. Three root causes: SPA rendering + anti-scraping + a site closed to outside traffic. The SPA semantic-injection spec was adopted by engineering, and I pushed cross-team to open the site. Reverse-engineering 246 AI search results showed that 70%+ of citations come from external platforms.
- Internal forum + bid proposal + tender knowledge base. Mapped Git verbs onto agent-collaboration semantics. Scaffolded a ¥1.76M, 9-chapter bid proposal in one Claude session. The forum has only 8 users and 2 likes; I wrote the post-mortem anyway.

**Independent projects**: AI PMF Validator · Life Simulator · deep-decode content pipeline

**Languages**: CATTI Level 1 (translation), CATTI Level 2 (interpreting); native-level Chinese and English

_Last updated 2026-04._
