Quiz

Harness Engineering

Question 1: What is harness engineering?

  • A. A framework for fine-tuning LLM weights.

  • B. The practice of shaping the environment around AI agents so they can work reliably — including context, constraints, specs, evaluation, and runtime.

  • C. A testing methodology for web applications.

  • D. The process of building evaluation benchmarks for language models.

Answer: B

Question 2: Why are agent performance gaps “often harness problems rather than model problems”?

  • A. Because models are always perfect and never make mistakes.

  • B. Because infrastructure choices — context management, tool design, safety constraints — shape agent behavior as much as the model’s capabilities.

  • C. Because harnesses replace the need for a good model.

  • D. Because harness engineering only applies to weak models.

Answer: B

Question 3: What is context condensation for long-running agents?

  • A. Deleting the entire conversation history when it gets too long.

  • B. Compressing completed subtasks and verbose outputs into summaries to free context budget while preserving key decisions.

  • C. Increasing the model’s context window size.

  • D. Splitting the context across multiple model calls without any summarization.

Answer: B
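
As an illustration of option B, here is a minimal sketch of condensation. It assumes a hypothetical `Turn` record with a `done` flag, and it stands in for real summarization by keeping only the first line of each finished turn; an actual harness would more likely ask a model to write the summary.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    done: bool = False  # True once the subtask this turn belongs to is finished


def condense(history, keep_last=2, max_chars=60):
    """Replace finished, older turns with one-line summaries.

    The most recent `keep_last` turns are always kept verbatim.
    Summarizing by first line is a placeholder, not a real summarizer.
    """
    cutoff = max(len(history) - keep_last, 0)
    condensed = []
    for i, turn in enumerate(history):
        if i < cutoff and turn.done:
            stripped = turn.text.strip()
            first_line = stripped.splitlines()[0] if stripped else ""
            condensed.append(Turn(turn.role, "[summary] " + first_line[:max_chars], True))
        else:
            condensed.append(turn)
    return condensed
```

The key decision (the first line) survives while the verbose tool output behind it is dropped, which is the trade the question describes.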

Question 4: An agent attempts to run rm -rf /. The harness blocks the action. Which harness component is responsible?

  • A. Context management

  • B. Specification files

  • C. Constraints and safety — specifically tool validation against blocked commands

  • D. Evaluation and observability

Answer: C
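
A rough sketch of what tool validation against blocked commands could look like. The `BLOCKED_COMMANDS` set and the `validate_command` helper are hypothetical names chosen for this example, not part of any particular harness.

```python
import shlex

# Hypothetical deny-list; a real harness would maintain its own.
BLOCKED_COMMANDS = {"rm", "mkfs", "dd", "shutdown"}


def validate_command(command: str) -> bool:
    """Return True if the command may run, False if the harness should block it."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unparseable input (e.g. unbalanced quotes) is rejected
    if not tokens:
        return False
    # Compare on the program name only, so "/bin/rm" is treated like "rm".
    program = tokens[0].rsplit("/", 1)[-1]
    return program not in BLOCKED_COMMANDS
```

This check only inspects the first token, so it would miss `sudo rm` or a command hidden behind a pipe. A deny-list like this is a starting point; an allow-list of permitted commands is usually harder to bypass.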

Question 5: What is the purpose of a scratchpad in agent harness engineering?

  • A. A temporary variable in Python.

  • B. An external file where the agent persists key decisions, findings, and progress so it can resume after context overflow or interruption.

  • C. A logging destination for error messages.

  • D. A UI element for the user to write notes.

Answer: B
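
One possible shape for a scratchpad, sketched under the assumption that a single JSON file is enough. The `Scratchpad` class and its keys (`decisions`, `completed`) are illustrative, not a standard format.

```python
import json
from pathlib import Path


class Scratchpad:
    """Persist agent notes to a JSON file so they survive a context reset."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # A missing file means a fresh start with empty sections.
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {"decisions": [], "completed": []}

    def record(self, key, note):
        # Read, append, and write back so every note reaches disk immediately.
        state = self.load()
        state.setdefault(key, []).append(note)
        self.path.write_text(json.dumps(state, indent=2), encoding="utf-8")
        return state
```

After an interruption, a new agent session can call `load()` on the same path and pick up the recorded decisions without replaying the earlier conversation.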

Question 6: What should a CLAUDE.md specification file contain?

  • A. The model’s training data and hyperparameters.

  • B. Build commands, file conventions, constraints on agent behavior, and examples of desired output.

  • C. Only the project’s README content.

  • D. A list of all files in the repository.

Answer: B

Question 7: Why do agent benchmarks (like SWE-bench) depend heavily on the harness?

  • A. Because benchmarks measure harness latency, not model quality.

  • B. Because the same model scores differently with different tool sets, context strategies, and retry policies — so you are comparing harnesses as much as models.

  • C. Because benchmarks are designed by harness vendors.

  • D. Because benchmarks ignore the model entirely.

Answer: B

Question 8: When should a harness require human-in-the-loop approval?

  • A. For every single tool call, to maximize safety.

  • B. For sensitive or hard-to-reverse actions like file deletion, git operations, sending messages, or modifying infrastructure.

  • C. Never — autonomous agents should operate without human intervention.

  • D. Only when the model explicitly requests it.

Answer: B
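
A minimal sketch of an approval gate along the lines of option B. The action names in `SENSITIVE_ACTIONS` and the `approve` callback are assumptions made for this example; in practice `approve` would prompt a person.

```python
# Hypothetical set of actions that are sensitive or hard to reverse.
SENSITIVE_ACTIONS = {"delete_file", "git_push", "send_message", "modify_infra"}


def run_action(action, execute, approve):
    """Run execute() directly for routine actions.

    For sensitive actions, call approve(action) first and skip
    execution entirely if the human declines.
    """
    if action in SENSITIVE_ACTIONS and not approve(action):
        return "denied"
    return execute()
```

Routine actions never reach the human, which avoids the approval fatigue that option A would cause.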

Question 9: What is trajectory-level evaluation for agents?

  • A. Evaluating only the final output of the agent.

  • B. Scoring the entire sequence of actions — including efficiency, safety violations, and whether intermediate steps contributed to the goal.

  • C. Measuring the model’s training loss curve.

  • D. Counting the number of API calls.

Answer: B
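
To show how this differs from checking only the final output, here is a small sketch of a trajectory scorer. The `Step` fields and the returned metric names are hypothetical, and it assumes each step has already been labeled as useful or not, which is the hard part in practice.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    useful: bool             # did this step contribute to the goal?
    violation: bool = False  # did this step break a safety constraint?


def score_trajectory(steps, succeeded):
    """Summarize a whole run, not just whether it ended in success."""
    total = len(steps)
    useful = sum(s.useful for s in steps)
    return {
        "succeeded": succeeded,
        "steps": total,
        "efficiency": useful / total if total else 0.0,
        "violations": sum(s.violation for s in steps),
    }
```

Two runs that both succeed can still score very differently here, for example if one of them took a wasted step or tripped a safety constraint along the way.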

Question 10: An agent working on a 50-file refactor runs out of context at step 30. What harness improvement would help?

  • A. Switch to a larger model.

  • B. Implement checkpointing and context condensation so the agent can persist progress to disk, compress old context, and resume.

  • C. Reduce the number of files to refactor.

  • D. Remove the system prompt to free tokens.

Answer: B
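
The checkpointing half of option B can be sketched as a loop that writes progress to disk after every file. The function name `refactor_all`, the `refactor_one` callback, and the `budget` parameter (a stand-in for running out of context) are all assumptions made for this example.

```python
import json
from pathlib import Path


def refactor_all(files, checkpoint_path, refactor_one, budget=None):
    """Process files in order, skipping any already listed in the checkpoint.

    `budget` caps how many files one session may handle, simulating
    an agent that exhausts its context partway through.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    processed = 0
    for name in files:
        if name in done:
            continue  # finished in an earlier session
        if budget is not None and processed >= budget:
            break  # out of budget; progress so far is already on disk
        refactor_one(name)
        done.append(name)
        path.write_text(json.dumps(done), encoding="utf-8")
        processed += 1
    return done
```

Calling the function again with the same checkpoint path resumes at the first unfinished file, so a run that stops at step 30 does not redo the first 29.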

Question 11: What is the difference between a sandbox and a human-in-the-loop gate?

  • A. They are the same thing.

  • B. A sandbox isolates the agent’s execution environment (e.g., Docker, worktree) to limit blast radius; a gate pauses execution to ask a human for approval before a specific action.

  • C. A sandbox is for production; a gate is for development.

  • D. A sandbox blocks all actions; a gate allows all actions.

Answer: B

Question 12: Which metric best indicates that an agent is “wandering” rather than making progress?

  • A. Total token usage

  • B. Steps per task — a high step count relative to task complexity suggests the agent is retrying, looping, or taking unnecessary actions.

  • C. Task completion rate

  • D. Model temperature setting

Answer: B
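
As a rough illustration of "steps relative to task complexity", the sketch below flags tasks whose step count far exceeds an expected count. The `expected_steps` values and the `max_ratio` threshold are assumptions; a team would need to calibrate both from its own runs.

```python
def flag_wandering(runs, max_ratio=3.0):
    """Return task ids whose step count exceeds max_ratio times the expected count.

    runs: mapping of task id -> (steps_taken, expected_steps).
    """
    flagged = []
    for task_id, (steps, expected) in runs.items():
        if expected > 0 and steps / expected > max_ratio:
            flagged.append(task_id)
    return flagged
```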

Question 13: In multi-agent coordination, what role does shared state play?

  • A. It replaces the need for individual agent context windows.

  • B. It allows agents to exchange work products (plans, code, reviews) through files, databases, or message queues without needing to share context windows.

  • C. It synchronizes model weights between agents.

  • D. It stores the conversation history for all agents in one place.

Answer: B
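
A minimal sketch of file-based shared state, assuming agents agree on a shared directory and on names for work products. The helpers `publish` and `fetch` are hypothetical; a database or message queue would fill the same role.

```python
import json
from pathlib import Path


def publish(shared_dir, name, payload):
    """Write a work product (plan, review, ...) where other agents can find it."""
    path = Path(shared_dir) / f"{name}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def fetch(shared_dir, name):
    """Read a work product that another agent published under `name`."""
    path = Path(shared_dir) / f"{name}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

A planner agent might publish a `plan` and a coder agent fetch it later; neither needs to see the other's conversation, only the artifact.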

Question 14: What is the recommended approach when you discover a new agent failure mode?

  • A. Retrain the model to handle the failure.

  • B. Tighten the relevant constraint, add a test case, and update the spec — the harness evolves with the agent as you discover failure modes.

  • C. Switch to a different model.

  • D. Ignore it if it only happens rarely.

Answer: B