Optimize your prompts with hill-climbing evaluations

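Learn practical techniques for comparative evaluation that guide your prompt engineering and help you choose the best model for your app. Explore how to establish a performance baseline, scale your evaluation strategy, and convert results to JSON for integration with other tools. Learn when each prompting strategy applies and how to iterate on your prompts to get the best results.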

Chapters
- 0:00 - Introduction
- 2:42 - BookTracker's tagging problem
- 5:27 - Analyzing the evaluation results
- 8:26 - Drift between judge and human
- 9:37 - Measuring drift with Cohen's kappa
- 12:26 - Building a judge alignment evaluation
- 15:16 - Analyzing alignment failures
- 17:16 - Comparative evaluation: control vs experimental
- 19:12 - Refining the scoring dimensions
- 21:23 - Adding few-shot examples to the judge
- 23:38 - Going beyond prompts: adding a tool
- 27:17 - Next steps
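My name is Marcus, and I'm a manager on the Evaluations framework team. I'm excited to show you how to use Evaluations to improve your intelligence features. You probably already know that using AI in your app is a powerful way to give people new, personalized experiences. It adds a depth to your app that traditional software couldn't achieve before. But knowing whether your intelligence feature behaves as expected in every situation is a challenge. That's why we introduced the Evaluations framework, which gives you the tools you need to ship with confidence.

Shipping with confidence takes more than a framework. The Evaluations framework also supports hill-climbing, an iterative improvement process that uses evaluation scores as a guide to steadily raise the quality of your feature. Hill-climbing starts in the develop phase, where you make the changes you want to measure against your existing feature. Once your changes are in, you run the evaluation to see whether the results meet your expectations. Then you analyze the results to understand where your feature can still improve. Making full use of the hill-climbing loop is a great way to improve a feature systematically. But effective hill-climbing isn't just about following the loop. It also takes a little... scientific thinking.

So in this video, I'll walk you through how to improve a prompt by following the hill-climbing loop while bringing some scientific thinking to the process. Next, I'll show you how to run comparative evaluations that make hill-climbing easier. Finally, we'll go beyond prompt changes and raise quality by improving other aspects of the intelligence feature.

Before we continue: this video covers hill-climbing on an existing evaluation. That means you've already written the foundation of your evaluation pipeline and have a full picture of your feature's strengths and weaknesses. If you're not familiar with how to do that, check out our other video, "Meet the Evaluations framework", which covers everything you need to know to build a great evaluation pipeline. With that out of the way, let's get started.

In "Meet the Evaluations framework", we introduced Book Tracker. In case you've forgotten, Book Tracker lets readers organize and review books. I've been reading a lot of classics lately and have added them to my list. In fact, I just finished Treasure Island, a thought-provoking read that explores the tension between loyalty and betrayal. One of Book Tracker's new features is a tagging service that uses a model to generate tags from a reader's review. The tags for this review cover the book's overall themes, but I feel like something is missing. I expected to see tags like "tense" or "morally ambiguous", which capture the themes of the story. The tags generated for Little Women have a similar problem. A tag like "touching" says more about the reader's feelings than about the book's content. Emotional expression is great in a review, but it doesn't belong in a list of tags. Also, a tag like "calm and steady" is lifted directly from the review text. When I want to search my library later, a tag like that isn't much use. It seems Book Tracker's tag generator isn't yet where I think it should be.

Fortunately, when my colleague built this feature, they wrote an evaluation that measures tag quality against a set of criteria. Here's my evaluation for Book Tracker's book tags. I especially want to understand how we judge tag quality, so I scroll down to take a look. The qualitative aspects of the app are captured by the ScoreDimension type. Relevance tracks how well the tags represent information about the book, including plot, themes, or other relevant details. Usefulness measures how well a tag works as a search term.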
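ModelJudgeEvaluator uses the score dimensions and a prompt to generate scores for each set of tags. My plan is to add these two books to my evaluation and look at the tags that come back. It's also a good opportunity to see how my ratings compare with the model judge's.

Wanting to improve some part of an intelligence feature is the starting point of the hill-climbing loop. So to kick off the loop, you start in the develop phase, where you make whatever changes your feature and evaluation need. In this case, I'll add my review of Treasure Island to my evaluation dataset. I do the same for Little Women. Now that both entries are in my dataset, I want to run my evaluation to see what tags the model generates and how the judge scores them. But to get there, we first need to ask whether the evaluation run meets expectations. As a reminder, you can use Swift Testing's #expect macro to define expectations, so you can tell whether they were met by whether the test passes.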
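In this case, my evaluation met all of its expectations. But since I know the tags aren't yet where I want them, I need to investigate further. That brings us to the analyze phase. Xcode's new evaluation report gives me detailed information about my last evaluation run. To see more, I can click my BookTaggingEvaluation run. This opens the evaluation detail view, with a chart of aggregate metrics at the top and a table of results below.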
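What I want to do now is compare the model's responses with the list of expected tags I created earlier. I can do that by opening the assistant editor. Now I can see details about the tags generated for each entry in the dataset. What I want to focus on is the difference between what the model generated and what I expected, and I can examine that in this table. This set of terms isn't bad, but it misses some key details from the story that people might search for. So I'd give these tags a 4 for relevance and a 2 for usefulness. My model judge also gave the tags a 4 for relevance. Great. But it gave usefulness a 4 as well, which isn't right. Next I look at the tags for my Little Women review. Those tags don't include all the useful information I expected either, and once again I disagree with the scores the judge gave. Again, I think relevance should be a 4 and usefulness a 2.

This analysis makes it clear that my model judge and I differ in how we assess tags. This difference between model and human is called drift, and it's a problem every developer faces when evaluating intelligence features. Here's why. Say I have an evaluation with 10 samples, and I have both the model judge and a human score each one. Both score on a scale of 1 to 4, and at the end we average the scores to produce an aggregate. If the model and the human tend to disagree, their averages will diverge from each other. That's drift. As the dataset grows, so does the drift, and at that point it becomes hard to tell whether your feature is being evaluated correctly. To address this, you can align the judge to an expert's opinion.

Now that we know drift is a problem, we need a way to understand how far our model judge deviates from the expert's ratings. One way is to line up the expert's ratings and the judge's side by side, mark where the two match, and turn that into a percentage. This percentage is called accuracy. It's a good measure of alignment, provided each value on the rating scale is equally likely to occur. However, your dataset is more likely to contain an uneven distribution of scores. Think about it: datasets usually contain examples of high-quality output, so human raters tend to give the entries high scores. If the model happens to give high scores on a small dataset, the two appear aligned. But on a larger dataset with more variation in scores, the model's habit of favoring high scores will still cause drift. So we need an alternative to accuracy, one that accounts for the weighted nature of the dataset and for the probability that the model simply guessed the right answer.

Fortunately, there's a solution. Cohen's kappa coefficient is a mathematical formula introduced in 1960 by the statistician and psychologist Jacob Cohen. Cohen's kappa measures alignment, that is, how often two raters agree beyond what chance would produce. To compute it, we need the proportion of items on which the raters agree, which is exactly what the accuracy metric computed earlier. But now we also need a new value: chance agreement, the probability that the raters agree by coincidence. This luck is weighted by probability, since some answers are more likely to occur than others. So how do we compute it? To calculate alignment, we start with the accuracy score. From it, we subtract the likelihood that the two raters agree at random. Finally, we divide the difference by the complement of chance agreement (one minus that likelihood), which is the room left for the two raters to agree on purpose. The result is the alignment score. Cohen's kappa is a powerful way to measure agreement between a model judge and expert opinion, and I can use it to gradually raise the alignment score between me and my model judge.

Now we're back at the start of the hill-climbing loop, in the develop phase. Here I'll set up an evaluation that compares my ratings with the judge's and produces an alignment score. To do that, I need to write an evaluation made up of four components. First is the dataset, then the subject of the evaluation. Next, I need to define the evaluator, and finally I need to aggregate the results.

Let's start with the dataset. For the evaluation to work, my model judge and I need to assess exactly the same data. In this case the model judge reviews tags, so I need a common set of tags for the judge and me to review. I happen to have the perfect dataset: my earlier evaluation contains a set of reviews and tags. Because I ran that evaluation in a test, Xcode generated an attachment containing all of the generated evaluation data. I can grab that attachment and extract the summary and tag pairs. Once they're extracted, I add my ratings. After that, I can use the contents of this file as the input to my evaluation.

Next, I need to capture the subject. Normally the subject method is used to call the API for your feature. But since the generated model responses are part of the dataset, we can simply return the tags that were already generated. Now I need to define my evaluator. You may have guessed it: my evaluator is exactly the same model judge evaluator as in the book tagging evaluation, and the same judge provides its ratings here. Finally, I need to aggregate the results. This is where we compare my ratings with the judge's. To do that, we need to compute Cohen's kappa, which I can do with a custom aggregation. In addition to Cohen's kappa, I also compute the mean and standard deviation for each score dimension. That helps show whether the judge's ratings are going up or down. Now I can associate a test with the evaluation. For this test, I set the expectation that my ratings and the judge's should produce an alignment score of 0.6. We chose this number because, according to statisticians, an alignment score of 0.6 represents a meaningful level of agreement.

Now it's time to run the evaluation, get a baseline for our alignment, and determine whether the evaluation meets my expectations. It looks like the test failed, which means my expectations weren't met. So once again, I need to analyze the results in detail. Now that I know my alignment score fell short, I can head to the evaluation report for more information. As I expected, the scores for both usefulness and relevance are quite low, which means my model judge disagrees with me. Now I want more information about how each sample in the dataset performed. To get it, I open the assistant editor and look through the results in detail. As I browse, this review of Frankenstein catches my eye. I can see a fairly large gap between my ratings of the tags and the judge's. The judge seems to think tags like self-help and self-improvement are relevant to the story. It also treats psychological as an acceptable search term, though people are unlikely to search for it. Then I look through the dataset for other entries with similar problems and find this review of The Ramakien. The judge and I both find these terms relevant to the book's content. Where we disagree is usefulness: visual-dimension and quaint-dignity are too specific.

So what's going wrong? I think the model on its own doesn't have enough knowledge to tell a good tag from a bad one, probably because my judge prompt doesn't provide enough context. To fix that, I need to develop a new prompt, so I can compare the alignment score of the current prompt with the score of the new one. Fortunately, in Xcode 27 we can compare the results of two evaluations against each other. When making comparisons, a little scientific thinking goes a long way. A scientific experiment has two groups: the control group, which represents the baseline, and the experimental group, which represents the change we're trying to compare. We can think about two versions of a prompt the same way, with the control represented by the base prompt and the experimental group by our newly modified prompt.

I now need to create a second version of the evaluation that uses the experimental prompt. For the baseline, we'll use the same evaluation and the same model judge prompt as before. For the experimental prompt, I wrote a more detailed description of how to judge this set of tags. It first gives the judge context about the app and what it's about to judge, then gives examples of good tags and ways to recognize bad ones. With both prompts written, I can add both evaluations to the test suite, which runs them both. I'll run the suite now and compare the results. Once the evaluations finish, I can return to the evaluation report. My relevance alignment score seems to have improved, while my usefulness alignment score dropped substantially. Balancing a tradeoff like this is tricky, so I need to think carefully about how to proceed. But before digging in, let's check whether we passed. My test confirms it: we didn't.

After some thought, I decide to keep this prompt change and focus the next iteration on raising the usefulness score. The most effective way to review the results is to compare the two judges' usefulness scores against each other. For that, I can use the new comparison view in the evaluation report. In the report, I click the compare button and open my baseline evaluation. Here I can see the scores for the two prompts side by side. One thing jumps out right away: the difference between the usefulness scores. In this review of The Picture of Dorian Gray, the model seems to score usefulness too harshly. The usefulness column in the experimental evaluation seems to confirm my suspicion. I notice that every score is either a 3 or a 2, which is far too strict.

I think one way to improve this is to be more specific about how to score each dimension, so I need to make some changes to the experimental evaluation. But before modifying it, I carry the new prompt over from the experimental evaluation into my baseline. That ensures only one variable differs: the change to the score dimensions. For relevance, I provide a slightly longer description that emphasizes the need to include genre tags. Here's the description for usefulness, which emphasizes being more critical of overly specific tags. Once again I wait for the evaluation to finish. Both scores improved substantially over the baseline, so it looks like these more specific score dimensions are going to help. But we haven't fully reached our alignment goal, so I need another comparison to find what else can be improved.

To dig deeper, I go back to the experimental evaluation. I want to review the results in detail, so I open the assistant editor. Looking through the results, I find the review of Moby-Dick. My relevance scores are starting to align, but my usefulness scores still need work. Some results look good, while others are still far apart. The Frankenstein review is still giving the model judge trouble. I think the model judge now needs some examples of how I judge, to give it a reference pattern for rating by my standards. That means another round of hill-climbing. I've already added the new score dimensions to the baseline evaluation. Now I rework the main judge prompt to give it more detail about the goal of the tag generation feature, which helps the model understand the context of the problem. On top of that, I write a series of examples for the model to use as a guide when reviewing. I give the model only a few examples. If I gave it a longer list, I could easily overfit the alignment score, which would make it hard to tell whether the model judge is truly aligned with my standards. Now that the comparison is fair, I run the evaluation and look at the results. My scores finally exceed the expected value. That means I've finally passed and can exit the loop. It means that when the model judge provides a rating, I can say with confidence whether the tags meet my standards. And it means I can now put the model judge to work evaluating Book Tracker's Book Tagging Service.

So far, we've seen how to hill-climb on prompts to make them steadily better. Now I want to show you how to improve your feature in ways that go beyond the prompt. To generate tags, Book Tracker uses the on-device model. We use it because readers organize their books in all kinds of environments, and the on-device model ensures they can generate tags wherever they are. What I want to do is give the model more context about the book when it generates tags. I think the extra context will help the model produce more relevant and more useful tags. Even better, Book Tracker already has the data it needs, because we store the author's name and the book's title when the user writes a review. So to help the tag generator, I created a tool that fetches more information about the book. It provides the title and author when that information is available. Adding this tool is a form of hill-climbing, because we're trying to raise the quality of the feature through incremental change.

For this evaluation, we'll use the book tagging evaluation, now paired with the improved model judge. But I need a way to compare the quality of my feature without the tool against its quality with the tool. To do that, I need to modify the Book Tagging Service. BookTaggingService now accepts a list of tools as input. I also set the default value to an empty array, so my existing evaluations don't need any changes. Now I need to write a new evaluation that compares how the service performs with and without the tool. Here's the new evaluation I wrote. It's identical to the other one; the only difference is that I now pass the new lookup tool in the tools array. So I just define two evaluation instances, one without the tool and one with it. Now let's evaluate it and decide whether we can ship.

My service with the tool met all expectations, and things look good. But my Book Tracker dataset contains only 13 book and review pairs, which doesn't cover the wide variety of books and reviews that users might submit. Also, while reviewing the results for the service with the tool, I can see that it performs better, but it looks like my tool isn't being called everywhere I think it should be. What I really need is a way to tell whether my tool is being called in the right situations. Fortunately, the Evaluations framework can help with both problems. To learn about our APIs for evaluating tool use and for generating comprehensive datasets, see the video "Create robust evaluations for agentic apps". You'll learn about tool-calling Evaluators and how to use the Sample Generator API to test the wide range of use cases your app might encounter.

Before we wrap up, I want to recap what we covered today. First, hill-climbing works best when you make only one change at a time. To do that, treat each iteration of the loop as a scientific experiment. Being able to isolate your changes helps you understand how each part of your feature contributes to overall quality. Knowing how each part works on its own also helps you decide where to make changes to address defects or bad patterns that show up later.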
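Second, this process takes time. Not every change produces a positive result, but failed experiments are just as valuable as successful ones. Third, good experiments take creativity. There's a lot you can change in an intelligence feature. In your feature, you can change the instructions, the tools, and the model or models used to generate responses. On the evaluation side, you can change the dataset, the aggregation methods, and even the Evaluators themselves. Everything is fair game, so be sure to consider all of these factors when you think about how to hill-climb. Finally, watch out for drift. Evaluating your Evaluators can feel a bit circular, but a well-tuned model Evaluator saves you time in the long run. A model generates ratings far faster than a human can, so by staying aligned you keep getting a useful signal as your dataset grows to cover more use cases.

If you'd like to learn more about what we covered today, check out the Book Tracker app I've been using, along with the evaluation used to align the model judge. You can also find full documentation for all the new APIs on the developer documentation website. Thank you for taking the time to learn how to raise your evaluation scores through hill-climbing. Your effort will pay off, and you'll deliver a high-quality experience to your users. Thanks for watching, and happy climbing!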

// MARK: - Evaluation
  struct BookTaggingEvaluation: Evaluation {
      func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
          let result = try await BookTaggingService.generateTags(for: sample.promptDescription)
          return ModelSubject(value: result)
      }

      // MARK: - Dataset
      var dataset = ArrayLoader(samples:
          Book.sampleBooks.map { book in
              ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
          }
      )

      // MARK: - Evaluators & Metrics
      let tagCount = Metric("Tag Count")
      let hasGenreTag = Metric("Has Genre Tag")
      let noDuplicates = Metric("No Duplicates")

      let relevance = ScoreDimension(
          "Relevance",
          description: """
              Whether each tag describes a quality, theme, or tone of the
              book itself rather than incidental details or the reader's
              personal reactions.
              """,
          scale: .numeric([
              4: "Every tag describes the book itself",
              3: "Most tags describe the book, one picks up a reader reaction or minor detail",
              2: "Most tags are surface details or personal reactions, not book descriptors",
              1: "Tags don't meaningfully describe the book"
          ])
      )

      let usefulness = ScoreDimension(
          "Usefulness",
          description: """
              Whether tags are at the right granularity for browsing — broad
              enough that multiple books could share the tag, specific enough
              to help filter.
              """,
          scale: .numeric([
              4: "Every tag could group multiple books while still narrowing a search",
              3: "Most tags are at the right level, one is either too broad or too narrow",
              2: "Most tags are too broad to filter or too narrow to group",
              1: "Tags would not help with browsing"
          ])
      )

      var evaluators: Evaluators {
          // 1. Tag count is within the required 3–8 range
          Evaluator { _, subject in
              let count = subject.value.tags.count
              if (3...8).contains(count) {
                  return tagCount.passing(rationale: "\(count) tags")
              }
              return tagCount.failing(rationale: "Got \(count) tags, expected 3–8")
          }
  
          // 2. At least one tag identifies the genre or literary form
          Evaluator { _, subject in
              let tags = subject.value.tags.map { $0.lowercased() }
              let knownGenres = await BookTaggingService.knownGenres
              for tag in tags {
                  if knownGenres.contains(tag) {
                      return hasGenreTag.passing(rationale: "Matched \(tag)")
                  }
              }
              return hasGenreTag.failing()
          }

          // 3. No duplicate tags
          Evaluator { _, subject in
              let tags = subject.value.tags.map { $0.lowercased() }
              let duplicateCount = tags.count - Set(tags).count
              if duplicateCount > 0 {
                  return noDuplicates.failing(rationale: "Found \(duplicateCount) duplicates")
              }
              return noDuplicates.passing()
          }
  
          // 4. Overall tag quality: relevance and usefulness, scored by the model judge
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are evaluating automatically generated tags for Book Tracker, a personal
                      book tracking app. Users write a short summary of their reading
                      experience, and the app generates tags to make their library browsable.
                      A good tag describes the book itself — its genre, themes, tone, or
                      setting. A bad tag picks up incidental details or the reader's personal
                      reactions that don't describe the book.
                      """,
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }

      // MARK: - Analysis
      func aggregateMetrics(using aggregator: inout MetricsAggregator) {
          aggregator.group("Heuristics") { group in
              group.computeMean(of: tagCount)
              group.computeMean(of: hasGenreTag)
              group.computeMean(of: noDuplicates)
          }
          aggregator.group("Quality") { group in
              group.computeMean(of: relevance.metric)
              group.computeMean(of: usefulness.metric)
          }
      }
  }

4:05 - Refined Relevance & Usefulness score dimensions

let relevance = ScoreDimension(
      "Relevance",
      description: """
          Whether each tag describes the book itself — its genre, themes,
          tone, or setting — rather than the reader's reactions, meta-
          commentary about the review, or facts about the author. A book
          can be "suspenseful" (a property of the text); a reader is
          "exhausted" (a reaction). Mis-labeling the genre is a serious failure.
          """,
      scale: .numeric([
          4: "Every tag describes the book itself",
          3: "Most tags describe the book, one picks up a reader reaction or minor detail",
          2: "Most tags are surface details or personal reactions, not book descriptors",
          1: "Tags don't meaningfully describe the book"
      ])
  )

  let usefulness = ScoreDimension(
      "Usefulness",
      description: """
          Whether tags work as library shelf labels — broad enough that
          several books could plausibly share the tag, specific enough to
          meaningfully narrow a search. Standard genre and theme tags work;
          made-up phrases, character names, hyper-specific descriptors, and
          overly generic words like "interesting" don't.
          """,
      scale: .numeric([
          4: "Every tag could group multiple books while still narrowing a search",
          3: "Most tags are at the right level, one is either too broad or too narrow",
          2: "Most tags are too broad to filter or too narrow to group",
          1: "Tags would not help with browsing"
      ])
  )

11:56 - The alignment dataset, extracted to JSON

// Model judge alignment dataset
  [
    {
      "input": "I have read this book more times than I can count…",
      "response": "[\"literary-fiction\", \"historical-fiction\", \"family-drama\", \"romantic-drama\", \"character-driven\", \"emotional-intensity\", \"multigenerational-narrative\", \"penned-by-a-woman\"]"
    }
    // ... add your expert ratings to each entry
  ]
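
Once expert ratings have been added, one way to turn the extracted file into typed values is a small Decodable model. The following is a minimal sketch, not the sample project's code: the `AlignmentEntry` type, the `loadAlignmentEntries(from:)` helper, and the `expertRelevance` and `expertUsefulness` keys are all assumptions made for illustration. It also assumes `response` remains a JSON array encoded inside a string, as in the excerpt above.

```swift
import Foundation

// Hypothetical shape of one dataset entry after expert ratings are added.
struct AlignmentEntry: Decodable {
    let input: String          // the reader's review
    let response: String       // generated tags, stored as a JSON array inside a string
    let expertRelevance: Int   // assumed key: the expert's relevance rating
    let expertUsefulness: Int  // assumed key: the expert's usefulness rating

    // Decodes the nested JSON array; returns an empty list if it is malformed.
    var tags: [String] {
        (try? JSONDecoder().decode([String].self, from: Data(response.utf8))) ?? []
    }
}

// Decodes the whole file into typed entries.
func loadAlignmentEntries(from data: Data) throws -> [AlignmentEntry] {
    try JSONDecoder().decode([AlignmentEntry].self, from: data)
}

// A made-up entry, used only to exercise the decoder.
let sampleJSON = #"""
[
  {
    "input": "A short review of a seafaring adventure",
    "response": "[\"adventure\", \"coming-of-age\"]",
    "expertRelevance": 4,
    "expertUsefulness": 2
  }
]
"""#
```

Keeping the ratings next to the tags they describe means the judge and the expert are always compared on the same item, which is the property the alignment evaluation depends on.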

12:31 - The judge alignment evaluation: dataset, subject, evaluator

// Model judge alignment evaluation
  struct BookTagJudgmentCalibration: Evaluation {

      // MARK: Dataset — load the extracted summary/tag pairs
      static let samples: [ModelSample<BookTagJudgmentValue>] = {
          guard let url = Bundle(for: BundleToken.self).url(
                  forResource: "BookTaggingEvaluation-extracted", withExtension: "json"),
                let data = try? Data(contentsOf: url) else { return [] }
          // Build ModelSample array (adding expert ratings)
          // ...
      }()

      var dataset: some Loader { ArrayLoader(samples: Self.samples) }
  
      // MARK: Capture Subject — tags are already generated, so just return them
      func subject(from sample: ModelSample<BookTagJudgmentValue>) async throws -> ModelSubject<BookTagJudgmentValue> {
          ModelSubject(value: sample.expected ?? BookTagJudgmentValue(
              tags: [], expertRelevanceScore: 0, expertUsefulnessScore: 0))
      }

      // MARK: Evaluators — the same model judge as the book-tags evaluation
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: "You are evaluating automatically generated tags for Book Tracker…",
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

13:00 - Cohen's kappa aggregation

func aggregateMetrics(using aggregator: inout MetricsAggregator) {
      let expertRelevance = Self.samples.map { Double($0.expected?.expertRelevanceScore ?? 0) }
      let expertUsefulness = Self.samples.map { Double($0.expected?.expertUsefulnessScore ?? 0) }

      aggregator.group("Relevance") { group in
          group.computeMean(of: relevance.metric)
          group.computeStandardDeviation(of: relevance.metric)
          group.custom(of: relevance.metric, label: "Relevance Alignment Score") { judge in
              cohensKappa(ratings1: expertRelevance, ratings2: judge) ?? 0
          }
      }
      aggregator.group("Usefulness") { group in
          group.computeMean(of: usefulness.metric)
          group.computeStandardDeviation(of: usefulness.metric)
          group.custom(of: usefulness.metric, label: "Usefulness Alignment Score") { judge in
              cohensKappa(ratings1: expertUsefulness, ratings2: judge) ?? 0
          }
      }
  }
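
The aggregation above calls a `cohensKappa(ratings1:ratings2:)` helper that the excerpt does not show. A minimal sketch of such a helper follows, assuming both inputs are `[Double]` arrays in the same sample order; the body is an illustration of the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement), not the sample project's implementation.

```swift
import Foundation

/// Cohen's kappa for two raters who scored the same items.
/// Returns nil for empty or mismatched inputs, or when chance agreement is 1
/// (both raters always give one identical score), where kappa is undefined.
func cohensKappa(ratings1: [Double], ratings2: [Double]) -> Double? {
    guard !ratings1.isEmpty, ratings1.count == ratings2.count else { return nil }
    let n = Double(ratings1.count)

    // Observed agreement: the fraction of items with identical scores (accuracy).
    let matches = zip(ratings1, ratings2).filter { pair in pair.0 == pair.1 }.count
    let observed = Double(matches) / n

    // How often each rater used each score.
    var counts1: [Double: Double] = [:]
    var counts2: [Double: Double] = [:]
    for rating in ratings1 { counts1[rating, default: 0] += 1 }
    for rating in ratings2 { counts2[rating, default: 0] += 1 }

    // Chance agreement: for each score, the probability that both raters
    // pick it independently, summed over all scores.
    let expected = counts1.reduce(0.0) { sum, entry in
        sum + (entry.value / n) * ((counts2[entry.key] ?? 0) / n)
    }

    guard expected < 1 else { return nil }
    return (observed - expected) / (1 - expected)
}

// Made-up ratings: the judge matches the expert on 3 of 4 items (75% accuracy),
// but the expert only ever gives 4s, so that much agreement is what chance predicts.
let expert: [Double] = [4, 4, 4, 4]
let judge: [Double] = [4, 4, 4, 3]
let kappa = cohensKappa(ratings1: expert, ratings2: judge)
```

With these made-up ratings, accuracy is 0.75 while kappa is 0, which is the skewed-distribution case where accuracy alone overstates alignment.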

13:24 - The judge calibration test

// Model judge alignment tests
  @Suite("Book Tag Judge Calibration")
  struct BookTagJudgmentCalibrationTests {
      static let evaluation = BookTagJudgmentCalibration()

      @Test("Judge Calibration", .evaluates(evaluation))
      func evaluateJudgeCalibration() async throws {
          let result = EvaluationContext.current.result

          let usefulnessMetric = BookTagJudgmentCalibrationTests.evaluation.usefulness.metric
          let relevanceMetric = BookTagJudgmentCalibrationTests.evaluation.relevance.metric

          // Labels must match the ones registered in aggregateMetrics(using:).
          #expect(result.aggregateValue(.custom(label: "Relevance Alignment Score")) > 0.6)
          #expect(result.aggregateValue(.custom(label: "Usefulness Alignment Score")) > 0.6)
      }
  }

16:33 - The experimental judge prompt

// Experimental evaluation
  struct BookTagJudgmentCalibrationExperimental: Evaluation {
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are an experienced reader and librarian evaluating tags
                      automatically generated for Book Tracker... Score the tag set on two
                      independent dimensions: Relevance and Usefulness.

                      ## What a good tag looks like
                      - Genre/form, theme/subject, tone/atmosphere, setting/era

                      ## Common failure modes
                      - Reader reactions, meta-commentary, author facts, genre contradictions
                      """,   // ← full prompt is ~40 lines; abbreviated here
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Book Review": input.promptDescription,
                       "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

20:12 - Few-shot worked examples in the judge prompt

struct ExperimentalBookTagJudgmentCalibration: Evaluation {
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: SystemLanguageModel(),
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are calibrating with an expert librarian who scores
                      automatically generated tags for Book Tracker... Your goal is to
                      match how the librarian scores. Use the worked examples to calibrate.

                      ## Worked examples
                      ### Example A — clean fit (Pride and Prejudice)
                      Tags: romance, historical-fiction, love, redemption, passion
                      Librarian: Relevance 4, Usefulness 4

                      ### Example E — flat genre contradiction (Frankenstein)
                      Tags: horror, science-fiction, ... self-help, self-improvement
                      Librarian: Relevance 2, Usefulness 3
                      ... (6 examples A–F; keep the set small to avoid overfitting)
                      """,   // ← full prompt is ~60 lines; abbreviated here
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Book Review": input.promptDescription,
                       "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }
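
If the worked examples are kept as data rather than typed directly into the prompt, the few-shot section can be rendered from a list, which makes it easy to cap how many examples the judge sees. This is a minimal sketch under that assumption: `WorkedExample` and `renderWorkedExamples(_:limit:)` are hypothetical names, and the heading layout only approximates the abbreviated prompt above.

```swift
import Foundation

// Hypothetical container for one expert-rated example.
struct WorkedExample {
    let label: String      // for example "A"
    let summary: String    // for example "clean fit (Pride and Prejudice)"
    let tags: [String]
    let relevance: Int     // expert's relevance rating
    let usefulness: Int    // expert's usefulness rating
}

// Renders at most `limit` examples as a prompt section.
// The cap keeps the example set small, in line with the overfitting concern.
func renderWorkedExamples(_ examples: [WorkedExample], limit: Int = 6) -> String {
    let sections = examples.prefix(limit).map { example in
        """
        ### Example \(example.label): \(example.summary)
        Tags: \(example.tags.joined(separator: ", "))
        Librarian: Relevance \(example.relevance), Usefulness \(example.usefulness)
        """
    }
    return (["## Worked examples"] + sections).joined(separator: "\n")
}
```

The returned string could then be interpolated into the judge's instructions in place of the hand-written examples.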

22:03 - The BookLookupTool

// Book Information Lookup Tool
  struct BookLookupTool: Tool {
      let name = "lookupBook"
      let description = "Looks up the title and author of a book given distinguishing details (such as character names, settings, quoted lines, or notable plot points) extracted from a reader's review."

      @Generable
      struct Arguments {
          @Guide(description: "Distinguishing details from the review that identify the book, such as character names, settings, quoted lines, or notable plot points.")
          var details: String
      }
  
      @Generable
      struct Output {
          @Guide(description: "The title of the identified book, or an empty string if no match was found.")
          var title: String

          @Guide(description: "The author of the identified book, or an empty string if no match was found.")
          var author: String
      }
  
      func call(arguments: Arguments) async throws -> Output {
          let needles = arguments.details
              .lowercased()
              .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
              .map(String.init)
              .filter { $0.count >= 4 }

          let best = Book.sampleBooks
              .map { book -> (book: Book, score: Int) in
                  let review = book.review.lowercased()
                  let score = needles.reduce(0) { partial, needle in
                      partial + (review.contains(needle) ? 1 : 0)
                  }
                  return (book, score)
              }
              .max(by: { $0.score < $1.score })

          guard let match = best, match.score > 0 else {
              return Output(title: "", author: "")
          }
          return Output(title: match.book.title, author: match.book.author)
      }
  }
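
The keyword matching inside `call(arguments:)` can be exercised without a model session by pulling it into plain functions. The sketch below mirrors that logic under stated assumptions: `CatalogBook` is a hypothetical stand-in for the sample's `Book` type, and `keywords(in:)` and `bestMatch(for:in:)` are illustrative names that do not appear in the sample.

```swift
import Foundation

// Hypothetical stand-in for the sample's Book type.
struct CatalogBook {
    let title: String
    let author: String
    let review: String
}

// Splits free text into lowercase keywords of four or more characters.
func keywords(in details: String) -> [String] {
    details.lowercased()
        .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        .map(String.init)
        .filter { $0.count >= 4 }
}

// Returns the book whose review contains the most keywords, or nil if none match.
func bestMatch(for details: String, in books: [CatalogBook]) -> CatalogBook? {
    let needles = keywords(in: details)
    let scored = books.map { book -> (book: CatalogBook, score: Int) in
        let review = book.review.lowercased()
        return (book, needles.filter { review.contains($0) }.count)
    }
    guard let best = scored.max(by: { $0.score < $1.score }), best.score > 0 else {
        return nil
    }
    return best.book
}
```

Because the match is a substring test, a keyword such as "long" also matches "belong". That is acceptable for a small sample catalog, but it is worth keeping in mind when judging whether the tool returned the right book.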

22:36 - BookTaggingService with a tools parameter

// Book Tagging Service
  struct BookTaggingService {
      static func generateTags(for review: String, tools: [any Tool] = []) async throws -> BookTags {
          let prompt = tagsPrompt(review: review)
          let session = LanguageModelSession(
              model: SystemLanguageModel(guardrails: .permissiveContentTransformations),
              tools: tools,
              instructions: instructions
          )
          let response = try await session.respond(to: prompt, generating: BookTags.self)
          return response.content
      }
  }

22:57 - Evaluation with the lookup tool

// Evaluation of tags with tool
  struct BookTaggingWithLookupEvaluation: Evaluation {
      func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
          let result = try await BookTaggingService.generateTags(
              for: sample.promptDescription,
              tools: [BookLookupTool()]
          )
          return ModelSubject(value: result)
      }
      // ... same dataset, evaluators, and aggregation as BookTaggingEvaluation
  }

23:09 - Compare with/without the tool in one suite

@Suite("Book Tag Evaluations")
  struct BookTagEvaluationTests {
      static let evaluation = BookTaggingEvaluation()
      static let lookupEvaluation = BookTaggingWithLookupEvaluation()

      @Test("Book Tag Evaluations", .evaluates(evaluation, info: evaluationInfo))
      func evaluateBookTagging() async throws {
          let result = EvaluationContext.current.result
          let rangeMetric = BookTagEvaluationTests.evaluation.tagCount
          let dupeMetric = BookTagEvaluationTests.evaluation.noDuplicates
          #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
          #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
      }

      @Test("Book Tag Evaluations (with BookLookupTool)", .evaluates(lookupEvaluation, info: lookupEvaluationInfo))
      func evaluateBookTaggingWithLookup() async throws {
          let result = EvaluationContext.current.result
          let rangeMetric = BookTagEvaluationTests.lookupEvaluation.tagCount
          let dupeMetric = BookTagEvaluationTests.lookupEvaluation.noDuplicates
          #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
          #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
      }
  }
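
Outside Xcode's comparison view, the same control-versus-experimental reading can be done on exported per-sample scores. The following is a minimal sketch that assumes the scores for one dimension are already available as dictionaries keyed by a sample identifier; `ScoreDelta`, `compareRuns(control:experimental:)`, and the identifiers are hypothetical and not part of the Evaluations framework.

```swift
import Foundation

// One sample's score in the control run and in the experimental run.
struct ScoreDelta {
    let sampleID: String
    let control: Double
    let experimental: Double
    var change: Double { experimental - control }
}

// Pairs up samples present in both runs, in a stable (sorted) order.
func compareRuns(control: [String: Double], experimental: [String: Double]) -> [ScoreDelta] {
    control.keys.sorted().compactMap { id -> ScoreDelta? in
        guard let before = control[id], let after = experimental[id] else { return nil }
        return ScoreDelta(sampleID: id, control: before, experimental: after)
    }
}

// Mean change across samples, or nil when there is nothing to compare.
func meanChange(_ deltas: [ScoreDelta]) -> Double? {
    guard !deltas.isEmpty else { return nil }
    return deltas.map(\.change).reduce(0, +) / Double(deltas.count)
}

// Made-up usefulness scores for two runs, for illustration only.
let withoutTool = ["frankenstein": 2.0, "moby-dick": 3.0]
let withTool = ["frankenstein": 3.0, "moby-dick": 2.0, "little-women": 4.0]
let deltas = compareRuns(control: withoutTool, experimental: withTool)
let regressions = deltas.filter { $0.change < 0 }.map(\.sampleID)
```

Listing regressions alongside the mean keeps an unchanged average from hiding samples that got worse, which is the kind of tradeoff the side-by-side comparison is meant to surface.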

- 0:00 - Introduction
- Hill-climbing — iteratively improving an intelligence feature using evaluation scores as a guide (develop, run, analyze) — framed around bringing scientific thinking to that loop. Assumes you've already built an evaluation pipeline (see "Meet the Evaluations framework").
- 2:42 - BookTracker's tagging problem
- Revisits BookTracker, whose tag generator produces tags that miss key themes or reflect the reader's feelings rather than the book. The existing evaluation judges tag quality via score dimensions (Relevance, Usefulness) and a ModelJudgeEvaluator.
- 5:27 - Analyzing the evaluation results
- Adds two reviews to the dataset, runs the evaluation (Swift Testing #expect), and uses the Xcode evaluation report and assistant editor to compare generated tags against expected ones, revealing the human and model judge disagree on usefulness.
- 8:26 - Drift between judge and human
- That disagreement is drift, the divergence between a model judge's ratings and an expert's. As the dataset grows, drift widens, making it hard to trust the evaluation, so the judge must be aligned to expert opinion.
- 9:37 - Measuring drift with Cohen's kappa
- Accuracy alone misleads on unevenly distributed scores (a judge that favors high scores can look aligned by luck). Cohen's kappa coefficient measures true alignment by subtracting the chance of random agreement from accuracy and normalizing by one minus that chance, which makes it a more robust measure of drift.
- 12:26 - Building a judge alignment evaluation
- Builds an evaluation comparing the presenter's ratings to the judge's over a shared dataset: extract summary/tag pairs from the prior run's attachment, add human ratings, return the already generated tags as the subject, reuse the same ModelJudgeEvaluator as the evaluator, and aggregate Cohen's kappa (plus mean and standard deviation), targeting an alignment score of 0.6.
- 15:16 - Analyzing alignment failures
- The alignment test fails. Drilling into the report (for example Frankenstein and The Ramakien) shows the judge rating overly specific or off-theme tags too highly; the judge's prompt lacks the context to tell a good tag from a bad one.
- 17:16 - Comparative evaluation: control vs experimental
- Xcode 27 can compare two evaluations like a controlled experiment: a baseline (control) prompt versus an experimental prompt that adds app context plus examples of good and bad tags. Running both shows relevance improved while usefulness dropped, a tradeoff to weigh.
- 19:12 - Refining the scoring dimensions
- With the prompt change kept, the side-by-side comparison view reveals that the judge grades usefulness too harshly. After the new prompt is applied to the baseline so that only one variable differs, the ScoreDimension descriptions are sharpened (emphasizing genre tags and being critical of overly specific ones), which improves both scores.
- 21:23 - Adding few-shot examples to the judge
- Still short of the goal, the judge prompt is grounded with the feature's purpose and a few worked examples of how the presenter rates, deliberately few to avoid overfitting the alignment score. Scores finally exceed expectations, so the judge is trusted and the loop exits.
- 23:38 - Going beyond prompts: adding a tool
- Hill-climbing isn't only about prompts: to give the on-device tag model more context, a BookLookupTool supplies the title and author. BookTaggingService gains a tools parameter (defaulting to an empty array), and a second evaluation compares the feature with and without the tool. The tool version scores better, though the small 13-sample dataset and the tool not being called in every expected case point to "Create robust evaluations for agentic apps."
- 27:17 - Next steps
- Think like a scientist (one change at a time), invest the time (failed experiments still inform), be creative (instructions, tools, models, datasets, aggregations, and evaluators are all fair game), and watch for drift. Download the Book Tracker sample and review the documentation.
