Improve your prompts by hill-climbing with Evaluations


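Learn comparative evaluation techniques that guide your prompt engineering and help you choose the best model for your app. Establish a performance baseline, scale your evaluation strategy, and convert evaluation results to JSON so they can be integrated with other tools. We'll also cover when to apply different prompting strategies and how to tune prompts iteratively for the best results.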

Chapters
- 0:00 - Introduction
- 2:42 - BookTracker's tagging problem
- 5:27 - Analyzing the evaluation results
- 8:26 - Drift between judge and human
- 9:37 - Measuring drift with Cohen's kappa
- 12:26 - Building a judge alignment evaluation
- 15:16 - Analyzing alignment failures
- 17:16 - Comparative evaluation: control vs experimental
- 19:12 - Refining the scoring dimensions
- 21:23 - Adding few-shot examples to the judge
- 23:38 - Going beyond prompts: adding a tool
- 27:17 - Next steps
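
Hi, I'm Marcus, a manager on the Evaluations framework team. I'm going to show you how to use evaluations to improve your intelligence features. As you know, bringing AI into your app is a powerful way to offer people a new level of personalization. This technology gives apps a depth that traditional software could not achieve. But it can be hard to know whether an intelligence feature behaves as expected in every situation. That's why we're releasing the Evaluations framework: to give you the tools to ship with confidence.
Shipping with confidence takes more than a framework, though. With the Evaluations framework you can also hill-climb, an iterative process of raising feature quality using evaluation scores as a guide. Hill-climbing starts with development, where you make a change that you want to measure against the existing feature. Once the change is complete, you run the evaluation and check whether the results meet your expectations. Then you analyze the results to find where the feature can improve further. Hill-climbing is a great way to improve a feature systematically. But effective hill-climbing takes more than running the loop. It also takes a little bit of... science!
In this video, I'll walk through the hill-climbing loop to improve a prompt, bringing in scientific thinking along the way. Next, I'll show how to run comparative evaluations, which make the hill-climbing process easier. And finally, I'll go beyond prompt changes and cover how to improve other aspects of an intelligence feature. Before going further: this video is about hill-climbing an existing evaluation. That means it assumes you've already built the foundation of an evaluation pipeline, which gives you a comprehensive picture of your intelligence feature's strengths and weaknesses. If you're not familiar with how to do that, watch "Meet the Evaluations framework", which covers everything you need to build a great evaluation pipeline.
Now, let's get started. In "Meet the Evaluations framework", we introduced Book Tracker, an app that lets readers catalog and review books.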
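Lately I've been reading a lot of classics and adding them to my catalog. In fact, I just finished "Treasure Island"! It's a thought-provoking story about the conflict between loyalty and betrayal. One of Book Tracker's new features is a tagging service, where a model generates tags from a reader's review. The tags for this review cover the book's overall themes, but something feels missing. I was expecting tags like "tense" or "morally grey" that capture the themes of the story. The tags for "Little Women" had a similar problem. A tag like "poignant" is about the reader's feelings, not the content of the book. It's fine for a review to be emotional, but that shouldn't end up in the tag list. And a tag like "quiet-steadiness" was lifted directly from the review, so it won't be much help when searching the library later.
Book Tracker's tag generation doesn't seem to be meeting expectations. Fortunately, when my colleague built this feature they also created an evaluation, so the tags can be measured against a set of criteria. Here is Book Tracker's book tag evaluation. I'm particularly curious how it assesses tag quality, so I'll scroll down to check. The qualitative aspects of the app are captured with the score dimensions type. Relevance tracks how well the tags represent the book: its plot, its themes, or other relevant information. Usefulness measures how well the tags work as search terms.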
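The ModelJudgeEvaluator uses the score dimensions and a prompt to generate a score for each set of tags. I plan to add these two books to the evaluation and check the tags that come back. It's also a good opportunity to compare the model judge's ratings with my own. Wanting to improve some part of an intelligence feature is the starting point of the hill-climbing process. The loop begins with the development phase, where you make the necessary changes to the feature and the evaluation. In this case, I add my "Treasure Island" review to the evaluation dataset, and I did the same for "Little Women". With these two entries in the dataset, I run the evaluation to see the tags the model generates and the judge's scores. First, though, I check whether the evaluation run met expectations.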
As a reminder, you define expectations with the Swift Testing #expect macro. Whether the test passes tells you whether the expectations were met.
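In this case, the evaluation met all of its expectations, but I know the tags aren't good enough, so more investigation is needed. That takes me to the analysis phase. The new evaluation report in Xcode shows detailed information about the most recent evaluation run. To see the details, I click the BookTaggingEvaluation run. The evaluation detail view appears, with a chart of aggregate metrics at the top and a table of results below.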
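Here I compare the model's responses with the list of expected tags I prepared in advance. Opening the Assistant Editor shows detailed information about the tags generated for each item in the dataset. I focus on the differences between what the model generated and what I expected, which this table shows in detail. The collection of terms isn't bad in itself, but it's missing some important details: story details that a user would want to search for. So I give these tags a 4 for relevance and a 2 for usefulness. It's good that the model judge also gave relevance a 4, but it gave usefulness a 4 as well, and that isn't right. I check whether I feel the same about the tags for the "Little Women" review. The tags don't include all the useful information I expected, and here too I disagree with the judge's score. Again, I think 4 for relevance and 2 for usefulness is appropriate.
What this analysis reveals is a gap between the model judge's ratings of the tags and mine. This gap between model and human is called drift, and it's a challenge for every developer who evaluates intelligent features. Here's why. Say you have an evaluation with 10 samples, and you ask a model judge and a human to rate each one. Each rates on a scale of 1 to 4, and the scores are averaged into an aggregate value. If the model and the human tend to disagree, the average scores drift apart, which is where the name comes from. As the dataset keeps growing, the drift keeps widening, and it becomes harder to tell whether the feature is being evaluated properly.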
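The ten-sample scenario above can be made concrete with a small sketch. The ratings below are hypothetical values chosen only for illustration, and `mean` is a helper written for this example rather than part of the Evaluations framework.

```swift
// Hypothetical ratings on a 1-to-4 scale for ten samples (illustrative values only).
let humanScores = [4, 2, 3, 2, 4, 3, 2, 3, 4, 2]
let judgeScores = [4, 4, 3, 4, 4, 4, 3, 4, 4, 3]

/// Arithmetic mean of integer scores; returns 0 for an empty array.
func mean(_ scores: [Int]) -> Double {
    guard !scores.isEmpty else { return 0 }
    return Double(scores.reduce(0, +)) / Double(scores.count)
}

let humanMean = mean(humanScores)   // average of the human's ratings
let judgeMean = mean(judgeScores)   // average of the judge's ratings
let drift = judgeMean - humanMean   // positive when the judge rates higher than the human

print("human:", humanMean, "judge:", judgeMean, "drift:", drift)
```

A gap between the two averages only says that the raters diverge in aggregate. It does not say on which samples they disagree, which is why the per-sample comparison in the report still matters.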
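To address this, you can align the judge with expert opinion. Now that drift is known to be the problem, I need a way to measure how far the model judge has diverged from the expert's ratings. One way is to line up the judge's ratings against the expert's and mark where the two agree. That can be expressed as a percentage, called accuracy. It's a good measure of agreement when every value on the scoring scale appears equally often. But datasets often have a skewed distribution of scores. If you think about it, a dataset often contains many samples of high-quality output, so a human rater tends to give high scores to the items in it. If a model that tends to give high scores rates a small dataset, the two appear to agree. But when it's applied to a larger dataset with more variation in scores, that tendency toward high scores causes drift. That calls for an alternative to accuracy.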
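To illustrate why accuracy can mislead on a skewed dataset, here is a minimal sketch. The `accuracy` function and the ratings are assumptions made for this example: the expert rates most samples 4, and the "judge" simply answers 4 every time.

```swift
/// Fraction of samples on which two raters gave the same score.
/// Returns nil when the inputs are empty or differ in length.
func accuracy(_ a: [Int], _ b: [Int]) -> Double? {
    guard !a.isEmpty, a.count == b.count else { return nil }
    let matches = zip(a, b).filter { $0 == $1 }.count
    return Double(matches) / Double(a.count)
}

// Hypothetical skewed dataset: the expert rates most samples 4.
let expert = [4, 4, 4, 4, 4, 4, 4, 4, 2, 1]
// A judge that always answers 4, regardless of the sample.
let alwaysFour = Array(repeating: 4, count: expert.count)

let agreement = accuracy(expert, alwaysFour)
print("accuracy:", agreement ?? .nan)
```

The always-4 judge agrees with the expert on eight of ten samples without looking at any of them, so a high accuracy here reflects the skew of the dataset rather than real alignment.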
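A better metric accounts for the skewed weighting of the dataset and for the chance that the model gets the answer right by luck. Fortunately, there's a solution! Cohen's kappa coefficient is a formula popularized in 1960 by the statistician and psychologist Jacob Cohen. Cohen's kappa measures alignment: how well two raters agree. To compute it, you first need the percentage of the time the raters agreed, which is exactly what the accuracy metric from earlier calculates. Then you calculate a new value. Coincidence represents the likelihood that the raters agree by chance, weighted by how likely each particular answer is to appear. So how is it calculated? Start with the accuracy score and subtract the chance that the two raters agree at random. Finally, divide that difference by the complement of random agreement (one minus the chance of agreeing at random), which is the share of cases where agreement cannot be attributed to chance. The result is the alignment.
Cohen's kappa is a powerful way to measure alignment between a model judge and expert opinion, and it gives me an alignment score between myself and the model judge that I can hill-climb. I go back to the start of the hill-climbing loop and begin with the development phase. For that, I set up an evaluation that compares my ratings with the judge's and produces an alignment score. The evaluation has four components: first the dataset, then the subject of the evaluation, then the evaluator, and finally the aggregation of results.
Let's start with the dataset. For this evaluation to work properly, the model judge and I need to rate exactly the same dataset. Because the model judge reviews tags, I need a common set of tags for both of us to review. I have just the dataset: the earlier evaluation contains a collection of reviews and tags. Because that evaluation ran in a test, Xcode generated an attachment containing all of the evaluation data that was produced. I can take that attachment and extract the summary and tag pairs. Once they're extracted, I add my own ratings. The contents of this file can then be passed in as the input to the evaluation.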
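The Cohen's kappa calculation described above can be written compactly. The symbols are chosen for this note and do not appear in the session: $p_o$ is the observed agreement (accuracy), $p_e$ is the chance agreement, and $p_{1,k}$ and $p_{2,k}$ are the fractions of samples to which rater 1 and rater 2 assigned score $k$. The numbers in the worked case are hypothetical.

```latex
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{1,k}\, p_{2,k}
\]

% Worked case (hypothetical numbers): an expert rates ten samples as eight 4s,
% one 2, and one 1, while a judge rates every sample 4.
\[
p_o = \frac{8}{10} = 0.8,
\qquad
p_e = \underbrace{0.8 \times 1.0}_{k=4}
    + \underbrace{0.1 \times 0}_{k=2}
    + \underbrace{0.1 \times 0}_{k=1} = 0.8,
\qquad
\kappa = \frac{0.8 - 0.8}{1 - 0.8} = 0
\]
```

In this case accuracy reports 80% agreement, while kappa reports 0: all of the agreement is what chance alone would produce given how often each rater uses each score.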
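Next, I need to set up the subject of the evaluation. Normally the subject method calls the API for the feature, but because the generated model responses are part of the dataset, it can simply return the tags that were already generated. Next, I define the evaluator. As you might guess, it's exactly the same ModelJudgeEvaluator as in the book tag evaluation. This is where the judge provides its ratings. Finally, I aggregate the results, which is where my ratings are compared with the judge's. That means computing Cohen's kappa, which I can do in a custom aggregation method. Alongside Cohen's kappa, I also compute the mean and the standard deviation of each score dimension, which help show whether the judge's scores are trending up or down.
I set up the evaluation in a test. The test sets an expectation that my ratings and the judge's produce an alignment score of 0.6. I chose this number because, according to statisticians, an alignment score of 0.6 represents a meaningful level of agreement. I run the evaluation to get a baseline for alignment, then check whether the evaluation met expectations. The test failed, which means the expectation was not met. Time to analyze the results in detail again.
I know the alignment score didn't reach the expectation, so I go to the evaluation report for details. As expected, the alignment scores for usefulness and relevance are quite low: the model judge and I don't agree. To see how each sample in the dataset turned out, I open the Assistant Editor and go through the results. As I do, this review of "Frankenstein" catches my eye. There's a sizable gap between my rating of the tags and the judge's. The judge seems to think that tags like self-help and self-improvement are relevant to the story. And psychological is an acceptable search term, but probably not one a user would actually search for. Looking for other items in the dataset with similar problems, I find this review of "The Ramakien". The judge and I agree that this collection of terms is relevant to the content of the book. Where we differ is usefulness.
Terms like "visual-dimension" and "quaint-dignity" are too specific. So what's the problem? I think the model doesn't have enough knowledge on its own to tell a good tag from a bad one, probably because the judge's prompt doesn't provide enough context. So I need to develop a new prompt, and then compare the alignment score of the current prompt with the score of the new one. In Xcode 27, you can now compare the results of two evaluations with each other.
Scientific thinking helps a lot when making comparisons. A scientific experiment has two groups: a control group, which represents the baseline, and an experimental group, which represents the change being compared. The two versions of the instructions can be thought of the same way. The control group is the base prompt, and the experimental group is the newly modified prompt. I need to create a second version of the evaluation that uses the experimental prompt. For the baseline, I use the same evaluation with the same model judge prompt as before. For the experimental prompt, I wrote a more detailed explanation of how to rate a set of tags. It first explains to the judge the context of the app and what is being evaluated. Then it shows examples of good tags and how to identify bad ones. With both prompts complete, I add both evaluations to a test suite so they can run together. I run the suite now and compare the results. When the evaluations finish, I return to the evaluation report. The relevance alignment score has improved. The usefulness alignment score has dropped significantly.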
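Balancing a tradeoff like this is hard, and it takes careful thought about how to proceed. Before analyzing in detail, I check whether the test passed. As the test makes clear, it did not. After further consideration, I keep this prompt change and focus the next iteration on improving the usefulness score. The most effective way to review the results, then, is to compare the two judges' usefulness scores with each other, using the new comparison view in the evaluation report. From the evaluation report, I click the compare button and open the baseline evaluation. The scores for the two prompts appear side by side. What stands out right away is the difference in usefulness scores for this review of "The Picture of Dorian Gray". The model appears to be grading usefulness too harshly. The usefulness column of the experimental evaluation seems to support this guess: every score is either a 3 or a 2. That's too harsh.
I think an effective fix is to be more specific about how each score dimension should be rated. So I make some changes to the experimental evaluation. Before changing it, I applied the experimental evaluation's new prompt to the baseline, so that only one variable changes: the score dimensions. For relevance, I added a slightly longer description that emphasizes the need for genre tags. The usefulness description now emphasizes being more critical of overly specific tags. Once again, I wait for the evaluation to run.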
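Both scores improved significantly over the baseline. These more specific score dimensions look far more helpful. But the alignment goal still hasn't quite been reached, so I run another comparison to see where things can improve. To analyze further, I go back to the experimental evaluation and open the Assistant Editor to look at the results in detail. Working through them, I arrive at the review of "Moby Dick". The relevance scores are starting to align, but the usefulness scores still have room to improve. Some results are promising, while others are still far off. This review of "Frankenstein" is still causing the judge trouble.
What the judge needs now is a sample of my rating patterns, so it can learn to rate on my scale. That means another round of hill-climbing. I've already added the new score dimensions to the baseline evaluation. Next, I revised the main judge prompt with more detail about the goal of the tag generation feature, to help the model understand the problem domain. Then I wrote out a few examples for the model to use as rating guidelines. I kept the number of examples small: with a long list, the alignment score is prone to overfitting, which makes it hard to tell whether the judge truly agrees with me. That sets up a fair comparison, so I run the evaluation and check the results. The scores finally exceed the expectation! The test passes, and I can exit the loop! Now, when the model judge rates tags, I can be confident it judges them good or bad by my standards. That means I can use the judge to evaluate Book Tracker's Book Tagging Service. So far, you've seen how to hill-climb a prompt, improving it a little at a time.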
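Next, I'll show how to improve a feature by means other than the prompt. Book Tracker uses an on-device model to generate tags. Readers are often in all sorts of places when cataloging books, and an on-device model lets them generate tags wherever they are. I want to give the model more context about the book it's generating tags for. With that context, I think the model can generate more relevant and useful tags. What's more, Book Tracker already has the data this requires, because it saves the author name and book title when a review is written. To help the tag generator, I created a tool that retrieves additional information about the book. It provides the book title and author when they're available.
Adding this tool is a form of hill-climbing, because it tries to improve the quality of the feature through an incremental change. This evaluation uses the Book Tagging Evaluation with the improved model judge. But I need a way to compare the quality of the feature without the tool against its quality with the tool. So I make a change to the Book Tagging Service. BookTaggingService now takes a list of tools as input. I set the default to an empty array, so existing evaluations don't need to change. I do need to write a new evaluation, though, to compare the service with the tool against the service without it. Here's the new evaluation I created. It's exactly the same as the other one, except that it passes the new lookup tool in the tools array. All that's left is to define two instances of the evaluation: one without the tool and one with it. Now I evaluate and see whether it's ready to ship. The service that uses the tool met all of its expectations, which is a good result. But Book Tracker's dataset has only 13 pairs of books and reviews, which doesn't cover the variety of books and reviews that users might submit for tagging.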
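In addition, while reviewing the results of the evaluation for the service with the tool, I noticed that although it performs better, the tool doesn't seem to be called in every situation where it's needed. I need a way to check whether the tool is called in the right situations. Fortunately, the Evaluations framework can handle both problems. For details on the APIs for evaluating tool use and generating comprehensive datasets, watch the "Create robust evaluations for agentic apps" video. There you'll learn about the tool call evaluator and how to use the Sample Generator API to test the variety of use cases your app will face.
Before wrapping up, let's review what was covered today. First, hill-climbing works best when you focus on one change at a time, which means treating each iteration of the loop like a scientific experiment. When you can isolate changes, you understand how each part of the feature contributes to its overall quality. Knowing how each part behaves also tells you where changes need to be made, which helps later when resolving bugs or undesirable patterns.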
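Second, this process takes time. Not every change you make leads to a good result. But a failed experiment teaches you just as much as a successful one. Third, good experiments take creativity. There's a lot you can change in an intelligent feature. Within the feature, you can change the instructions, the tools, and the model used to generate the response. On the evaluation side, you can change the dataset, the aggregation method, and the evaluators themselves. Anything is fair game, so consider all of these when thinking about how to hill-climb. Finally, watch out for drift. Evaluating your evaluator may feel a little meta, but a well-calibrated model evaluator saves you time in the long run. A model can generate ratings far faster than a human can, and keeping it aligned gives you a useful signal as your dataset grows to cover more use cases.
To learn more about today's topics, check out the Book Tracker app used here and the evaluation for calibrating a model judge. For a comprehensive overview of all the new APIs, see the developer documentation website. Thank you for learning how to raise your evaluation scores with hill-climbing. Your effort will pay off in a high-quality experience for your users. Thanks for watching, and happy hill-climbing!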

// MARK: - Evaluation
  struct BookTaggingEvaluation: Evaluation {
      func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
          let result = try await BookTaggingService.generateTags(for: sample.promptDescription)
          return ModelSubject(value: result)
      }

      // MARK: - Dataset
      var dataset = ArrayLoader(samples:
          Book.sampleBooks.map { book in
              ModelSample(prompt: book.review, expected: BookTags(tags: book.tags))
          }
      )

      // MARK: - Evaluators & Metrics
      let tagCount = Metric("Tag Count")
      let hasGenreTag = Metric("Has Genre Tag")
      let noDuplicates = Metric("No Duplicates")

      let relevance = ScoreDimension(
          "Relevance",
          description: """
              Whether each tag describes a quality, theme, or tone of the
              book itself rather than incidental details or the reader's
              personal reactions.
              """,
          scale: .numeric([
              4: "Every tag describes the book itself",
              3: "Most tags describe the book, one picks up a reader reaction or minor detail",
              2: "Most tags are surface details or personal reactions, not book descriptors",
              1: "Tags don't meaningfully describe the book"
          ])
      )

      let usefulness = ScoreDimension(
          "Usefulness",
          description: """
              Whether tags are at the right granularity for browsing — broad
              enough that multiple books could share the tag, specific enough
              to help filter.
              """,
          scale: .numeric([
              4: "Every tag could group multiple books while still narrowing a search",
              3: "Most tags are at the right level, one is either too broad or too narrow",
              2: "Most tags are too broad to filter or too narrow to group",
              1: "Tags would not help with browsing"
          ])
      )

      var evaluators: Evaluators {
          // 1. Tag count is within the required 3–8 range
          Evaluator { _, subject in
              let count = subject.value.tags.count
              if (3...8).contains(count) {
                  return tagCount.passing(rationale: "\(count) tags")
              }
              return tagCount.failing(rationale: "Got \(count) tags, expected 3–8")
          }
  
          // 2. At least one tag identifies the genre or literary form
          Evaluator { _, subject in
              let tags = subject.value.tags.map { $0.lowercased() }
              let knownGenres = await BookTaggingService.knownGenres
              for tag in tags {
                  if knownGenres.contains(tag) {
                      return hasGenreTag.passing(rationale: "Matched \(tag)")
                  }
              }
              return hasGenreTag.failing()
          }

          // 3. No duplicate tags
          Evaluator { _, subject in
              // Compare case-insensitively so "Gothic" and "gothic" count as duplicates.
              let tags = subject.value.tags
              let duplicateCount = tags.count - Set(tags.map { $0.lowercased() }).count
              if duplicateCount > 0 {
                  return noDuplicates.failing(rationale: "Found \(duplicateCount) duplicates")
              }
              return noDuplicates.passing()
          }
  
          // 4. Overall tag quality, scored by the model judge on relevance and usefulness
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are evaluating automatically generated tags for Book Tracker, a personal
                      book tracking app. Users write a short summary of their reading
                      experience, and the app generates tags to make their library browsable.
                      A good tag describes the book itself — its genre, themes, tone, or
                      setting. A bad tag picks up incidental details or the reader's personal
                      reactions that don't describe the book.
                      """,
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }

      // MARK: - Analysis
      func aggregateMetrics(using aggregator: inout MetricsAggregator) {
          aggregator.group("Heuristics") { group in
              group.computeMean(of: tagCount)
              group.computeMean(of: hasGenreTag)
              group.computeMean(of: noDuplicates)
          }
          aggregator.group("Quality") { group in
              group.computeMean(of: relevance.metric)
              group.computeMean(of: usefulness.metric)
          }
      }
  }

4:05 - Refined Relevance & Usefulness score dimensions

let relevance = ScoreDimension(
      "Relevance",
      description: """
          Whether each tag describes the book itself — its genre, themes,
          tone, or setting — rather than the reader's reactions, meta-
          commentary about the review, or facts about the author. A book
          can be "suspenseful" (a property of the text); a reader is
          "exhausted" (a reaction). Mis-labeling the genre is a serious failure.
          """,
      scale: .numeric([
          4: "Every tag describes the book itself",
          3: "Most tags describe the book, one picks up a reader reaction or minor detail",
          2: "Most tags are surface details or personal reactions, not book descriptors",
          1: "Tags don't meaningfully describe the book"
      ])
  )

  let usefulness = ScoreDimension(
      "Usefulness",
      description: """
          Whether tags work as library shelf labels — broad enough that
          several books could plausibly share the tag, specific enough to
          meaningfully narrow a search. Standard genre and theme tags work;
          made-up phrases, character names, hyper-specific descriptors, and
          overly generic words like "interesting" don't.
          """,
      scale: .numeric([
          4: "Every tag could group multiple books while still narrowing a search",
          3: "Most tags are at the right level, one is either too broad or too narrow",
          2: "Most tags are too broad to filter or too narrow to group",
          1: "Tags would not help with browsing"
      ])
  )

11:56 - The alignment dataset, extracted to JSON

// Model judge alignment dataset
  [
    {
      "input": "I have read this book more times than I can count…",
      "response": "[\"literary-fiction\", \"historical-fiction\", \"family-drama\", \"romantic-drama\", \"character-driven\", \"emotional-intensity\", \"multigenerational-narrative\", \"penned-by-a-woman\"]"
    }
    // ... add your expert ratings to each entry
  ]
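One way to load a file like this is with `Codable`. The sketch below is an assumption about how that could look, not the sample project's loader: `AlignmentEntry` is a hypothetical type, and the two expert score keys are hypothetical names (borrowed from the `BookTagJudgmentValue` fields used elsewhere on this page) for the ratings you add by hand. Note that `response` is itself a JSON array encoded as a string, so it needs a second decoding step.

```swift
import Foundation

/// One entry of the alignment dataset. `input` and `response` mirror the JSON above;
/// the two expert score keys are hypothetical names for the hand-added ratings.
struct AlignmentEntry: Decodable {
    let input: String
    let response: String            // a JSON array of tags, encoded as a string
    let expertRelevanceScore: Int
    let expertUsefulnessScore: Int

    /// Decodes the embedded `response` string into an array of tags.
    /// Returns an empty array if the string is not a valid JSON array of strings.
    var tags: [String] {
        guard let data = response.data(using: .utf8),
              let tags = try? JSONDecoder().decode([String].self, from: data) else { return [] }
        return tags
    }
}

// A shortened, hypothetical entry with expert ratings added.
let json = #"""
[
  {
    "input": "I have read this book more times than I can count…",
    "response": "[\"literary-fiction\", \"family-drama\"]",
    "expertRelevanceScore": 4,
    "expertUsefulnessScore": 2
  }
]
"""#

let entries = try JSONDecoder().decode([AlignmentEntry].self, from: Data(json.utf8))
print(entries[0].tags)
```

Keeping the expert ratings in the same file as the tags means the judge and the expert are guaranteed to be rating the same samples.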

12:31 - The judge alignment evaluation: dataset, subject, evaluator

// Model judge alignment evaluation
  struct BookTagJudgmentCalibration: Evaluation {

      // MARK: Dataset — load the extracted summary/tag pairs
      static let samples: [ModelSample<BookTagJudgmentValue>] = {
          guard let url = Bundle(for: BundleToken.self).url(
                  forResource: "BookTaggingEvaluation-extracted", withExtension: "json"),
                let data = try? Data(contentsOf: url) else { return [] }
          // Build ModelSample array (adding expert ratings)
          // ...
      }()

      var dataset: some Loader { ArrayLoader(samples: Self.samples) }
  
      // MARK: Capture Subject — tags are already generated, so just return them
      func subject(from sample: ModelSample<BookTagJudgmentValue>) async throws -> ModelSubject<BookTagJudgmentValue> {
          ModelSubject(value: sample.expected ?? BookTagJudgmentValue(
              tags: [], expertRelevanceScore: 0, expertUsefulnessScore: 0))
      }

      // MARK: Evaluators — the same model judge as the book-tags evaluation
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: "You are evaluating automatically generated tags for Book Tracker…",
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Expected Tags": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

13:00 - Cohen's kappa aggregation

func aggregateMetrics(using aggregator: inout MetricsAggregator) {
      let expertRelevance = Self.samples.map { Double($0.expected?.expertRelevanceScore ?? 0) }
      let expertUsefulness = Self.samples.map { Double($0.expected?.expertUsefulnessScore ?? 0) }

      aggregator.group("Relevance") { group in
          group.computeMean(of: relevance.metric)
          group.computeStandardDeviation(of: relevance.metric)
          group.custom(of: relevance.metric, label: "Relevance Alignment Score") { judge in
              cohensKappa(ratings1: expertRelevance, ratings2: judge) ?? 0
          }
      }
      aggregator.group("Usefulness") { group in
          group.computeMean(of: usefulness.metric)
          group.computeStandardDeviation(of: usefulness.metric)
          group.custom(of: usefulness.metric, label: "Usefulness Alignment Score") { judge in
              cohensKappa(ratings1: expertUsefulness, ratings2: judge) ?? 0
          }
      }
  }
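The aggregation above calls `cohensKappa(ratings1:ratings2:)`, whose body is not shown on this page. The following is a minimal sketch of what such a helper might look like, assuming both arrays hold one score per sample in the same order and that scores are whole numbers stored as `Double`. It is an illustration of the standard formula, not the sample project's implementation.

```swift
/// Cohen's kappa for two raters over the same samples.
/// Returns nil when the inputs are empty, differ in length, or when chance
/// agreement is 1 (both raters always give the same single score), where kappa is undefined.
func cohensKappa(ratings1: [Double], ratings2: [Double]) -> Double? {
    guard !ratings1.isEmpty, ratings1.count == ratings2.count else { return nil }
    let n = Double(ratings1.count)

    // Observed agreement: fraction of samples where both raters gave the same score.
    let observed = Double(zip(ratings1, ratings2).filter { $0 == $1 }.count) / n

    // How often each rater used each score.
    var counts1: [Double: Double] = [:]
    var counts2: [Double: Double] = [:]
    for rating in ratings1 { counts1[rating, default: 0] += 1 }
    for rating in ratings2 { counts2[rating, default: 0] += 1 }

    // Chance agreement: sum over scores of the product of each rater's usage rate.
    let expected = counts1.reduce(0.0) { sum, entry in
        sum + (entry.value / n) * ((counts2[entry.key] ?? 0) / n)
    }

    guard expected < 1 else { return nil }
    return (observed - expected) / (1 - expected)
}

// Hypothetical ratings: a skewed expert and a judge that always answers 4.
let expertRatings: [Double] = [4, 4, 4, 4, 4, 4, 4, 4, 2, 1]
let judgeRatings = [Double](repeating: 4, count: expertRatings.count)
let kappa = cohensKappa(ratings1: expertRatings, ratings2: judgeRatings)
print("kappa:", kappa ?? .nan)
```

Returning an optional matches the `?? 0` at the call sites above. Because the comparison is exact equality, a judge that emits fractional scores would need rounding before being passed in.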

13:24 - The judge calibration test

// Model judge alignment tests
  @Suite("Book Tag Judge Calibration")
  struct BookTagJudgmentCalibrationTests {
      static let evaluation = BookTagJudgmentCalibration()

      @Test("Judge Calibration", .evaluates(evaluation))
      func evaluateJudgeCalibration() async throws {
          let result = EvaluationContext.current.result

          let usefulnessMetric = BookTagJudgmentCalibrationTests.evaluation.usefulness.metric
          let relevanceMetric = BookTagJudgmentCalibrationTests.evaluation.relevance.metric

          // The labels must match the ones registered in aggregateMetrics(using:).
          #expect(result.aggregateValue(.custom(label: "Relevance Alignment Score")) > 0.6)
          #expect(result.aggregateValue(.custom(label: "Usefulness Alignment Score")) > 0.6)
      }
  }

16:33 - The experimental judge prompt

// Experimental evaluation
  struct BookTagJudgmentCalibrationExperimental: Evaluation {
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: .default,
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are an experienced reader and librarian evaluating tags
                      automatically generated for Book Tracker... Score the tag set on two
                      independent dimensions: Relevance and Usefulness.

                      ## What a good tag looks like
                      - Genre/form, theme/subject, tone/atmosphere, setting/era

                      ## Common failure modes
                      - Reader reactions, meta-commentary, author facts, genre contradictions
                      """,   // ← full prompt is ~40 lines; abbreviated here
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Book Review": input.promptDescription,
                       "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }

20:12 - Few-shot worked examples in the judge prompt

struct ExperimentalBookTagJudgmentCalibration: Evaluation {
      var evaluators: Evaluators {
          ModelJudgeEvaluator(
              judge: SystemLanguageModel(),
              dimensions: [relevance, usefulness],
              prompt: ModelJudgePrompt(
                  instructions: """
                      You are calibrating with an expert librarian who scores
                      automatically generated tags for Book Tracker... Your goal is to
                      match how the librarian scores. Use the worked examples to calibrate.

                      ## Worked examples
                      ### Example A — clean fit (Pride and Prejudice)
                      Tags: romance, historical-fiction, love, redemption, passion
                      Librarian: Relevance 4, Usefulness 4

                      ### Example E — flat genre contradiction (Frankenstein)
                      Tags: horror, science-fiction, ... self-help, self-improvement
                      Librarian: Relevance 2, Usefulness 3
                      ... (6 examples A–F; keep the set small to avoid overfitting)
                      """,   // ← full prompt is ~60 lines; abbreviated here
                  evaluationTarget: { output in output.tags.joined(separator: ", ") },
                  reference: { input, _ in
                      ["Book Review": input.promptDescription,
                       "Tags Generated for the Review": input.expected?.tags.joined(separator: ", ") ?? ""]
                  }
              )
          )
      }
  }


22:03 - The BookLookupTool

// Book Information Lookup Tool
  struct BookLookupTool: Tool {
      let name = "lookupBook"
      let description = "Looks up the title and author of a book given distinguishing details — such as character names, settings, quoted lines, or notable plot points — extracted from a reader's review."

      @Generable
      struct Arguments {
          @Guide(description: "Distinguishing details from the review that identify the book, such as character names, settings, quoted lines, or notable plot points.")
          var details: String
      }
  
      @Generable
      struct Output {
          @Guide(description: "The title of the identified book, or an empty string if no match was found.")
          var title: String

          @Guide(description: "The author of the identified book, or an empty string if no match was found.")
          var author: String
      }
  
      func call(arguments: Arguments) async throws -> Output {
          let needles = arguments.details
              .lowercased()
              .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
              .map(String.init)
              .filter { $0.count >= 4 }

          let best = Book.sampleBooks
              .map { book -> (book: Book, score: Int) in
                  let review = book.review.lowercased()
                  let score = needles.reduce(0) { partial, needle in
                      partial + (review.contains(needle) ? 1 : 0)
                  }
                  return (book, score)
              }
              .max(by: { $0.score < $1.score })

          guard let match = best, match.score > 0 else {
              return Output(title: "", author: "")
          }
          return Output(title: match.book.title, author: match.book.author)
      }
  }
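The matching logic inside `call(arguments:)` can be tried on its own, without the model or the `Tool` protocol. The sketch below restates it as plain functions under stated assumptions: `SketchBook` is a hypothetical stand-in for the app's `Book` model, the review texts are invented for illustration, and `keywords(from:)` and `bestMatch(for:in:)` are names introduced here.

```swift
import Foundation

/// Minimal stand-in for the app's `Book` model (hypothetical; only the fields used here).
struct SketchBook {
    let title: String
    let author: String
    let review: String
}

/// Splits free text into lowercase keywords of four or more characters.
func keywords(from details: String) -> [String] {
    details.lowercased()
        .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        .map(String.init)
        .filter { $0.count >= 4 }
}

/// Returns the book whose review contains the most keywords, or nil if none match.
func bestMatch(for details: String, in books: [SketchBook]) -> SketchBook? {
    let needles = keywords(from: details)
    let scored = books.map { book -> (book: SketchBook, score: Int) in
        let review = book.review.lowercased()
        return (book, needles.filter { review.contains($0) }.count)
    }
    guard let best = scored.max(by: { $0.score < $1.score }), best.score > 0 else { return nil }
    return best.book
}

// Invented reviews, for illustration only.
let shelf = [
    SketchBook(title: "Treasure Island", author: "Robert Louis Stevenson",
               review: "Pirates, a map, and Long John Silver's shifting loyalty."),
    SketchBook(title: "Little Women", author: "Louisa May Alcott",
               review: "The March sisters grow up during the Civil War."),
]
let match = bestMatch(for: "Long John Silver and the pirates", in: shelf)
print(match?.title ?? "no match")
```

Dropping words shorter than four characters filters out most articles and conjunctions, so common words such as "and" or "the" do not inflate a book's score. Substring matching is deliberately loose, which suits a small local catalog but would likely need tightening for a larger one.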

22:36 - BookTaggingService with a tools parameter

// Book Tagging Service
  struct BookTaggingService {
      static func generateTags(for review: String, tools: [any Tool] = []) async throws -> BookTags {
          let prompt = tagsPrompt(review: review)
          let session = LanguageModelSession(
              model: SystemLanguageModel(guardrails: .permissiveContentTransformations),
              tools: tools,
              instructions: instructions
          )
          let response = try await session.respond(to: prompt, generating: BookTags.self)
          return response.content
      }
  }

22:57 - Evaluation with the lookup tool

// Evaluation of tags with tool
  struct BookTaggingWithLookupEvaluation: Evaluation {
      func subject(from sample: ModelSample<BookTags>) async throws -> ModelSubject<BookTags> {
          let result = try await BookTaggingService.generateTags(
              for: sample.promptDescription,
              tools: [BookLookupTool()]
          )
          return ModelSubject(value: result)
      }
      // ... same dataset, evaluators, and aggregation as BookTaggingEvaluation
  }

23:09 - Compare with/without the tool in one suite

@Suite("Book Tag Evaluations")
  struct BookTagEvaluationTests {
      static let evaluation = BookTaggingEvaluation()
      static let lookupEvaluation = BookTaggingWithLookupEvaluation()

      @Test("Book Tag Evaluations", .evaluates(evaluation, info: evaluationInfo))
      func evaluateBookTagging() async throws {
          let result = EvaluationContext.current.result
          let rangeMetric = BookTagEvaluationTests.evaluation.tagCount
          let dupeMetric = BookTagEvaluationTests.evaluation.noDuplicates
          #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
          #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
      }

      @Test("Book Tag Evaluations (with BookLookupTool)", .evaluates(lookupEvaluation, info: lookupEvaluationInfo))
      func evaluateBookTaggingWithLookup() async throws {
          let result = EvaluationContext.current.result
          let rangeMetric = BookTagEvaluationTests.lookupEvaluation.tagCount
          let dupeMetric = BookTagEvaluationTests.lookupEvaluation.noDuplicates
          #expect(result.aggregateValue(.mean(of: rangeMetric)) >= 0.8)
          #expect(result.aggregateValue(.mean(of: dupeMetric)) == 1)
      }
  }

- 0:00 - Introduction
  - Introduces hill-climbing: iteratively improving an intelligence feature using evaluation scores as a guide (develop, run, analyze), with scientific thinking applied to that loop. Assumes you've already built an evaluation pipeline (see "Meet the Evaluations framework").
- 2:42 - BookTracker's tagging problem
  - Revisits BookTracker, whose tag generator produces tags that miss key themes or reflect the reader's feelings rather than the book. The existing evaluation judges tag quality via score dimensions (Relevance, Usefulness) and a ModelJudgeEvaluator.
- 5:27 - Analyzing the evaluation results
  - Adds two reviews to the dataset, runs the evaluation (Swift Testing #expect), and uses the Xcode evaluation report and Assistant Editor to compare generated tags against expected ones, revealing that the human and the model judge disagree on usefulness.
- 8:26 - Drift between judge and human
  - That disagreement is drift: the divergence between a model judge's ratings and an expert's. As the dataset grows, drift widens, making it hard to trust the evaluation, so the judge must be aligned to expert opinion.
- 9:37 - Measuring drift with Cohen's kappa
  - Accuracy alone misleads on unevenly distributed scores (a judge that tends to score high looks aligned by luck). Cohen's kappa coefficient measures true alignment by subtracting the chance of random agreement from accuracy and normalizing by one minus that chance, which makes it a robust drift metric.
- 12:26 - Building a judge alignment evaluation
  - Builds an evaluation comparing the presenter's ratings to the judge's over a shared dataset: extract summary/tag pairs from the prior run's attachment, add human ratings, return the already-generated tags as the subject, reuse the same ModelJudgeEvaluator, and aggregate Cohen's kappa (plus mean and standard deviation), targeting an alignment of 0.6.
- 15:16 - Analyzing alignment failures
  - The alignment test fails. Drilling into the report (for example Frankenstein, The Ramakien) shows the judge rating overly specific or off-theme tags too highly; the judge's prompt lacks the context to tell a good tag from a bad one.
- 17:16 - Comparative evaluation: control vs experimental
  - Xcode 27 can compare two evaluations like a controlled experiment: a baseline (control) prompt versus an experimental prompt that adds app context plus examples of good and bad tags. Running both shows relevance alignment improved while usefulness alignment dropped, a tradeoff to weigh.
- 19:12 - Refining the scoring dimensions
  - Keeping the prompt change, the side-by-side comparison view reveals the judge grading usefulness too harshly. After applying the new prompt to the baseline to isolate one variable, the ScoreDimension descriptions are sharpened (emphasizing genre tags; being critical of overly specific ones), improving both scores.
- 21:23 - Adding few-shot examples to the judge
  - Still short of the goal, the judge prompt is grounded with the feature's purpose and a few worked examples of how the presenter rates, deliberately few to avoid overfitting the alignment score. Scores finally exceed expectations, so the judge is trusted and the loop exits.
- 23:38 - Going beyond prompts: adding a tool
  - Hill-climbing isn't only about prompts: to give the on-device tag model more context, a BookLookupTool supplies the title and author. BookTaggingService gains a tools parameter (defaulting to empty), and a second evaluation compares the feature with and without the tool. The tool version scores better, though the small 13-sample dataset and unverified tool calls point to "Create robust evaluations for agentic apps."
- 27:17 - Next steps
  - Think like a scientist (one change at a time), invest the time (failed experiments still inform), be creative (instructions, tools, models, datasets, aggregations, and evaluators are all fair game), and watch for drift. Download the Book Tracker sample and review the documentation.
