iOS 26 FoundationModels: Comprehensive Swift/SwiftUI Reference

A complete guide to Apple's on-device language model framework — from availability checking through guided generation, tool calling, and production patterns

28 February 2026 · 75 min read
AI-Ometer: 90% AI-authored

Overview

FoundationModels is Apple's framework for accessing the on-device large language model that powers Apple Intelligence. Introduced at WWDC 2025, it gives apps direct access to the same model behind Writing Tools, Smart Replies, and Mail Summaries — running entirely on-device, with no network requests and no data leaving the device.

Key characteristics:

  • On-device only — no cloud fallback, no API key, no latency from network round-trips
  • Privacy-first — all inference happens locally; Apple never sees your prompts or responses
  • Availability-gated — requires Apple Intelligence to be enabled; not all devices qualify
  • iOS 26+ only — requires iPhone 15 Pro / iPhone 15 Pro Max or later (or equivalent iPad)
  • Shared resource — the model serves all apps; system may rate-limit under load

What it excels at:

  • Text correction, normalisation, and reformatting
  • Entity extraction and classification
  • Summarisation of short-to-medium content
  • Structured output generation (via guided generation)
  • Context-aware suggestions and completions

What it is not:

  • A replacement for frontier models (GPT-4, Claude, Gemini) for complex reasoning
  • A cloud API — if the model is unavailable, there is no fallback infrastructure
  • A general-purpose search or retrieval system

Minimum requirements:

  • iOS 26.0+, iPadOS 26.0+, macOS Tahoe 26.0+
  • Xcode 26.0+
  • Device must support Apple Intelligence (iPhone 15 Pro or later)
  • Apple Intelligence must be enabled in Settings

Contents

  1. Availability & Setup
    SystemLanguageModel, availability cases, AnyObject? pattern
  2. Sessions & Basic Prompting
    LanguageModelSession, Instructions, Prompt, .content gotcha
  3. Prompt Engineering for On-Device Models
    What works, what doesn't, #Playground
  4. Guided Generation (@Generable)
    @Guide, constraints, PartiallyGenerated
  5. Streaming
    streamResponse(), ResponseStream, .collect()
  6. Generation Options
    temperature, SamplingMode, maximumResponseTokens
  7. Tool Calling
    Tool protocol, pre-fetch vs inject, context cost
  8. Token Budget
    tokenUsage(for:), contextSize, overflow strategies
  9. The Transcript
    Transcript.Entry, saving/resuming sessions
  10. Failure Modes & Graceful Degradation
    GenerationError, never-throws pattern
  11. Testing
    Four test categories, .disabled() on-device tests
  12. Example Use Cases
    10 concrete patterns across app domains
  13. Quick Reference & Anti-Patterns
    Cheatsheet + 10 things not to do
  14. Context Engineering
    Select/inject/compress/pre-summarise
  15. Advanced Patterns
    Actor isolation, @Generable enums with associated values, Observable monitoring, PromptRepresentable chaining, bounded domain injection

Part 1: Availability & Setup

SystemLanguageModel.default

SystemLanguageModel.default is the singleton entry point for the on-device language model. You do not initialise it — it is a static property you reference directly. Everything in FoundationModels starts here.

let model = SystemLanguageModel.default

switch model.availability {
case .available:
    // model is ready — create a session and run prompts
case .unavailable(let reason):
    // handle the specific reason
@unknown default:
    break
}

SystemLanguageModel is an Observable final class, so you can observe .availability changes in SwiftUI via @State or inside .task {} blocks without any special wiring.
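As a sketch of that wiring (the view name is mine): because the model is Observable, a view that reads .availability in its body re-renders when availability changes, with no publisher or notification plumbing.

```swift
import SwiftUI
import FoundationModels

struct AIStatusView: View {
    private let model = SystemLanguageModel.default

    var body: some View {
        // Reading .availability here registers the dependency;
        // SwiftUI re-evaluates body when it changes.
        switch model.availability {
        case .available:
            Text("AI features ready")
        case .unavailable(let reason):
            Text("AI unavailable: \(String(describing: reason))")
        @unknown default:
            Text("AI unavailable")
        }
    }
}
```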


Availability Cases

SystemLanguageModel.Availability is an enum with two top-level cases: .available and .unavailable(UnavailableReason). Always handle @unknown default — Apple may add cases in future OS versions.

.available

The model is downloaded, Apple Intelligence is enabled, and the device is eligible. Create a LanguageModelSession and proceed.

.unavailable(.deviceNotEligible)

The hardware does not support Apple Intelligence. This applies to iPhone 14 and earlier, and equivalent iPad/Mac models. This is permanent for the lifetime of the device — no amount of waiting or retrying will change it. When you see this case, remove the AI code path from your UI entirely and show a permanent alternative experience.

.unavailable(.appleIntelligenceNotEnabled)

The device is eligible but the user has not turned on Apple Intelligence in Settings > Apple Intelligence & Siri. This is a user choice, not a hardware limitation. You can optionally prompt the user to enable it:

// Optionally deep-link to Settings
if let url = URL(string: UIApplication.openSettingsURLString) {
    UIApplication.shared.open(url)
}

Respect the user's decision. If they choose not to enable it, show the non-AI path without nagging.

.unavailable(.modelNotReady)

This is the most misunderstood case. It does not mean the model is permanently unavailable — it means the model weights are currently downloading. There is no programmatic download API. You cannot trigger the download, request it, or track its progress. The OS manages download timing based on network conditions, battery level, device temperature, and system load. Download can take minutes to hours.

Treat .modelNotReady as a transient state. Do not show a permanent "not supported" message. Instead, show a softer "not available right now — check back later" state and retry on the next app launch or session.
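One minimal way to implement "retry on the next session" (view and state names are illustrative): re-check availability each time the scene becomes active, so a download that finished in the background is picked up without a relaunch.

```swift
import SwiftUI
import FoundationModels

struct RootView: View {
    @Environment(\.scenePhase) private var scenePhase
    @State private var aiReady = false

    var body: some View {
        Text(aiReady ? "AI ready" : "AI not available right now — check back later")
            .task {
                // Initial check on appearance
                aiReady = SystemLanguageModel.default.isAvailable
            }
            .onChange(of: scenePhase) { _, phase in
                // Re-check whenever the app returns to the foreground
                guard phase == .active else { return }
                aiReady = SystemLanguageModel.default.isAvailable
            }
    }
}
```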

func checkAvailability() -> String {
    switch SystemLanguageModel.default.availability {
    case .available:
        return "Ready"
    case .unavailable(let reason):
        switch reason {
        case .deviceNotEligible:
            return "Device not supported"
        case .appleIntelligenceNotEnabled:
            return "Enable Apple Intelligence in Settings"
        case .modelNotReady:
            return "Downloading... check back soon"
        @unknown default:
            return "Unavailable"
        }
    @unknown default:
        return "Unknown"
    }
}

isAvailable Convenience Property

SystemLanguageModel.default.isAvailable is a Bool shorthand. Use it when you only need to gate a code path and don't need to distinguish between unavailability reasons:

guard SystemLanguageModel.default.isAvailable else { return }
// proceed with AI code path

If you need to communicate why the model is unavailable to the user, use the full .availability switch instead.


UseCase.general vs .contentTagging

SystemLanguageModel.UseCase selects a specialised version of the model. There are two options:

SystemLanguageModel.UseCase.general — the default, and what you get from SystemLanguageModel.default: a general-purpose model for writing assistance, analysis, correction, extraction, and summarisation.

SystemLanguageModel.UseCase.contentTagging — specialised for classification and extraction tasks. When you use this model, it always responds with tags — it is tuned to identify topics, emotions, actions, and objects. Use this when you want to categorise or label content rather than transform or generate it.

// General model (default — used for most tasks)
let model = SystemLanguageModel.default

// Content tagging model — for classification/extraction
let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: taggingModel)

Do not use .contentTagging for text correction or generation tasks. The model will produce tags rather than prose, regardless of your instructions.
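Since the tagging model always responds with tags, it pairs naturally with guided generation. A sketch (the output type and its property names are illustrative, not a framework contract):

```swift
import FoundationModels

@Generable
struct ContentTags {
    @Guide(description: "Main topics, canonical lowercase", .maximumCount(5))
    var topics: [String]

    @Guide(description: "Actions described in the text", .maximumCount(5))
    var actions: [String]
}

func tag(_ text: String) async throws -> ContentTags {
    // Session backed by the tagging-specialised model
    let session = LanguageModelSession(
        model: SystemLanguageModel(useCase: .contentTagging)
    )
    let response = try await session.respond(
        to: Prompt { text },
        generating: ContentTags.self
    )
    return response.content
}
```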


Guardrails

SystemLanguageModel.Guardrails controls content safety filtering on model inputs and outputs. There are two presets:

SystemLanguageModel.Guardrails.default — the standard setting. Blocks unsafe content in both prompts and responses. When triggered, throws LanguageModelSession.GenerationError.guardrailViolation(_:).

SystemLanguageModel.Guardrails.permissiveContentTransformations — allows potentially sensitive source material to pass through for string generation tasks. Use this when your app legitimately processes user-generated content that might incidentally contain sensitive words (e.g., a chat moderation tool, a study app covering difficult topics). This mode only applies to string output — guided generation (@Generable) always uses default guardrails.

// Default guardrails (most apps)
let model = SystemLanguageModel.default

// Permissive mode — for apps that must process sensitive source text
let permissiveModel = SystemLanguageModel(guardrails: .permissiveContentTransformations)

Even in permissive mode, the model may still refuse certain content — it retains its own layer of safety separate from the guardrail system.


The AnyObject? Pattern for SwiftUI

iOS 26 types require @available(iOS 26, *) annotations. Annotating a @State property with @available propagates that constraint to the entire containing view struct — meaning the whole view requires iOS 26, which is likely not what you want.

The solution is to store iOS 26-only service instances as AnyObject? and cast them back inside #available guards:

// DON'T do this — @available propagates to the whole view
@available(iOS 26, *)
@State private var service: MyAI26Service?  // ❌ forces view to require iOS 26

// DO this instead — no @available constraint on the view struct
@State private var service: AnyObject?      // ✅ clean — AnyObject has no availability

// Store the service (inside a #available guard)
if #available(iOS 26, *) {
    self.service = MyAI26Service()
}

// Use the service (inside a #available guard)
if #available(iOS 26, *),
   let s = self.service as? MyAI26Service {
    let result = try await s.process(text)
}

This pattern lets you write a single view struct that gracefully degrades on older OS versions without any @available annotation on the view itself.


Part 2: Sessions & Basic Prompting

LanguageModelSession — Init Variants

LanguageModelSession is the object you interact with to send prompts and receive responses. The two most common init patterns are:

Fresh session (most common):

// With builder-style instructions
let session = LanguageModelSession {
    "You are a BJJ terminology corrector."
    "Fix misrecognised terms to their canonical spellings."
}

// With a specific model
let session = LanguageModelSession(model: SystemLanguageModel.default) {
    "You are a motivational coach."
}

// With string instructions — also valid
let session = LanguageModelSession(
    instructions: "You are a code review assistant."
)

Resume from transcript (multi-turn):

// Rehydrate a session from a saved transcript to continue a conversation
let session = LanguageModelSession(
    model: SystemLanguageModel.default,
    tools: [],
    transcript: savedTranscript
)

The session is an Observable final class. It is also Sendable, so you can safely hold a reference from a @MainActor context and call its methods from async tasks.
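A sketch of what that buys you in SwiftUI, leaning on the session's isResponding flag (an Observable property on LanguageModelSession; verify the name against the SDK you build with): the view disables its send button while a request is in flight, with no manual loading state.

```swift
import SwiftUI
import FoundationModels

struct PromptBar: View {
    let session: LanguageModelSession
    @State private var input = ""

    var body: some View {
        HStack {
            TextField("Ask something", text: $input)
            Button("Send") {
                // Capture the prompt text before the async hop
                Task { [prompt = input] in
                    _ = try? await session.respond(to: prompt)
                }
            }
            // Observable flag: true while the model is generating
            .disabled(session.isResponding)
        }
    }
}
```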


Instructions

Instructions defines the model's persona, rules, and domain — what the model is and how it behaves. Set it once at session creation. Instructions apply to every prompt in that session.

Use @InstructionsBuilder (result builder syntax) to compose instructions from multiple strings:

let instructions = Instructions {
    "You are a BJJ terminology corrector."
    "Fix misrecognised BJJ terms to their canonical spellings."
    "Common corrections: kimora→Kimura, half card→Half Guard, darce→D'Arce"
}

Or pass a plain String directly:

let session = LanguageModelSession(
    instructions: "You are a concise summariser. Respond in three sentences maximum."
)

Instructions are not the user's question — that is the Prompt. Instructions define the container; the prompt fills it.

The framework injects instructions as the system-level context for the model. The model follows instructions at higher priority than prompt content, so put your constraints and rules in instructions, not in the prompt itself.


Prompt

Prompt is the user's input — the actual question, text, or content you want the model to process. Use @PromptBuilder for dynamic construction:

// Builder style — for dynamic prompts
let prompt = Prompt {
    "Correct this transcript: \(rawText)"
}

// String literal — also valid
let response = try await session.respond(to: "Summarise the following: \(article)")

Prompt strings accept string interpolation. Keep prompts concise — every token in a prompt consumes context budget that competes with the response.


The Critical .content Gotcha

respond(to:options:) returns LanguageModelSession.Response<T>, not T directly. Response<T> is a wrapper struct. The actual generated value is at .content.

This is the single most common mistake when first using the framework:

// WRONG — response is Response<String>, not String
let text = try await session.respond(to: prompt)
print(text.uppercased())  // compile error: Response<String> has no uppercased()

// RIGHT
let response = try await session.respond(to: prompt)
let text = response.content  // String
print(text.uppercased())

// With typed guided generation
let response = try await session.respond(
    to: prompt,
    generating: MyOutputType.self
)
let value = response.content  // MyOutputType

Internalise this: respond() always returns Response<T>. Always unwrap .content before using the value.


Response.rawContent

response.rawContent gives you the unprocessed GeneratedContent before guided generation parsing. This is the raw structured output the model produced, before it was decoded into your @Generable type. Use it for debugging when a response fails to parse or produces unexpected values — it shows you exactly what the model generated.


Session-Per-Call vs Persistent Sessions

This is a key architectural decision. Get it right at design time.

Session-per-call — create a new LanguageModelSession for each request. No conversation history accumulates. This is the correct pattern for the vast majority of use cases: text correction, extraction, summarisation, classification, entity detection. Each request is independent.

// Session-per-call — correct for stateless tasks
func normalise(_ text: String) async throws -> String {
    let session = LanguageModelSession {
        "Fix speech-to-text errors in BJJ transcripts."
        "Corrections: kimora→Kimura, half card→Half Guard, darce→D'Arce"
    }
    let response = try await session.respond(to: Prompt { text })
    return response.content
}

Persistent session — keep the LanguageModelSession alive across multiple respond() calls. The session accumulates its Transcript as you go, so the model remembers previous exchanges. Use this only when the model needs that history to answer correctly — for example, a coaching chatbot where the user refers to something they said three turns ago.

// Persistent session — for multi-turn conversation
@Observable
class ChatAssistant {
    private let session = LanguageModelSession {
        "You are a BJJ coach assistant."
        "Help the user analyse and improve their game based on their training logs."
    }

    func chat(_ message: String) async throws -> String {
        let response = try await session.respond(to: Prompt { message })
        return response.content  // transcript accumulates automatically
    }
}

The risk with persistent sessions: the transcript grows with each exchange and eventually hits the context window limit, throwing LanguageModelSession.GenerationError.exceededContextWindowSize. For long-running conversations, you need a strategy for trimming or summarising history. Session-per-call has no such risk.

Default to session-per-call. Only reach for persistent sessions when you have a concrete requirement for cross-turn memory.
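When you do need a persistent session, one simple overflow strategy (a sketch; a real app might inject a summary of the old transcript instead of dropping history) is to catch the context-window error and start fresh with the same instructions:

```swift
import FoundationModels

@Observable
final class ResilientChat {
    private var session: LanguageModelSession

    init() {
        session = Self.makeSession()
    }

    private static func makeSession() -> LanguageModelSession {
        LanguageModelSession { "You are a BJJ coach assistant." }
    }

    func send(_ message: String) async throws -> String {
        do {
            return try await session.respond(to: message).content
        } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
            // Transcript overflowed: drop history and retry once.
            session = Self.makeSession()
            return try await session.respond(to: message).content
        }
    }
}
```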


Part 3: Prompt Engineering for On-Device Models

The On-Device Model is Smaller — This Changes Everything

The model powering FoundationModels is Apple’s private on-device LLM — not GPT-4, not Claude, not Gemini. It is significantly smaller (estimated ~3B parameters) than frontier cloud models. This is a feature, not a bug — it runs entirely on your device with sub-second latency — but it fundamentally changes how you should write prompts.

Techniques that work reliably on frontier models can actively degrade performance on the on-device model. Treat every prompt engineering heuristic you have learned from cloud models as a starting point to validate, not a rule to apply.


Principle 1: Short, Direct Instructions

Keep instructions under approximately 200 words total. Longer instructions dilute the signal — the model struggles to prioritise which parts matter most and may partially ignore sections buried deep in a long system prompt.

Every sentence in your instructions should earn its place. If you can remove a sentence without changing the model’s behaviour, remove it.

// WEAK — verbose, repetitive
let session = LanguageModelSession {
    "You are a helpful assistant specialising in Brazilian Jiu-Jitsu."
    "Your primary purpose is to help users with BJJ-related queries."
    "When you see text from speech recognition, carefully examine it."
    "Your goal is to correct any speech recognition errors in the text."
    "Please make sure to handle common BJJ terminology correctly."
}

// STRONG — dense, direct
let session = LanguageModelSession {
    "Fix speech-to-text errors in BJJ transcripts."
    "Correct misrecognised terms. Return only the corrected text."
}

Principle 2: Explicit Corrections Beat Implied Inference

If you have known domain-specific misrecognitions or corrections, list them explicitly. Do not rely on the model inferring what “fix BJJ terms” means — it may not know the canonical spellings for niche vocabulary.

// WEAK — relies on the model knowing BJJ terminology
"Fix any incorrectly transcribed Brazilian Jiu-Jitsu terminology."

// STRONG — explicit correction table
let session = LanguageModelSession {
    "Fix speech-to-text errors in BJJ transcripts."
    "Common misrecognitions: kimora/kimura -> Kimura, half card/half god -> Half Guard,"
    "darce/dart -> D'Arce, rnc/arnc -> Rear Naked Choke, omoa plata -> Omoplata."
}

The on-device model does not have the deep BJJ domain knowledge that a frontier model trained on vast internet corpora might have. Make your domain knowledge explicit in the prompt rather than hoping the model already knows it.


Principle 3: Include a Domain Vocabulary in Instructions

For niche domains — BJJ, medicine, legal, finance, specialised engineering — include a vocabulary list or canonical term glossary in your instructions. This gives the model the reference it needs to make correct corrections or use correct terminology in its output.

let session = LanguageModelSession {
    "You are a BJJ transcript corrector."
    "Canonical terms: Guard, Half Guard, Mount, Back Mount, Side Control,"
    "North-South, Turtle, Closed Guard, Open Guard, De La Riva, X-Guard,"
    "Kimura, Armbar, Triangle, Rear Naked Choke, D'Arce, Anaconda,"
    "Omoplata, Heel Hook, Kneebar, Toe Hold."
    "Correct misrecognised terms to their canonical forms."
}

This is more token-efficient than hoping for inference, and significantly more reliable.


Principle 4: One Task Per Session

Do not ask the model to perform multiple distinct tasks in one session. Correction AND summarisation AND extraction in a single prompt will produce worse results on the on-device model than running them as separate sessions.

// WEAK — three tasks in one call
let response = try await session.respond(to: Prompt {
    "Correct BJJ terms, summarise the session, and extract techniques used."
    rawText
})

// STRONG — one focused task per session
let corrected = try await correctSession.respond(to: Prompt { rawText })
let summary = try await summarySession.respond(to: Prompt { corrected.content })
let techniques = try await extractSession.respond(to: Prompt { corrected.content })

The overhead of running multiple sessions is minimal compared to the reliability gain from focused, single-task prompts.


Principle 5: Avoid Chain-of-Thought Prompting

"Think step by step", "Let’s reason through this", and similar chain-of-thought prompts improve performance on large models but add noise on smaller on-device models. The model produces reasoning tokens that consume context budget without materially improving the final answer — and can sometimes cause the model to talk itself into a worse answer.

Do not use CoT prompting for on-device tasks. Give direct instructions and ask for direct output.

// WEAK — chain-of-thought on a small model
"Think step by step about what BJJ terms might have been misrecognised, then correct them."

// STRONG — direct instruction
"Correct misrecognised BJJ terms. Return only the corrected text."

Frontier Model vs On-Device: Comparison

Technique | Frontier Model | On-Device Model
Chain-of-thought prompting | Works well ✅ | Degrades performance ❌
Long, elaborate instructions | Fine ✅ | Unreliable ⚠️
Implicit domain inference | Often works ✅ | Unreliable for niche domains ⚠️
Explicit correction lists | Helpful ✅ | Critical ✅✅
Multi-task instructions | Usually works ✅ | Fails ❌
Short, direct instructions | Works ✅ | Works best ✅✅
CoT / "think step by step" | Major boost ✅ | Noise and overhead ❌
Few-shot examples in prompt | Works ✅ | Works, watch token budget ⚠️

The #Playground Macro — Fast Prompt Iteration

Available from iOS 26.4+ (February 2026 Foundation Models update), the #Playground macro lets you iterate on prompts directly in Xcode without building and running the full app. Write a #Playground block in a Swift file, run it from the Xcode canvas, and see the response inline.

When you run the canvas, the output shows Input Token Count and Response Token Count separately — useful for understanding your prompt’s cost against the ~4,096 token context window estimate shown in canvas.

import FoundationModels

#Playground {
    let session = LanguageModelSession {
        "Fix BJJ transcript errors."
        "kimora -> Kimura, half card -> Half Guard, darce -> D'Arce"
    }
    let response = try await session.respond(
        to: "worked kimora from half card today, finished with darce"
    )
    response.content  // displayed in Xcode canvas
}

This is the fastest feedback loop for prompt engineering. Iterate on your instructions in the playground before wiring them into the app. Test with the exact on-device model, not a frontier proxy — behaviour differs significantly, and a prompt that works on GPT-4 may not work well on the Apple on-device model.


Part 4: Guided Generation (@Generable)

What @Generable Does

@Generable is an attached macro that synthesises Generable protocol conformance on a struct or enum. At compile time it does three things:

  1. Generates a PartiallyGenerated associated type — a mirror of the struct where every stored property is Optional. This is the type you receive when iterating a stream mid-generation.
  2. Infers a JSON schema from the struct's property types and any @Guide annotations. That schema drives constrained sampling, which guarantees the output is always structurally valid — no parsing, no runtime crashes from malformed responses.
  3. Synthesises ConvertibleFromGeneratedContent and ConvertibleToGeneratedContent conformances, which handle encoding and decoding between the model's internal representation and your Swift type.

The model generates properties in the order they are declared, so put properties that should influence later ones first.
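A sketch of exploiting that ordering (the type and property names are mine): placing a short reasoning property before the rating means the justification text is already in context when the model picks the number.

```swift
import FoundationModels

@Generable
struct JudgedReview {
    // Generated first — the model commits to a justification...
    @Guide(description: "One-sentence justification for the rating")
    var reasoning: String

    // ...which is then in context when this value is sampled
    @Guide(description: "Star rating consistent with the reasoning", .range(1...5))
    var rating: Int
}
```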

Basic Usage

@Generable
struct BookReview {
    var title: String
    var rating: Int
    var summary: String
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Review this book: \(bookTitle)",
    generating: BookReview.self
)
let review = response.content  // BookReview — fully populated, no parsing needed

@Guide — Descriptions

@Guide(description:) tells the model what a property means. Include descriptions for any property where the name alone is ambiguous. Keep them concise — long descriptions consume context and add latency.

@Generable
struct NormalisedTranscript {
    @Guide(description: "The full transcript with BJJ terms corrected and properly cased")
    var normalisedText: String

    @Guide(description: "BJJ terms found in the transcript, each in canonical form e.g. 'Kimura', 'Half Guard'")
    var extractedTerms: [String]
}

You can also annotate the struct itself via @Generable(description:):

@Generable(description: "A classified support ticket with priority and routing metadata")
struct TicketClassification {
    @Guide(description: "Urgency level for routing decisions")
    var priority: Int
}

@Guide — Constraints with GenerationGuide

@Guide also accepts one or more GenerationGuide<T> values to enforce numeric bounds and array sizes. All bounds are inclusive.

@Generable
struct ProductReview {
    @Guide(description: "Star rating", .range(1...5))
    var rating: Int

    @Guide(description: "Key selling points, at most three", .maximumCount(3))
    var keyPoints: [String]

    @Guide(description: "Topics addressed, at least one", .minimumCount(1))
    var topics: [String]

    @Guide(description: "Quality score", .minimum(0), .maximum(100))
    var qualityScore: Double
}

Available GenerationGuide constraints:

Constraint | Applies To | Behaviour
.range(n...m) | Numeric types | Value must fall within the closed range (inclusive both ends)
.minimum(n) | Numeric types | Value must be ≥ n
.maximum(n) | Numeric types | Value must be ≤ n
.minimumCount(n) | [T] arrays | Array must contain ≥ n elements
.maximumCount(n) | [T] arrays | Array must contain ≤ n elements

Multiple guides can be combined on a single property as variadic arguments — .minimum(0), .maximum(100) is valid.

Enums as @Generable Types

Mark enums with @Generable to use them as property types inside other @Generable structs. The constrained sampler restricts output to valid case names only:

@Generable
enum Sentiment {
    case positive
    case neutral
    case negative
}

@Generable
struct MessageClassification {
    @Guide(description: "Overall tone of the message")
    var sentiment: Sentiment

    @Guide(description: "Urgency, 1 = routine, 5 = escalate immediately", .range(1...5))
    var urgency: Int
}

Enums with associated values are also supported — the @Generable macro ensures all associated and nested values are themselves generable.
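A sketch of an associated-value case (domain types are illustrative): the payload types must themselves be generable, which String already is.

```swift
import FoundationModels

@Generable
struct Submission {
    var technique: String
    var finish: Finish
}

@Generable
enum Finish {
    case tap
    case timeExpired
    // Associated value — String conforms, so this is generable too
    case injury(note: String)
}
```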

PartiallyGenerated — Streaming Snapshots

Every @Generable type gets a synthesised PartiallyGenerated associated type. It is a version of the struct where all stored properties are Optional, representing work-in-progress output during streaming:

for try await snapshot in session.streamResponse(
    to: "Review: \(bookTitle)",
    generating: BookReview.self
) {
    let partial = snapshot.content  // BookReview.PartiallyGenerated
    // partial.title might be "The G..." while still generating
    // partial.rating is nil until the model has written that property
    if let title = partial.title {
        titleLabel.text = title
    }
}
// After the loop completes, collect() gives a Response<BookReview> with all properties set

PartiallyGenerated is a streaming-only concern. When you call respond() (non-streaming), you receive the completed Content type directly — no optionals, no partial states to handle.

GeneratedContent — Untyped Escape Hatch

GeneratedContent is the framework's internal structured representation of model output. You normally never interact with it — @Generable handles encoding and decoding automatically.

When you need raw access, every Response exposes:

let response = try await session.respond(to: prompt, generating: BookReview.self)
response.content     // BookReview — your typed result
response.rawContent  // GeneratedContent — the underlying parsed value

rawContent is useful for debugging when model output does not match your type. You can inspect it to see exactly what the model produced before your ConvertibleFromGeneratedContent init ran.

For fully dynamic schemas (where the type is not known at compile time), use respond(to:schema:) with a GenerationSchema built from DynamicGenerationSchema. The response will have Content == GeneratedContent, and you decode manually via value(_:forProperty:):

let response = try await session.respond(to: prompt, schema: schema)
let soup: String = try response.content.value(forProperty: "dailySoup")
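Building the schema used above might look like the following sketch. The initialiser shapes follow the DynamicGenerationSchema API as presented at WWDC 2025; verify the exact parameter labels against the SDK you build with.

```swift
import FoundationModels

// Runtime-defined schema: one string property named "dailySoup"
let menuSchema = DynamicGenerationSchema(
    name: "Menu",
    properties: [
        DynamicGenerationSchema.Property(
            name: "dailySoup",
            schema: DynamicGenerationSchema(type: String.self)
        )
    ]
)

// Compile the dynamic schema into a GenerationSchema for respond(to:schema:)
let schema = try GenerationSchema(root: menuSchema, dependencies: [])
```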

Independent Constructability — Critical for Testing

@Generable types must be constructable via their memberwise initialiser without running the model. This is the property that makes them unit-testable:

// Your output type
@Generable
struct NormalisedTranscript {
    @Guide(description: "Corrected transcript text")
    var normalisedText: String

    @Guide(description: "Extracted BJJ terms in canonical form")
    var extractedTerms: [String]
}

// Tests run on any machine — no Apple Intelligence required
func testNormalisationOutputType() {
    let result = NormalisedTranscript(
        normalisedText: "Worked Kimura from Half Guard",
        extractedTerms: ["Kimura", "Half Guard"]
    )
    #expect(result.normalisedText.contains("Kimura"))
    #expect(result.extractedTerms.count == 2)
}

If your @Generable type has custom initialisers that depend on model output, or computed properties with side effects, you have broken this contract. Keep output types as plain data containers — structs with stored properties and no embedded behaviour.

Protocol Hierarchy

You rarely interact with these directly — @Generable wires everything up — but understanding the hierarchy helps when debugging conformance errors or writing manual implementations:

Protocol | Role
Generable | Synthesised by @Generable. Requires PartiallyGenerated associated type, generationSchema, and ConvertibleFromGeneratedContent init. Inherits from both Convertible* protocols.
ConvertibleFromGeneratedContent | Types constructable from model output. Int, String, Bool, Float, Double, Decimal, Array, enums, and @Generable structs all conform automatically.
ConvertibleToGeneratedContent | Types that can be serialised back to GeneratedContent. Used for tool output and prompt injection. Inherits from PromptRepresentable.
PromptRepresentable | Types that can appear inside a @PromptBuilder closure. @Generable types conform, so you can pass model output directly back as prompt input in a subsequent call.
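That last row is the basis for chaining: a typed result from one session can sit directly inside the next session's prompt builder. A sketch (MatchSummary, both sessions, and the instructions are illustrative):

```swift
import FoundationModels

@Generable
struct MatchSummary {
    var opponentStyle: String
    var techniquesUsed: [String]
}

func adviseOn(_ rawNotes: String) async throws -> String {
    let summarySession = LanguageModelSession { "Summarise BJJ sparring notes." }
    let coachSession = LanguageModelSession { "You are a BJJ coach." }

    // Step 1: typed extraction
    let summary = try await summarySession.respond(
        to: Prompt { rawNotes },
        generating: MatchSummary.self
    ).content

    // Step 2: summary is @Generable, hence PromptRepresentable —
    // it can appear directly in the @PromptBuilder closure
    let advice = try await coachSession.respond(to: Prompt {
        "Suggest one improvement based on this summary:"
        summary
    })
    return advice.content
}
```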

Part 5: Streaming

The Core Decision: Stream or Not?

Use Case | Method | Reason
Live text appearing for the user (typing effect) | streamResponse() | User sees progress, engagement increases
Processing output programmatically | respond() | Simpler — no partial state handling
Background pipeline (normalisation, extraction) | respond() | No UI benefit; streaming increases rate-limit risk in background
Long-form generation the user is watching | streamResponse() | Progress feedback reduces perceived latency
Structured @Generable output | respond() preferred | Partial structs with all-Optional properties add complexity for no gain

Apple's own docs note that background tasks should use the non-streaming respond() to reduce the likelihood of encountering GenerationError.rateLimited errors.

String Streaming

let stream = session.streamResponse(to: "Summarise: \(text)")

for try await snapshot in stream {
    let partial: String = snapshot.content  // String grows with each chunk
    await MainActor.run {
        self.displayText = partial
    }
}

// Or skip the loop entirely and just collect the final result
let fullResponse = try await stream.collect()
let finalText = fullResponse.content  // String — complete

Typed (@Generable) Streaming

let stream = session.streamResponse(
    to: "Review: \(text)",
    generating: BookReview.self
)

for try await snapshot in stream {
    let partial = snapshot.content  // BookReview.PartiallyGenerated
    // All properties are Optional — may be nil while the model generates earlier properties
    if let title = partial.title {
        titleLabel.text = title
    }
    if let rating = partial.rating {
        updateStars(rating)
    }
}

// Collect to receive the complete, fully-typed result
let response = try await stream.collect()
let review = response.content  // BookReview — all properties non-nil

ResponseStream<Content>

streamResponse() returns a ResponseStream<Content>, an AsyncSequence whose elements are partial-output snapshots. The Content type parameter matches what the equivalent respond() call would return.

// Type relationships
session.streamResponse(to: prompt)
// → ResponseStream<String>

session.streamResponse(to: prompt, generating: BookReview.self)
// → ResponseStream<BookReview>

// Each snapshot during the stream:
snapshot.content
// → String (for string stream)
// → BookReview.PartiallyGenerated (for typed stream — all properties Optional)

// After .collect():
response.content
// → String (complete)
// → BookReview (complete, all properties set)

ResponseStream<Content> conforms to AsyncSequence, so you get the full suite of async sequence operators — map, filter, prefix, etc.
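As a sketch of what the AsyncSequence conformance buys you (assuming a string stream and a session as above), standard operators compose directly with the snapshot sequence:

```swift
let stream = session.streamResponse(to: "Summarise: \(text)")

// Project each snapshot to its text and stop after the first ten chunks
for try await partial in stream.prefix(10).map({ $0.content }) {
    print(partial)
}
```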

Progressive UI Update Pattern

The natural pattern for SwiftUI is to assign each snapshot directly to a @State property:

@Observable
final class SummaryViewModel {
    var generatedText = ""

    func generate(prompt: Prompt) async throws {
        let stream = session.streamResponse(to: prompt)
        for try await snapshot in stream {
            generatedText = snapshot.content  // @Observable triggers view update per chunk
        }
    }
}

// In the view:
Text(viewModel.generatedText)
    .animation(.default, value: viewModel.generatedText)

For @Generable types, update individual UI elements as their backing properties become available:

for try await snapshot in stream {
    let partial = snapshot.content  // BookReview.PartiallyGenerated
    titleLabel.text = partial.title ?? titleLabel.text  // retain last known value
    summaryLabel.text = partial.summary ?? summaryLabel.text
}

collect() — Streams to Full Response

collect() is an async method on ResponseStream that waits for the stream to finish and returns a complete Response<Content>:

let stream = session.streamResponse(to: prompt)

// Option A: observe snapshots AND get the final result
for try await snapshot in stream {
    updateProgressUI(snapshot.content)
}
// Stream is exhausted — collect() returns immediately since the stream is done
let finalResponse = try await stream.collect()

// Option B: skip observation, just get the final result
let finalResponse = try await stream.collect()

If the stream finished with an error before collect() is called, collect() propagates that error. If the stream completed successfully, collect() returns immediately with the cached result.

Error Handling in Streams

Errors are thrown during iteration, not at stream creation (the stream object itself is always returned, even if the model will fail):

do {
    for try await snapshot in stream {
        // process snapshot
    }
} catch LanguageModelSession.GenerationError.rateLimited {
    // system under load: back off and retry with increasing delay
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // prompt + history too long — trim the input
} catch LanguageModelSession.GenerationError.guardrailViolation {
    // content flagged — show alternative UX
} catch {
    // unexpected error
}

The same error types apply to streamResponse() as to respond(); the only difference is where they surface: during iteration of the stream rather than at a single await point.


Part 6: Generation Options

GenerationOptions is a struct you pass to respond() or streamResponse() to control how the model generates output. All properties are optional — omitting them leaves the model at its defaults, which are usually correct.

let options = GenerationOptions(
    temperature: 0.1,
    maximumResponseTokens: 200
)

let response = try await session.respond(to: prompt, options: options)

temperature

Temperature controls how "creative" or "random" the model's output is, on a scale from 0.0 to 1.0. nil (the default) lets the model use its own calibrated default, which is appropriate for most tasks.

| Temperature | Behaviour | Best For |
| --- | --- | --- |
| nil | Model default (typically ~0.7) | General use — let the model decide |
| 0.0–0.2 | Near-deterministic, consistent | Corrections, extraction, classification |
| 0.3–0.6 | Balanced | Summarisation, analysis |
| 0.7–1.0 | Creative, varied | Brainstorming, dialogue, story generation |

The most common mistake: setting a high temperature for a correction or extraction task. If you are normalising speech-to-text errors using @Generable, you want the model to produce the same correct answer every time — not a creatively varied interpretation. Use nil or set it low.

// ❌ High temperature for a structured correction task — produces inconsistent output
let options = GenerationOptions(temperature: 0.9)
let response = try await session.respond(
    to: Prompt { rawTranscript },
    generating: NormalisedTranscript.self,
    options: options
)

// ✅ Low temperature — deterministic, reliable corrections
let options = GenerationOptions(temperature: 0.1)
// or just omit options entirely — the schema constraints already reduce variance

For @Generable output, the constrained sampler that enforces your schema also reduces variance regardless of temperature. But setting temperature low is still good practice to signal your intent and produce maximally consistent output.

GenerationOptions.SamplingMode

SamplingMode gives you control over the underlying sampling algorithm. The two modes are:

.greedy — always selects the single most probable token at each step. Maximally deterministic. Best for tasks with one correct answer (grammar correction, structured extraction).

.random(temperature:) — samples from the probability distribution, with temperature scaling how broadly. This is the mode behind the temperature parameter.

// Explicit greedy sampling — maximum determinism
let options = GenerationOptions(
    sampling: .greedy
)

// Random sampling at a specific temperature
let options = GenerationOptions(
    sampling: .random(temperature: 0.7)
)

The temperature property on GenerationOptions is a convenience shorthand for .random(temperature:). Setting temperature: 0.0 is equivalent to .greedy.

maximumResponseTokens

Sets an upper bound on how many tokens the model can generate in its response. Useful for:

  • Capping latency (the on-device analogue of API cost) when you know responses should be short
  • Preventing runaway generation in summary tasks where you want concise output
  • Enforcing length constraints the instructions alone can't reliably enforce

// Limit to a short summary (~100 tokens ≈ ~75 words)
let options = GenerationOptions(maximumResponseTokens: 100)

let response = try await session.respond(
    to: "Summarise this training session in one paragraph: \(notes)",
    options: options
)

Be careful not to set maximumResponseTokens too low for @Generable types: if the model runs out of tokens before completing your struct, the call fails with a GenerationError rather than returning a partial value.


Part 7: Tool Calling

Tools let the model call back into your Swift code to fetch data or perform actions during generation. The model autonomously decides whether and when to call a tool — you provide the definitions; it decides whether they are relevant to the current prompt.

The Tool Protocol

Conform to Tool to define a callable function the model can invoke:

@available(iOS 26, *)
struct CurrentDateTool: Tool {
    let name = "getCurrentDate"
    let description = "Returns today's date in ISO 8601 format (YYYY-MM-DD)."

    // Arguments the model will pass — a @Generable struct
    @Generable
    struct Arguments {
        @Guide(description: "Optional timezone identifier, e.g. 'Europe/Dublin'")
        var timezone: String?
    }

    // Return type — any PromptRepresentable (String is simplest)
    func call(arguments: Arguments) async -> String {
        let formatter = ISO8601DateFormatter()
        if let tz = arguments.timezone,
           let zone = TimeZone(identifier: tz) {
            formatter.timeZone = zone
        }
        return formatter.string(from: Date())
    }
}

Key constraints:

  • Arguments must conform to ConvertibleFromGeneratedContent. A @Generable struct is the standard approach — the macro handles conformance automatically.
  • Output (the return type) must conform to PromptRepresentable. String always works. @Generable types also work.
  • call(arguments:) runs off the main actor. The protocol requirement is async, so declare it async when you need to await other work; a synchronous implementation also satisfies it.

Registering Tools With a Session

Pass tools in the tools parameter when creating a session:

@available(iOS 26, *)
let session = LanguageModelSession(
    tools: [CurrentDateTool(), UserProfileTool()]
) {
    "You are a task scheduling assistant."
    "Use getCurrentDate to determine today's date before scheduling."
}

let response = try await session.respond(
    to: "Schedule a reminder for two weeks from today"
)
let text = response.content  // model called getCurrentDate internally

The model receives each tool's name, description, and the JSON schema derived from Arguments. It uses the name and description to decide whether calling the tool is relevant to the prompt. Name and description are the primary signals — write them as short, specific phrases.

How the Model Decides to Call Tools

You cannot force the model to call a specific tool. It decides autonomously based on:

  1. Whether the tool's name and description match the intent of the prompt
  2. Whether it already has the information it needs without a tool call
  3. Whether the prompt semantically requires external data

The model may call zero tools (if it can answer from its knowledge), call one tool, or call multiple tools before producing its final response.

Critical Performance Insight: Pre-Fetch vs Tool

This is Apple's own guidance from the documentation, and it matters for performance:

If you ALWAYS need data from a source, inject it directly into instructions rather than defining a tool.

// ❌ Tool for data you always need — adds latency on every call
struct UserPreferencesTool: Tool { ... }

// ✅ Pre-fetch and inject — one fetch, zero tool overhead
let preferences = await loadUserPreferences()
let session = LanguageModelSession {
    "User preferences: \(preferences.serialised)"
    "Use these preferences when making recommendations."
}

Tools have two costs:

  1. Token cost — each tool definition (name + description + arguments schema) consumes context budget. A tool with a complex Arguments struct can cost 50–100 tokens just for its definition.
  2. Latency cost — each tool call is a model inference round-trip: the model generates a call, your code runs, the result is injected back, the model continues. This adds meaningful latency.

Reserve tools for data that is conditionally needed — data you might need depending on what the user asks.

Context Window Cost

Define tools concisely. The model sees name + description + arguments schema for every tool, every call, whether it uses them or not.

// ❌ Verbose tool definition — each call consumes more context
struct FetchUserTrainingHistoryForTheLastSixMonthsTool: Tool {
    let name = "fetchUserTrainingHistoryForTheLastSixMonths"
    let description = "This tool fetches the complete training history of the current user for the past six calendar months, including all session notes, techniques practised, and time spent..."
    // ...
}

// ✅ Concise — same capability, fraction of the tokens
struct TrainingHistoryTool: Tool {
    let name = "getTrainingHistory"
    let description = "Returns recent training sessions with notes and techniques."
    // ...
}

A practical limit is 3–5 tools per session. Beyond that, the definitions alone consume a significant portion of context, leaving less room for the actual conversation.

Tool Calls in the Transcript

When the model calls a tool, it appears in the session's Transcript as two entries:

  • Transcript.Entry.toolCalls — the model's request(s) to call tools
  • Transcript.Entry.toolOutput — the results that were injected back

This is useful when debugging why the model produced a particular response — you can inspect the transcript to see exactly what tool calls were made and what data the model received. See Part 9 (The Transcript) for full Transcript coverage.


Part 8: Token Budget

The on-device model has a fixed context window shared by all inputs and outputs for a session. Understanding how that budget is consumed is essential for building reliable features — especially multi-turn conversations and tool-using sessions.

The Budget Breakdown

Every token in a session competes for the same fixed window:

Total Context Window
├── Instructions (system prompt)
├── Tool definitions (name + description + args schema × number of tools)
├── Transcript history (all previous turns)
├── Current prompt
└── Response (tokens generated)

Response tokens are not free — they come out of the same pool as input. A long system prompt and a long conversation history leave less room for both the current prompt and its response.
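The arithmetic is simple but worth making explicit. Using the illustrative token counts from the measurement examples in this part (180 for instructions plus tools, 620 of history, 45 for the current prompt) against a hypothetical 4,096-token window:

```swift
// Illustrative headroom calculation; all counts are example values
let contextWindow = 4_096
let instructionsAndTools = 180
let history = 620
let currentPrompt = 45

let headroomForResponse = contextWindow - instructionsAndTools - history - currentPrompt
print(headroomForResponse)  // 3251 tokens left for the response
```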

Measuring Token Usage

SystemLanguageModel exposes three tokenUsage(for:) overloads (added February 2026):

let model = SystemLanguageModel.default

// 1. Cost of Instructions + tool definitions
let instrUsage = try await model.tokenUsage(
    for: instructions,
    tools: [MyTool()]
)
print(instrUsage.tokenCount)  // e.g. 180

// 2. Cost of a single Prompt
let promptUsage = try await model.tokenUsage(for: prompt)
print(promptUsage.tokenCount)  // e.g. 45

// 3. Cost of a saved Transcript (conversation history)
let historyUsage = try await model.tokenUsage(for: transcript.entries)
print(historyUsage.tokenCount)  // e.g. 620

All three return SystemLanguageModel.TokenUsage, with a single tokenCount: Int property. Use these to profile your sessions during development rather than guessing.

The contextSize Property

SystemLanguageModel.contextSize is an async property that returns the total context window size in tokens. It is back-deployed to earlier OS versions via @backDeployed:

let totalWindow = await SystemLanguageModel.default.contextSize
// e.g. 4096

let available = totalWindow - instrUsage.tokenCount - historyUsage.tokenCount
print("Available for prompt + response: \(available) tokens")

Use contextSize to compute headroom before sending a prompt, particularly in multi-turn sessions where history accumulates.

GenerationError.exceededContextWindowSize

This error is thrown when the combined input (instructions + tools + history + prompt) exceeds the context window. Handle it gracefully:

do {
    let response = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // Strategies:
    // 1. Summarise the conversation history and start a new session
    // 2. Trim the oldest transcript entries
    // 3. Remove tool definitions you don't strictly need
    // 4. Shorten the prompt
}

For multi-turn sessions, the most robust strategy is to detect when history is growing long and summarise it before continuing:

// When history exceeds a threshold, compress it
if historyTokenCount > contextSize / 2 {
    let summary = try await summariseHistory(session.transcript)
    // Start fresh session with summary in instructions
    session = LanguageModelSession {
        "Previous conversation summary: \(summary)"
    }
}

The #Playground Macro for Budget Profiling

The #Playground macro in Xcode (26.4+) shows Input Token Count and Response Token Count separately in the canvas after each run. This is the fastest way to profile token usage during development — no logging, no instrumentation, just iterate on the prompt and watch the counts update in real time.

Rules of Thumb

| Content | Approximate Token Cost |
| --- | --- |
| 1 word | ~1.3 tokens |
| 100 words | ~130 tokens |
| 1 page (250 words) | ~325 tokens |
| Simple @Generable struct (2 props) | ~50 tokens overhead |
| Tool definition (name + description + args) | ~50–100 tokens |
| Default context window | ~4,096 tokens |

A 4k window sounds large but fills up quickly in multi-turn sessions with tool-heavy prompts.
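These ratios can be folded into a rough pre-flight estimator. This is a heuristic sketch only, built on the ~1.3 tokens-per-word figure above; use the tokenUsage(for:) overloads from earlier in this part for real counts:

```swift
// Heuristic: ~1.3 tokens per whitespace-separated word, rounded up
func estimatedTokenCount(for text: String) -> Int {
    let wordCount = text.split(whereSeparator: \.isWhitespace).count
    return Int((Double(wordCount) * 1.3).rounded(.up))
}

estimatedTokenCount(for: "worked on the Kimura from Half Guard today")  // 8 words -> 11
```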


Part 9: The Transcript

Transcript is the linear record of everything that has happened in a LanguageModelSession. Every turn adds entries. The transcript is how the model "remembers" previous exchanges in a multi-turn conversation.

Transcript.Entry

The transcript is an array of Transcript.Entry values. Each entry is one of five cases:

| Entry | When It Appears |
| --- | --- |
| .instructions(Transcript.Instructions) | Session creation — the system prompt |
| .prompt(Transcript.Prompt) | Each time you call respond() or streamResponse() |
| .response(Transcript.Response) | Each model reply |
| .toolCalls(Transcript.ToolCalls) | When the model decides to invoke one or more tools |
| .toolOutput(Transcript.ToolOutput) | The result(s) returned from your tool's call() |

A simple two-turn conversation produces this entry sequence:

.instructions  ← session setup
.prompt        ← "What's the best sweep from Half Guard?"
.response      ← "The Hip Bump Sweep is..."
.prompt        ← "How do I set it up?"
.response      ← "Start by flattening your opponent..."

A tool-calling exchange adds two extra entries per tool call:

.prompt        ← "What techniques did I drill last Tuesday?"
.toolCalls     ← [getTrainingHistory(date: "2026-02-24")]
.toolOutput    ← [{ sessions: [...] }]
.response      ← "Last Tuesday you drilled..."

Reading the Transcript

Access the current session transcript via session.transcript:

let session = LanguageModelSession { "You are a BJJ coach." }
_ = try await session.respond(to: "What is the Kimura?")
_ = try await session.respond(to: "How do I finish it from Guard?")

// Inspect the transcript
for entry in session.transcript.entries {
    switch entry {
    case .prompt(let p):
        print("User: \(p.segments.map(\.description).joined())")
    case .response(let r):
        print("Model: \(r.segments.map(\.description).joined())")
    default:
        break
    }
}

Saving and Resuming Sessions

Save the transcript to persist a conversation and resume it later — useful for a coaching assistant where the user expects the model to remember what they discussed in previous sessions:

// Save
let savedTranscript = session.transcript
// Persist to SwiftData, UserDefaults, or disk...

// Resume — new session with full history
let resumedSession = LanguageModelSession(
    model: SystemLanguageModel.default,
    tools: [],
    transcript: savedTranscript
)
// Model now has full context of the previous conversation
let response = try await resumedSession.respond(to: "Where were we?")

The resumed session is identical in behaviour to a session that never stopped — the model sees the full entry history.
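Persistence itself is straightforward. The sketch below assumes Transcript's Codable conformance and round-trips it through JSON on disk; adapt the storage layer (SwiftData, files) to your app:

```swift
import Foundation
import FoundationModels

// Save the transcript as JSON (assumes Transcript: Codable)
func saveTranscript(_ transcript: Transcript, to url: URL) throws {
    let data = try JSONEncoder().encode(transcript)
    try data.write(to: url, options: .atomic)
}

// Load it back for a resumed session
func loadTranscript(from url: URL) throws -> Transcript {
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode(Transcript.self, from: data)
}
```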

When to Use the Transcript

Use transcript accumulation when:

  • The model needs to refer back to something the user said earlier ("as I mentioned before...")
  • You are building a multi-turn chatbot or coaching assistant
  • Continuity across app sessions is a user-facing feature

Do NOT accumulate transcripts when:

  • Each call is independent (normalisation, extraction, summarisation, classification)
  • You are using session-per-call — there is no transcript to worry about
  • The task is stateless — the model does not need to "remember" anything

Unnecessary transcript accumulation wastes context budget and eventually causes GenerationError.exceededContextWindowSize. Most FoundationModels use cases do not need cross-turn memory — use session-per-call by default (see Part 2).


Part 10: Failure Modes & Graceful Degradation

FoundationModels can fail in ways that are different from a typical network API. Most failures are environmental (device eligibility, model state, system load) rather than logic errors. The right response in almost every case is graceful degradation, not throwing errors up to the UI.

GenerationError Cases

LanguageModelSession.GenerationError is thrown from respond() and streamResponse():

.exceededContextWindowSize

The combined input (instructions + tools + history + prompt) exceeded the context window. Solutions in order of preference:

  1. Reduce the prompt — summarise or truncate the input text
  2. Trim the oldest transcript entries in a multi-turn session
  3. Remove tool definitions that aren't needed for this call
  4. Split into multiple sessions

.rateLimited

The system is under load. The on-device model is a shared resource — all apps use the same model, and the OS rate-limits when demand is high. Handle with simple exponential backoff:

func generateWithRetry(session: LanguageModelSession, prompt: Prompt) async throws -> String {
    var delay: UInt64 = 1_000_000_000  // 1 second
    for attempt in 1...3 {
        do {
            return try await session.respond(to: prompt).content
        } catch let error as LanguageModelSession.GenerationError {
            // Retry only rate limiting, and only while attempts remain
            guard case .rateLimited = error, attempt < 3 else { throw error }
            try await Task.sleep(nanoseconds: delay)
            delay *= 2  // exponential backoff: 1s, then 2s
        }
    }
    preconditionFailure("unreachable: the final attempt either returns or throws")
}

.guardrailViolation

The content triggered safety filtering. This can happen on the prompt (the input was flagged) or on the response (the model started generating something that triggered the filter). The error contains context on what was flagged.

.unsupportedGuide

A @Guide constraint on a @Generable type is not supported for the current model or OS version. This should not occur in production if your deployment target is correct, but handle it defensively.

LanguageModelSession.GenerationError.Refusal

When the model declines to answer a prompt, respond() throws the GenerationError.refusal case. It is special because the Refusal value it carries includes an explanation:

do {
    let response = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.refusal(let refusal, _) {
    // Get the explanation as a complete Response<String>
    let explanation = try await refusal.explanation
    print(explanation.content)  // "I can't help with that because..."

    // Or stream it
    for try await snapshot in refusal.explanationStream {
        print(snapshot.content)
    }
}

Production Pattern: The Never-Throws Service

The cleanest production pattern is a service method that never throws — it returns the raw input unchanged on any failure. Callers have zero error handling burden, and worst case equals current pre-AI behaviour:

@available(iOS 26, *)
final class TranscriptNormalisationService {
    func normalise(_ rawTranscript: String) async -> String {
        guard !rawTranscript.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
            return rawTranscript
        }

        do {
            let session = LanguageModelSession {
                "Fix speech-to-text errors in BJJ transcripts."
                "Return only the corrected text."
            }
            let response = try await session.respond(
                to: Prompt { rawTranscript },
                generating: NormalisedTranscript.self
            )
            return response.content.normalisedText
        } catch {
            // Log the error, return the raw transcript unchanged
            GraplaLogger.data.error("Normalisation failed: \(error)")
            return rawTranscript
        }
    }
}

This pattern means:

  • The caller always gets a String back — no try/catch required
  • If AI is unavailable, the app works exactly as before
  • Errors are logged for debugging without surfacing to the user

Additional Production Patterns

Cache availability at setup, not per-call. SystemLanguageModel.default.availability has non-trivial overhead. Check it once when the view or service initialises and store the result. Availability rarely changes mid-session; re-check on foregrounding if you need to react to the user toggling Apple Intelligence.

// ❌ Checking availability on every call
func normalise(_ text: String) async -> String {
    guard SystemLanguageModel.default.isAvailable else { return text }  // overhead each time
    ...
}

// ✅ Check once, cache
final class NormalisationService {
    private let isAvailable = SystemLanguageModel.default.isAvailable

    func normalise(_ text: String) async -> String {
        guard isAvailable else { return text }
        ...
    }
}

The fallback path is production code. On the majority of devices in 2026, Apple Intelligence will not be available (older hardware, non-supported regions, disabled in settings). Your non-AI code path is not a fallback — it is the primary path for most users. Test it as thoroughly as the AI path.

Use AnyObject? for iOS 26 services in SwiftUI views. Covered in Part 1, but worth repeating: avoid @available(iOS 26, *) on @State properties. Use AnyObject? and cast inside #available guards to prevent the constraint propagating to the whole view.


Part 11: Testing

The most important insight for testing FoundationModels code: most of your test suite should never touch the model. Well-structured FoundationModels code is testable at every layer without Apple Intelligence.

The Four Test Categories

1. Output Type Tests (No Model Required)

@Generable structs are plain data containers with memberwise initialisers. You can construct them directly in tests, verify Equatable conformance, and test edge cases without the model ever running:

@Suite("NormalisedTranscript")
struct NormalisedTranscriptTests {
    @Test func construction() {
        let result = NormalisedTranscript(
            normalisedText: "Worked Kimura from Half Guard",
            extractedTerms: ["Kimura", "Half Guard"]
        )
        #expect(result.normalisedText == "Worked Kimura from Half Guard")
        #expect(result.extractedTerms.count == 2)
        #expect(result.extractedTerms.contains("Kimura"))
    }

    @Test func equatable() {
        let a = NormalisedTranscript(normalisedText: "Test", extractedTerms: [])
        let b = NormalisedTranscript(normalisedText: "Test", extractedTerms: [])
        #expect(a == b)
    }

    @Test func emptyTerms() {
        let result = NormalisedTranscript(normalisedText: "Some text", extractedTerms: [])
        #expect(result.extractedTerms.isEmpty)
    }
}

These tests run in CI on any machine. No simulator required.

2. Service Fallback Tests (Works on All Simulators)

Test that your service returns the raw input unchanged when the model is unavailable. The simulator never has Apple Intelligence, so this path is always exercised:

@MainActor
@Suite("TranscriptNormalisationService")
struct TranscriptNormalisationServiceTests {
    @Test func emptyTranscriptReturnsEmpty() async {
        guard #available(iOS 26, *) else { return }
        let service = TranscriptNormalisationService()
        let result = await service.normalise("")
        #expect(result.isEmpty)
    }

    @Test func whitespaceOnlyReturnsUnchanged() async {
        guard #available(iOS 26, *) else { return }
        let service = TranscriptNormalisationService()
        let result = await service.normalise("   \n  ")
        #expect(result.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty)
    }

    @Test func unavailableModelReturnsFallback() async {
        guard #available(iOS 26, *) else { return }
        // On simulator, model is unavailable — service must return raw transcript
        let service = TranscriptNormalisationService()
        let raw = "worked on kimora from half card today"
        let result = await service.normalise(raw)
        // On device: could be corrected. On simulator: must equal raw input.
        #expect(!result.isEmpty)  // just verify it doesn't crash
    }
}

3. Availability Tests (Works Everywhere)

Verify your availability checking code runs without crashing. Do not assert the specific availability state — it varies by machine, OS, and whether Apple Intelligence is enabled:

@Test func availabilityCheckDoesNotCrash() {
    guard #available(iOS 26, *) else { return }
    let availability = SystemLanguageModel.default.availability
    // Just verify we get a valid state — don't assert which state
    switch availability {
    case .available:
        break  // fine
    case .unavailable:
        break  // also fine — expected on simulator
    @unknown default:
        break
    }
}

4. On-Device Tests (Manual, .disabled() by Default)

Mark tests that require a real device with Apple Intelligence as .disabled(). They are skipped in CI but can be run manually on a real device:

@Test("Normalises BJJ terms on-device", .disabled("Requires device with Apple Intelligence"))
func normalisesTermsOnDevice() async throws {
    guard #available(iOS 26, *) else { return }
    let service = TranscriptNormalisationService()
    let raw = "rolled today, worked on my kimora from half card"
    let result = await service.normalise(raw)
    // On a real device with AI, these should be corrected
    #expect(result.contains("Kimura") || result.contains("kimura"))
    #expect(!result.contains("kimora"))
}

To run these locally: open the test plan in Xcode, filter by the test name, and run on a connected iPhone 15 Pro or later with Apple Intelligence enabled.

Testing Checklist

| Test | Runs in CI | Requires Apple Intelligence |
| --- | --- | --- |
| @Generable type construction | ✅ | ❌ |
| @Generable equatable | ✅ | ❌ |
| Service empty input handling | ✅ | ❌ |
| Service fallback (model unavailable) | ✅ | ❌ |
| Availability check no-crash | ✅ | ❌ |
| End-to-end normalisation | ❌ (manual) | ✅ |

Aim for 100% automated coverage of everything above the model boundary. The on-device generation itself is integration-tested manually.


Part 12: Example Use Cases

These examples cover the range of tasks FoundationModels handles well. Each follows the same pattern: on-device, private, structured output, graceful fallback.


1. Sports / BJJ App — Domain-Specific Transcript Normalisation

Use case: Correct speech-to-text misrecognitions of BJJ terms before feeding into entity extraction.

Why FoundationModels: A regex can't handle "kimora" → "Kimura" contextually; a cloud API sends private training notes offsite. On-device gets both right.

@Generable
struct NormalisedTranscript {
    @Guide(description: "The transcript with BJJ terms corrected")
    var normalisedText: String

    @Guide(description: "Canonical BJJ terms extracted, e.g. ['Kimura', 'Half Guard']")
    var extractedTerms: [String]
}

Tools needed: None — pure text transformation. Session-per-call.


2. Recipe App — Ingredient Extraction From Voice

Use case: "I need some eggs, any kind of cheese, and that Italian herb" → structured shopping list.

Why FoundationModels: The model handles colloquial descriptions ("that Italian herb" → "basil"), vague quantities ("some"), and variety descriptions ("any kind of cheese") — none of which a regex can parse.

@Generable
struct Ingredient {
    @Guide(description: "Canonical ingredient name, e.g. 'basil', 'eggs'")
    var name: String

    @Guide(description: "Quantity as spoken, e.g. '2', 'some', 'a handful'")
    var quantity: String
}

@Generable
struct IngredientList {
    @Guide(description: "All ingredients mentioned", .minimumCount(1))
    var ingredients: [Ingredient]
}

Tools needed: None. Session-per-call.


3. Journaling App — Private Mood Tagging

Use case: Classify a journal entry's emotional tone without sending text to a cloud service.

Why FoundationModels: Journal entries are deeply personal. On-device is the only acceptable processing option — not a preference, a product requirement.

@Generable
enum PrimaryMood {
    case joyful, content, neutral, anxious, sad, angry, reflective
}

@Generable
struct MoodAnalysis {
    @Guide(description: "The dominant emotion in the entry")
    var primaryMood: PrimaryMood

    @Guide(description: "Intensity, 1 = mild, 5 = intense", .range(1...5))
    var intensity: Int

    @Guide(description: "Key themes, up to three", .maximumCount(3))
    var themes: [String]
}

Tools needed: None. Session-per-call.


4. Task Manager — Natural Language Task Parsing

Use case: "Remind me to call Mum next Tuesday afternoon" → structured task with date components and priority.

Why FoundationModels: Natural language date parsing ("next Tuesday"), intent extraction, and priority inference in a single call.

@Generable
struct ParsedTask {
    @Guide(description: "Clean task title, e.g. 'Call Mum'")
    var title: String

    @Guide(description: "Relative date reference as spoken, e.g. 'next Tuesday afternoon'")
    var dateReference: String

    @Guide(description: "Priority 1 (low) to 3 (high)", .range(1...3))
    var priority: Int
}

Tools needed: CurrentDateTool to anchor relative dates ("next Tuesday" needs to know what today is).
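A minimal sketch of what such a tool could look like — the tool name, description, and output wording here are illustrative assumptions, not framework-provided:

```swift
import Foundation
import FoundationModels

// Hypothetical CurrentDateTool: gives the model an anchor for resolving
// relative dates like "next Tuesday". Name and output format are our choice.
@available(iOS 26, *)
struct CurrentDateTool: Tool {
    let name = "getCurrentDate"
    let description = "Returns today's date and weekday, for resolving relative date references."

    @Generable
    struct Arguments {}

    func call(arguments: Arguments) async -> String {
        let formatter = DateFormatter()
        formatter.dateFormat = "EEEE, d MMMM yyyy"  // e.g. "Tuesday, 3 March 2026"
        return "Today is \(formatter.string(from: Date()))."
    }
}
```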


5. Fitness App — Workout Log Summarisation

Use case: After a training session, summarise a structured workout log into a human-readable weekly review.

Why FoundationModels: Summary generation from structured data into natural prose. Streaming makes it feel responsive.

// No @Generable needed — plain text output, streamed
let stream = session.streamResponse(
    to: "Summarise this week's training in 2 paragraphs: \(workoutLogJSON)"
)
for try await snapshot in stream {
    summaryView.text = snapshot.content  // live update as text generates
}

Tools needed: None. Session-per-call. Use streamResponse() for the typing effect.


6. Developer Tool — Conventional Commit Message Generation

Use case: Given a summary of changed files and diff, generate a conventional commit message.

Why FoundationModels: Requires understanding intent from code changes — beyond simple pattern matching, but doesn't need frontier reasoning. On-device keeps source code private.

@Generable
enum CommitType {
    case feat, fix, chore, docs, refactor, test, perf
}

@Generable
struct CommitMessage {
    @Guide(description: "Conventional commit type")
    var type: CommitType

    @Guide(description: "Affected scope, e.g. 'auth', 'ui', nil if unclear")
    var scope: String?

    @Guide(description: "Imperative subject line, 72 chars max")
    var subject: String

    @Guide(description: "Optional body with context on why this change was made")
    var body: String?
}

Tools needed: None. Session-per-call.


7. Language Learning App — Sentence Correction

Use case: Correct a learner's written sentence while preserving their intended meaning.

Why FoundationModels: Grammar correction requires semantic understanding — the model must know what the learner was trying to say. On-device matters here too: learners write embarrassing mistakes they would prefer not to send to a cloud API.

@Generable
struct CorrectedSentence {
    @Guide(description: "The corrected sentence with natural grammar")
    var correctedText: String

    @Guide(description: "Explanations of corrections made, e.g. ['Changed tense from past to present perfect']")
    var explanations: [String]

    @Guide(description: "Confidence the original meaning was preserved, 1-5", .range(1...5))
    var meaningPreservedConfidence: Int
}

Tools needed: None. Session-per-call.


8. E-Commerce — Product Attribute Extraction

Use case: Extract structured attributes (colour, size, material, style) from free-text product descriptions for catalogue indexing.

Why FoundationModels: Product descriptions are unstructured prose. Structured extraction via @Generable is more robust than regex for the variety of descriptions sellers write.

@Generable
struct ProductAttributes {
    @Guide(description: "Primary colour(s), e.g. ['navy', 'white']")
    var colours: [String]

    @Guide(description: "Material, e.g. 'cotton', 'polyester blend'")
    var material: String?

    @Guide(description: "Style keywords, e.g. ['casual', 'slim-fit']", .maximumCount(5))
    var styleKeywords: [String]
}

Tools needed: Optional ProductCatalogTool to canonicalise values against your taxonomy.


9. Health App — Symptom Log Structuring

Use case: User dictates how they're feeling → structured symptom entry for a health log.

Why FoundationModels: Privacy is non-negotiable. Health data is the most sensitive category — on-device is not a preference, it's a product and ethical requirement.

@Generable
enum BodyArea {
    case head, chest, abdomen, back, leftArm, rightArm, leftLeg, rightLeg, general
}

@Generable
struct SymptomEntry {
    @Guide(description: "Primary affected body area")
    var bodyArea: BodyArea

    @Guide(description: "Symptom description in normalised clinical language")
    var description: String

    @Guide(description: "Severity 1 (mild) to 10 (severe)", .range(1...10))
    var severity: Int

    @Guide(description: "Duration as spoken, e.g. 'since this morning', 'two days'")
    var duration: String
}

Tools needed: None. Session-per-call.


10. Customer Support — Ticket Triage

Use case: Classify incoming support tickets by category, urgency, and sentiment to route them to the right team.

Why FoundationModels: Classification with semantic understanding. A keyword-based classifier misroutes tickets with indirect language; the model understands context.

@Generable
enum TicketCategory {
    case billing, technicalSupport, accountAccess, featureRequest, complaint, other
}

@Generable
enum CustomerSentiment {
    case positive, neutral, frustrated, angry
}

@Generable
struct TicketClassification {
    @Guide(description: "Primary support category")
    var category: TicketCategory

    @Guide(description: "Urgency 1 (low) to 5 (escalate immediately)", .range(1...5))
    var urgency: Int

    @Guide(description: "Customer emotional tone")
    var sentiment: CustomerSentiment

    @Guide(description: "One-sentence routing note for the support agent")
    var routingNote: String
}

Tools needed: Optional KnowledgeBaseTool to check if similar issues have documented resolutions before routing.


Part 13: Quick Reference & Anti-Patterns

Quick Reference

Key Types

Type | One-liner
SystemLanguageModel | Entry point — SystemLanguageModel.default
SystemLanguageModel.Availability | .available / .unavailable(reason)
LanguageModelSession | Manages one conversation thread; stateful
Instructions | System prompt — set once at session creation
Prompt | User input for a single turn
Response<Content> | Wrapper — always access .content
ResponseStream<Content> | AsyncSequence of Snapshot<Content>
GenerationOptions | temperature, maximumResponseTokens, sampling
GenerationGuide<T> | Constraints on @Guide properties
Transcript | Linear history of all session entries
Tool | Protocol for functions the model can call
SystemLanguageModel.TokenUsage | .tokenCount — cost of instructions/prompt/history

Session Init Cheatsheet

// Fresh session, no tools
LanguageModelSession { "Instructions here" }

// With specific model
LanguageModelSession(model: SystemLanguageModel.default) { "..." }

// With tools
LanguageModelSession(tools: [MyTool()]) { "..." }

// Resume from saved transcript
LanguageModelSession(model: .default, tools: [], transcript: savedTranscript)

respond() vs streamResponse()

 | respond() | streamResponse()
Returns | Response<Content> | ResponseStream<Content>
Best for | Background processing, pipelines | Live UI with typing effect
Partial results | No | Yes (via Snapshot<Content>)
Rate limit risk | Lower | Higher in background tasks
Collect to full response | N/A | .collect()

@Generable vs Raw String

Use @Generable when:

  • You need structured, typed output (multiple fields)
  • You want compile-time guarantees on output shape
  • The response must be parsed/processed programmatically
  • You need constraints (@Guide) on values

Use raw String when:

  • Output is prose for display to the user
  • You're summarising or generating a paragraph
  • Streaming the output for a typing effect

Token Budget Formula

Total = instructions + tool definitions + transcript history + prompt + response

All compete for the same fixed window (~4,096 tokens). Response tokens come out of the same pool as input.
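For a quick sanity check before a call, a rough character-based estimate is often enough — the ~4 characters per token ratio below is a common heuristic for English text, not a framework guarantee:

```swift
// Heuristic token estimate: ~4 characters per token for English prose.
func estimatedTokens(_ text: String) -> Int {
    max(1, text.count / 4)
}

let instructions = "Fix BJJ terms in transcripts."
let prompt = "I drilled the kimora from half guard today."
let reservedForResponse = 500  // leave headroom for the output

let used = estimatedTokens(instructions) + estimatedTokens(prompt) + reservedForResponse
let remaining = 4_096 - used   // must also cover tool definitions + history
```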

Tool vs Pre-Fetch vs Inject

If you... | Do this
Always need the data | Pre-fetch, inject into instructions
Sometimes need the data | Define as Tool
Need data only when asked about it | Define as Tool
Have more than 5 tools | Split into multiple focused sessions

Anti-Patterns

1. Accessing response instead of response.content

respond() returns Response<T>, not T. Always unwrap .content.

let text = try await session.respond(to: prompt)  // Response<String>, not String
text.uppercased()  // ❌ compile error

let text = try await session.respond(to: prompt).content  // ✅ String

2. Storing LanguageModelSession persistently when you don't need history

For stateless tasks (normalisation, extraction, classification), create a new session per call. Persistent sessions accumulate transcript and eventually hit the context limit.
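A sketch of the session-per-call shape — create, use once, discard:

```swift
import FoundationModels

// Stateless task: a fresh session per call, so no transcript accumulates.
@available(iOS 26, *)
func moodLabel(for entry: String) async throws -> String {
    let session = LanguageModelSession {          // created here...
        "Classify the emotional tone of a journal entry in one word."
    }
    return try await session.respond(to: entry).content  // ...used once, then discarded
}
```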

3. Defining too many tools

Each tool definition consumes ~50–100 tokens of context budget, whether used or not. Keep it to 3–5 tools per session. If you have 10 tools, split them across multiple focused sessions.

4. Calling isAvailable or checkAvailability() per-call

Availability checking has overhead and doesn't change mid-session. Check once at service/view init and cache the result.
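One possible shape for this — check once when the service is created and expose a cached flag:

```swift
import FoundationModels

// Availability is evaluated once at init; the cached flag gates later calls.
@available(iOS 26, *)
@MainActor
final class AIFeatureGate {
    let aiEnabled: Bool

    init() {
        aiEnabled = SystemLanguageModel.default.isAvailable
    }
}
```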

5. High temperature for structured/correction tasks

For @Generable types that correct or extract, leave the options at their defaults or set an explicit temperature in the 0.0–0.2 range. High temperature produces creatively varied — but wrong — corrections.
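A sketch, assuming the `options:` parameter on `respond` and the `CorrectedSentence` type from use case 7:

```swift
import FoundationModels

@available(iOS 26, *)
func correctSentence(_ text: String) async throws -> CorrectedSentence {
    let session = LanguageModelSession {
        "Correct the learner's sentence, preserving intended meaning."
    }
    let options = GenerationOptions(temperature: 0.1)  // near-deterministic corrections
    return try await session.respond(
        to: text,
        generating: CorrectedSentence.self,
        options: options
    ).content
}
```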

6. Long, elaborate instructions modelled on frontier model prompts

On a ~3B parameter model, shorter is better. Instructions over ~200 words dilute signal. Explicit rules outperform discursive descriptions.

7. Not testing the fallback path

On most devices today, Apple Intelligence is unavailable. Your non-AI code path is the primary experience for the majority of users. Test it as thoroughly as the AI path.

8. Using FoundationModels where a regex or simple function would do

If the task is a known, fixed pattern (extract a UUID, validate an email, format a date), use a deterministic function. LLM overhead — latency, availability, complexity — is waste for these cases.
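For example, extracting a UUID from a log line is a fixed pattern — a deterministic regex handles it instantly, with no availability check:

```swift
import Foundation

// Deterministic, always available, zero latency — no model required.
func extractUUID(from line: String) -> String? {
    let uuid = /[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/
    return line.firstMatch(of: uuid).map { String($0.output) }
}
```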

9. Propagating @available(iOS 26, *) to SwiftUI views

Swift does not allow @available on stored properties, so giving a @State property a FoundationModels type forces the entire view to require iOS 26. Use the AnyObject? pattern instead and cast inside #available guards.

10. Treating .modelNotReady as permanent

.modelNotReady means the model is downloading. It's transient. Show "not available right now" UI and retry later. Do not show a permanent "unsupported" state for this case.


Part 14: Context Engineering for On-Device AI

The context window is the most important constraint in FoundationModels. Everything else — prompt engineering, temperature, tool design — happens within it. Understanding how to engineer what goes into that window is the difference between a feature that works reliably and one that fails silently on complex inputs.

The Fundamental Constraint

The on-device model has a fixed context window of approximately 4,096 tokens shared across:

instructions + tool definitions + transcript history + current prompt + response

This is roughly 3,000 words (about 12 pages) of total input and output. That sounds like a lot until you try to inject meaningful app data.

A BJJ training app with 116 positions, each with a 200-word description: ~30,000 tokens — 7x the entire context window. Injecting "all your app data" into instructions is not a strategy; it's a crash waiting to happen.

What Breaks First

When you over-fill the context window you get GenerationError.exceededContextWindowSize. But the model also silently degrades before it throws — a model given 3,500 tokens of input in a 4,096 window has only 596 tokens for its response. For most tasks that's enough. For others it's not — and the failure mode is truncation, not an error.

Common over-injection mistakes:

Data | Tokens (approx) | Problem
All SwiftData records (100+ items) | 10,000–50,000 | Massively exceeds window
Full JSON blob of one complex entity | 500–2,000 | May leave little room for response
Entire app configuration/preferences | 200–800 | Unnecessary; most not relevant
Complete conversation history (100 turns) | 2,000–5,000 | Pushes out current prompt

Pattern 1: Select, Don't Dump

The simplest and most impactful change: fetch only what's relevant to the current request.

// ❌ Dumps all 116 positions into context — will throw
let allPositions = try await queryService.fetchAllPositions()
let session = LanguageModelSession {
    "Here are all BJJ positions: \(allPositions.map(\.description).joined(separator: "\n"))"
}

// ✅ Fetches only positions relevant to the current question
let relevantPositions = try await queryService.fetchPositions(
    matching: userQuery,
    limit: 5  // 5 positions × ~200 tokens = ~1,000 tokens — fits comfortably
)
let session = LanguageModelSession {
    "Relevant positions: \(relevantPositions.map(\.summary).joined(separator: "\n"))"
}

Use SwiftData predicates and fetchLimit to constrain what you load before it reaches the context.
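A sketch of that constrained fetch, assuming a hypothetical Position model with a name property:

```swift
import SwiftData

@Model
final class Position {
    var name: String = ""
    var summary: String = ""
    init() {}
}

// Constrain the fetch BEFORE anything can reach the context window.
func fetchRelevantPositions(matching query: String,
                            in context: ModelContext) throws -> [Position] {
    var descriptor = FetchDescriptor<Position>(
        predicate: #Predicate { $0.name.localizedStandardContains(query) }
    )
    descriptor.fetchLimit = 5  // hard cap on what can enter the prompt
    return try context.fetch(descriptor)
}
```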

Pattern 2: Layered Injection

Inject summaries at the top level, with detail available on-demand via tools. The model sees the overview by default and only loads detail when it actually needs it:

// Layer 1 — always injected: position names only (~50 tokens for 116 positions)
let positionNames = positions.map(\.name).joined(separator: ", ")

// Layer 2 — injected only when needed via tool
struct PositionDetailTool: Tool {
    let name = "getPositionDetail"
    let description = "Returns full description and transitions for a named BJJ position."

    @Generable
    struct Arguments {
        var positionName: String
    }

    func call(arguments: Arguments) async -> String {
        // Fetch the full detail only when the model asks for it
        return await loadPositionDetail(arguments.positionName)
    }
}

let session = LanguageModelSession(tools: [PositionDetailTool()]) {
    "Available positions: \(positionNames)"
    "Use getPositionDetail to look up full information about any position."
}

This keeps the base context lean (~50 tokens for names vs 30,000 for all descriptions) while still giving the model access to full detail on demand.

Pattern 3: The Two-Step Compression Pipeline

For tasks that require reasoning over large datasets, compress first, then reason. This only makes sense on-device — with a cloud API you pay per token on both calls and gain nothing. On-device both calls are free and private:

// Step 1: Summarise the large dataset (fresh session, large input is fine)
func summariseTrainingHistory(_ sessions: [TrainingSession]) async throws -> String {
    let session = LanguageModelSession {
        "Summarise this training history in 150 words, highlighting patterns and progress."
    }
    let fullHistory = sessions.map(\.description).joined(separator: "\n\n")
    return try await session.respond(to: fullHistory).content
    // fullHistory might be 5,000 tokens — fills most of the window, but that's fine
    // The output is ~150 tokens
}

// Step 2: Reason with the summary (fresh context, compact input)
func answerWithHistory(question: String, summary: String) async throws -> String {
    let session = LanguageModelSession {
        "Training history summary: \(summary)"  // ~150 tokens
        "Answer questions about training progress based on this summary."
    }
    return try await session.respond(to: question).content
    // Plenty of context headroom for question + answer
}

// Usage
let summary = try await summariseTrainingHistory(recentSessions)
let answer  = try await answerWithHistory(question: userQuestion, summary: summary)

The summary call uses most of its window for the raw data and produces a compact output. The reasoning call has clean context with just the summary. Each call is focused on a single task.

Pattern 4: Pre-Summarise at Write Time

For persistent app data (SwiftData entities), generate summaries when the data is saved and store them alongside the entity. The summary is computed once and reused for every future AI interaction:

@Model
final class TrainingSession {
    var rawNotes: String = ""
    var date: Date = Date()
    var techniques: [String] = []

    // Pre-generated — computed at save time, reused in every AI call
    var aiSummary: String = ""
}

// When saving a session
func saveSession(_ session: TrainingSession) async {
    // Generate summary once at write time
    if #available(iOS 26, *) {
        let model = LanguageModelSession {
            "Summarise this BJJ training session in 50 words."
        }
        let summary = try? await model.respond(
            to: session.rawNotes + "\nTechniques: \(session.techniques.joined(separator: ", "))"
        ).content
        session.aiSummary = summary ?? ""
    }
    modelContext.insert(session)
    try? modelContext.save()
}

// At query time — inject pre-built summaries, not raw notes
func buildSessionContext(recentSessions: [TrainingSession]) -> String {
    recentSessions
        .map { "[\($0.date.formatted())]: \($0.aiSummary)" }
        .joined(separator: "\n")
    // Each summary: ~50 tokens × 10 sessions = 500 tokens — fits comfortably
}

Pre-summarisation at write time means:

  • Zero AI cost at query time — the summary is already there
  • The context load is predictable and bounded by summary length
  • The summary can be updated when the entity changes

Dataset Size Reference

Content | Volume | Approx. Tokens | Fits in Context?
Single entity description | 1 | 200–500 | ✅ Yes
Entity names list | 100 | ~150 | ✅ Yes
Short entity summaries | 10 | ~500 | ✅ Yes
Short entity summaries | 50 | ~2,500 | ⚠️ Tight
Full entity descriptions | 10 | ~2,000 | ⚠️ Tight
Full entity descriptions | 50+ | 10,000+ | ❌ No
Full entity descriptions | 100+ | 20,000+ | ❌ No
Conversation | 10 turns | ~1,000 | ✅ Yes
Conversation | 50 turns | ~5,000 | ❌ No

Decision Tree

Do you need to inject app data into context?
│
├── Yes → How much data?
│   │
│   ├── 1–5 entities, full detail
│   │   └── Inject directly into instructions
│   │
│   ├── 5–20 entities
│   │   ├── Always need all of them? → Inject summaries (pre-generated at write time)
│   │   └── Only need some? → Names in instructions + detail via Tool
│   │
│   └── 20+ entities
│       ├── Need to reason across all of them? → Two-step: summarise first, then reason
│       └── Need specific ones? → Select with predicate, inject summaries for matched
│
└── No → Standard session-per-call, no data injection needed

On-Device vs Cloud: Why This Pattern Is Different

With cloud APIs (OpenAI, Anthropic, Google), the two-step pattern is often not worth it: you pay per token on both calls, and the total cost may be similar to one call with the full data — especially if the summarisation model is also expensive.

On-device, the economics flip:

  • No per-token cost — both calls are free
  • No network latency — both calls run locally, typically in under a second each
  • No privacy concern — data never leaves the device regardless of call count
  • Shared resource — each call consumes system resources and may be rate-limited, so compact contexts are still preferred

This makes on-device AI uniquely suited to multi-step pipelines where cloud would be prohibitively expensive or slow.


Part 15: Advanced Patterns

This section covers patterns that don't fit neatly into any earlier part — actor isolation details, the non-obvious syntax for @Generable enums with associated values, reactive availability monitoring in SwiftUI, chaining model output back as prompt input via PromptRepresentable, and the bounded domain injection pattern for apps with curated entity datasets.


Actor Isolation and call(arguments:) — What Actor Does Your Code Run On?

Understanding actor isolation in FoundationModels matters when your tool or service touches @MainActor-bound state.

Tool.call(arguments:) Is @concurrent

The call(arguments:) method on the Tool protocol is implicitly @concurrent, which means it runs off the main actor — in a generic concurrent executor, not @MainActor. This is deliberate: the model calls your tool during inference, which itself is off the main actor. Calling back to the main actor mid-inference would require a hop, adding latency.

@available(iOS 26, *)
struct TrainingHistoryTool: Tool {
    let name = "getTrainingHistory"
    let description = "Returns recent training sessions."

    @Generable
    struct Arguments {
        var limit: Int
    }

    // This runs @concurrent — NOT on @MainActor
    func call(arguments: Arguments) async -> String {
        // ✅ Pure computation or actor-independent async work is fine here
        let sessions = await fetchSessions(limit: arguments.limit)
        return sessions.map(\.summary).joined(separator: "\n")

        // ❌ Accessing @MainActor-bound state directly will cause a data race warning
        // return self.someMainActorProperty  // won't compile
    }
}

If your tool genuinely needs main-actor state (e.g., reading from a @MainActor service), hop explicitly:

func call(arguments: Arguments) async -> String {
    // Hop to MainActor to read the value, then hop back
    let data = await MainActor.run {
        myMainActorService.currentData
    }
    return process(data)
}

What Actor Does respond() Run On?

LanguageModelSession.respond() is async but has no actor isolation requirement — it is safe to call from any actor context, including @MainActor. Internally, the framework dispatches inference to a background executor automatically.

// ✅ Calling respond() from @MainActor is fine — the framework handles the dispatch
@MainActor
final class NormalisationService {
    func normalise(_ text: String) async -> String {
        let session = LanguageModelSession { "Fix BJJ terms." }
        // respond() is async but not @MainActor — call is fine from here
        let response = try? await session.respond(to: Prompt { text })
        return response?.content ?? text
    }
}

You do not need to manually Task.detach or use Task { @concurrent in ... } before calling respond(). The framework does the right thing automatically.

@MainActor Services Calling Tools — The Safe Pattern

When a @MainActor service needs tools that access non-@MainActor data, the cleanest pattern is to make the tool capture any main-actor dependencies at session creation time (before inference begins), rather than accessing them from within call():

@MainActor
final class CoachingService {
    private let userProfile: UserProfile  // @MainActor bound

    func answer(_ question: String) async -> String {
        // Capture the profile value NOW, on MainActor, before the session runs
        let profileSummary = userProfile.summary  // safe — we're on MainActor

        // The tool closes over the already-captured value — no actor hop needed in call()
        struct ProfileContextTool: Tool {
            let name = "getUserProfile"
            let description = "Returns the user's training profile."
            @Generable struct Arguments {}
            let summary: String  // captured at creation time
            func call(arguments: Arguments) async -> String { summary }
        }

        let session = LanguageModelSession(tools: [ProfileContextTool(summary: profileSummary)]) {
            "Answer BJJ coaching questions using the user's profile."
        }
        return (try? await session.respond(to: question).content) ?? ""
    }
}

This is simpler than hopping to MainActor inside call() and avoids any potential race conditions.


@Generable Enums With Associated Values

The earlier enum examples in Part 4 showed simple case enums (.positive, .neutral, .negative). @Generable also supports enums with associated values — but the syntax has a specific constraint: all associated values must themselves conform to Generable (or be types that @Generable already knows how to handle: String, Int, Double, Bool, arrays of generable types).

Basic Associated Value Enum

@available(iOS 26, *)
@Generable
enum TranscriptCorrection {
    case termCorrection(original: String, corrected: String)
    case spellingFix(original: String, corrected: String)
    case noChange
}

@Generable
struct AnnotatedTranscript {
    @Guide(description: "The corrected transcript text")
    var correctedText: String

    @Guide(description: "Each correction made, with original and corrected forms")
    var corrections: [TranscriptCorrection]
}

The model generates each corrections element as a tagged union — it chooses the case name and then generates the associated values. This is significantly richer than a flat string array for corrections, because the output is fully typed.
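Consuming that output is an ordinary Swift switch — each case carries its own typed data:

```swift
// Assumes the TranscriptCorrection enum defined above.
func describe(_ correction: TranscriptCorrection) -> String {
    switch correction {
    case .termCorrection(let original, let corrected):
        return "Term fix: '\(original)' → '\(corrected)'"
    case .spellingFix(let original, let corrected):
        return "Spelling fix: '\(original)' → '\(corrected)'"
    case .noChange:
        return "No change"
    }
}
```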

Nested @Generable Structs as Associated Values

Associated values can also be @Generable structs:

@available(iOS 26, *)
@Generable
struct DateRange {
    @Guide(description: "Start date in YYYY-MM-DD format")
    var start: String
    @Guide(description: "End date in YYYY-MM-DD format")
    var end: String
}

@Generable
enum ScheduleIntent {
    case singleDay(date: String)
    case dateRange(range: DateRange)
    case recurring(dayOfWeek: String, startTime: String)
    case unspecified
}

@Generable
struct ParsedScheduleRequest {
    @Guide(description: "What the user wants to schedule")
    var activity: String

    @Guide(description: "When the user wants to schedule it")
    var timing: ScheduleIntent
}

When to Use Associated Value Enums vs Flat Structs

Use associated value enums when the output shape is fundamentally discriminated — the presence of one field makes others meaningless. In the ScheduleIntent example above, if the user said "every Monday at 9am", the .recurring case makes date and range meaningless, and a flat struct would leave those fields awkwardly nil.

Use flat @Generable structs with optional properties when most combinations of values are valid. The associated value enum excels when the cases are truly mutually exclusive and each has distinct associated data.

The Constraint: All Associated Values Must Be Generable

If you include a type that is not Generable-conformant as an associated value, the @Generable macro will emit a compile-time error. The fix is always one of:

  1. Add @Generable to the associated type
  2. Change the associated type to a primitive (String, Int, etc.)
  3. Represent it as a separate @Generable struct with its own properties
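For example, assuming Date does not conform to Generable, fix 2 looks like this:

```swift
import Foundation

// ❌ If Date is not Generable, this case would not compile under @Generable:
// case scheduled(on: Date)

// ✅ Fix 2: use a primitive the macro understands; parse to Date after generation.
@available(iOS 26, *)
@Generable
enum ReminderTiming {
    case scheduled(isoDate: String)  // e.g. "2026-03-03"
    case unscheduled
}
```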

Observable Availability Monitoring — Reactive SwiftUI Pattern

SystemLanguageModel is an Observable final class. This means SwiftUI views can react to .availability changes without any additional wiring — the view re-renders automatically when availability changes.

This is useful when you want to show/hide AI features reactively — for example, when the model finishes downloading (.modelNotReady → .available) while the user is already in the app.

Basic Reactive Availability View

@available(iOS 26, *)
struct AIFeatureBadge: View {
    var body: some View {
        // SwiftUI observes SystemLanguageModel.default automatically
        // because it's @Observable — no @StateObject, no manual subscription
        let model = SystemLanguageModel.default

        switch model.availability {
        case .available:
            Label("AI Ready", systemImage: "sparkles")
                .foregroundStyle(.green)
        case .unavailable(.modelNotReady):
            Label("AI Downloading...", systemImage: "arrow.down.circle")
                .foregroundStyle(.yellow)
        case .unavailable(.appleIntelligenceNotEnabled):
            Label("Enable Apple Intelligence", systemImage: "exclamationmark.circle")
                .foregroundStyle(.secondary)
        case .unavailable(.deviceNotEligible):
            EmptyView()  // Don't surface this — it's permanent
        @unknown default:
            EmptyView()
        }
    }
}

Because SystemLanguageModel is @Observable, SwiftUI tracks which properties the body reads and re-renders when they change. No .onReceive, no Combine, no explicit observation setup.

Watching for the Model Becoming Ready

The .task {} modifier is the right tool for reacting to an availability change and triggering a one-time action — for example, kicking off an initial data enrichment pass once the model becomes available:

@available(iOS 26, *)
struct TrainingDashboardView: View {
    @State private var hasRunInitialEnrichment = false

    var body: some View {
        // ... view content ...
        .task {
            // This task runs when the view appears and re-runs if availability changes
            for await _ in SystemLanguageModel.default.availabilityUpdates {
                guard !hasRunInitialEnrichment else { break }
                if SystemLanguageModel.default.isAvailable {
                    await runInitialEnrichment()
                    hasRunInitialEnrichment = true
                }
            }
        }
    }

    private func runInitialEnrichment() async {
        // Generate AI summaries for any entities that don't have them yet
    }
}

Note: If availabilityUpdates is not available on your OS target, use .task(id: SystemLanguageModel.default.availability) as an alternative — the task re-runs when availability changes since Availability is Equatable:

.task(id: SystemLanguageModel.default.availability) {
    guard SystemLanguageModel.default.isAvailable else { return }
    guard !hasRunInitialEnrichment else { return }
    await runInitialEnrichment()
    hasRunInitialEnrichment = true
}

Avoiding the Per-View @available Constraint

The reactive pattern works cleanly with the AnyObject? wrapping approach from Part 1. Keep the Observable observation inside a #available check, or confine it to a view that is itself conditionally shown:

// In the parent view (no iOS 26 requirement):
var body: some View {
    VStack {
        mainContent
        if #available(iOS 26, *) {
            AIStatusBadge()  // only this view requires iOS 26
        }
    }
}

This way the availability-reactive logic is isolated to a specific subview, and the containing view has no version constraint.


PromptRepresentable — Chaining Model Output Back as Input

One of the cleaner architectural patterns enabled by the protocol hierarchy is output-as-input chaining: taking a @Generable type from one call and passing it directly as prompt input to the next call, without any serialisation step.

This works because @Generable types conform to PromptRepresentable (via ConvertibleToGeneratedContent), which means they can appear directly in a @PromptBuilder closure.

Basic Chaining Example

@available(iOS 26, *)
@Generable
struct NormalisedTranscript {
    @Guide(description: "Corrected transcript text")
    var normalisedText: String
    @Guide(description: "BJJ terms found, in canonical form")
    var extractedTerms: [String]
}

@Generable
struct SessionSummary {
    @Guide(description: "One-paragraph summary of the training session")
    var summary: String
    @Guide(description: "Techniques practiced, from the corrected terms")
    var techniquesWorked: [String]
}

// Two-step pipeline: correct → summarise
func processTranscript(_ raw: String) async throws -> SessionSummary {
    // Step 1: Correct BJJ terminology
    let correctionSession = LanguageModelSession {
        "Fix speech-to-text errors in BJJ transcripts. Return corrected text and term list."
    }
    let corrected = try await correctionSession.respond(
        to: Prompt { raw },
        generating: NormalisedTranscript.self
    )

    // Step 2: Summarise — pass the @Generable output directly as prompt input
    // No JSON encoding, no manual string building needed
    let summarySession = LanguageModelSession {
        "Summarise a BJJ training session given a corrected transcript."
    }
    let summary = try await summarySession.respond(
        to: Prompt {
            "Transcript: \(corrected.content)"  // NormalisedTranscript directly in @PromptBuilder
        },
        generating: SessionSummary.self
    )
    return summary.content
}

The \(corrected.content) interpolation works because NormalisedTranscript (a @Generable struct) conforms to PromptRepresentable. The framework serialises it appropriately for the model — you never touch the intermediate representation.

When Chaining Is Worth It

The chain pattern is most valuable when:

  • Output type 1 contains richer structure than a plain string — passing the full NormalisedTranscript (with both normalisedText and extractedTerms) to the next session gives the model more signal than a plain corrected string
  • Each step is a focused, single-task session — staying true to the "one task per session" principle (Part 3) while getting compound results
  • You want typed output at every step — rather than a single sprawling @Generable struct trying to do everything, each step produces its own clean type

Avoid chaining when the first step's output is a plain String — in that case, just use string interpolation normally. The PromptRepresentable chaining is most valuable for multi-property structured output.
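For contrast, a minimal sketch of the plain-string case. Here step 1 returns an untyped response, so ordinary string interpolation is all that is needed; the instructions and function name are illustrative, not from the pipeline above:

```swift
// Step 1's output is a plain String — no PromptRepresentable chaining required
func summariseSimple(_ raw: String) async throws -> SessionSummary {
    let correctionSession = LanguageModelSession {
        "Fix speech-to-text errors in BJJ transcripts. Return only the corrected text."
    }
    // No `generating:` argument, so .content is a String
    let corrected = try await correctionSession.respond(to: Prompt { raw })

    let summarySession = LanguageModelSession {
        "Summarise a BJJ training session given a corrected transcript."
    }
    let summary = try await summarySession.respond(
        to: Prompt { "Transcript: \(corrected.content)" },  // plain String interpolation
        generating: SessionSummary.self
    )
    return summary.content
}
```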


Bounded Domain Injection — The Names-Only Pattern

This is a specialised context engineering pattern for apps that have a fixed, curated, known domain — a set of entities whose names are meaningful and bounded. The insight is that entity names alone are remarkably compact while still giving the model strong domain grounding.

The Core Insight

In Grapla, there are 116 BJJ positions, 150 techniques, 118 submissions, and 141 movements — 525 total entities. Injecting all the descriptions for all 525 entities would require tens of thousands of tokens and overflow the context window many times over.

But injecting just the names is cheap:

Mount, Half Guard, Side Control, Back Mount, Turtle, North-South, Closed Guard,
Open Guard, De La Riva, X-Guard, Butterfly Guard, Single Leg X, ...
Kimura, Armbar, Triangle, Rear Naked Choke, D'Arce, Anaconda, Omoplata, ...
Hip Bump Sweep, Flower Sweep, Scissor Sweep, Pendulum Sweep, ...

A full list of ~525 entity names in CSV format uses approximately 700–900 tokens — well within a 4,096-token window, leaving ample room for instructions, prompt, and response.

Why Names Alone Are Sufficient for Correction Tasks

For a transcript correction service, the model's job is:

  1. Recognise that "kimora" is a garbled version of a known entity
  2. Replace it with the canonical form "Kimura"

The model doesn't need the description of a Kimura to know that "kimora" should be "Kimura". The name list acts as a canonical term index — the model can fuzzy-match against it and apply corrections.

@available(iOS 26, *)
struct BJJEntityNames {
    // Pre-built at app startup from the SwiftData store — reused for every normalisation call
    static let positions = [
        "Mount", "Half Guard", "Side Control", "Back Mount", "Turtle",
        "North-South", "Closed Guard", "Open Guard", "De La Riva", "X-Guard",
        "Butterfly Guard", "Single Leg X", "Full Guard", "Rubber Guard",
        // ... all 116 positions
    ]
    static let techniques = [ /* all 150 */ ]
    static let submissions = [ /* all 118 */ ]
    static let movements   = [ /* all 141 */ ]

    static var allAsCSV: String {
        (positions + techniques + submissions + movements).joined(separator: ", ")
    }
}

@available(iOS 26, *)
final class TranscriptNormalisationService {
    func normalise(_ rawTranscript: String) async -> String {
        let entityNames = BJJEntityNames.allAsCSV  // ~700 tokens

        let session = LanguageModelSession {
            "Fix speech-to-text errors in BJJ training transcripts."
            "Canonical entity names: \(entityNames)"
            "Correct misrecognised terms to their canonical forms. Return only the corrected text."
        }
        // Total instructions: ~750 tokens — leaves ~3,300 tokens for prompt + response
        let response = try? await session.respond(to: Prompt { rawTranscript })
        return response?.content ?? rawTranscript
    }
}

Generalising the Pattern

The bounded domain pattern works whenever your app has a finite, knowable set of canonical terms. Some examples:

App              Bounded Domain                                    Names-Only Size
BJJ app          525 positions/techniques/submissions/movements    ~700 tokens
Recipe app       500 common ingredients                            ~600 tokens
Medical notes    300 ICD-10 conditions (common subset)             ~400 tokens
Developer tool   200 API method names                              ~250 tokens
Music app        400 instruments + musical terms                   ~500 tokens

The test for whether this pattern applies: Can you enumerate all the canonical terms your app cares about? If yes, inject the names list. The model will use it as a correction index without needing any descriptions.
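As an illustration of how little changes across domains, here is a sketch of the same service for the hypothetical recipe app from the table above. IngredientNormaliser and its name list are invented for this example; a real app would load the names from its own data store:

```swift
@available(iOS 26, *)
final class IngredientNormaliser {
    // Assumed to be loaded once from the app's ingredient database
    let ingredientNames: [String]

    init(ingredientNames: [String]) {
        self.ingredientNames = ingredientNames
    }

    func normalise(_ raw: String) async -> String {
        let session = LanguageModelSession {
            "Fix speech-to-text errors in dictated recipes."
            "Canonical ingredient names: \(ingredientNames.joined(separator: ", "))"
            "Correct misrecognised ingredients to their canonical forms. Return only the corrected text."
        }
        let response = try? await session.respond(to: Prompt { raw })
        return response?.content ?? raw
    }
}
```

Only the instructions and the name list change; the shape of the service is identical.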

Names-Only vs Names + Detail

Combine with the Layered Injection pattern (Part 14) when you sometimes need both correction and reasoning about entities:

let session = LanguageModelSession(tools: [PositionDetailTool()]) {
    // Layer 1: names always present (~700 tokens) — enables correction
    "Canonical BJJ entities: \(BJJEntityNames.allAsCSV)"

    // Layer 2: detail available on demand via tool — enables reasoning
    "Use getPositionDetail to look up descriptions, transitions, and techniques for any position."
}

This gives the model correction capability (names) plus on-demand depth (tool) while keeping the base context compact.
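A sketch of what PositionDetailTool could look like. It assumes a String return, which satisfies the Tool protocol's PromptRepresentable output requirement; the inline dictionary is a stand-in for a real SwiftData query:

```swift
@available(iOS 26, *)
struct PositionDetailTool: Tool {
    let name = "getPositionDetail"
    let description = "Look up the description, transitions, and techniques for a BJJ position."

    @Generable
    struct Arguments {
        @Guide(description: "Canonical position name, e.g. 'Half Guard'")
        var positionName: String
    }

    func call(arguments: Arguments) async throws -> String {
        // Stand-in lookup — a real implementation would query the app's SwiftData store
        let details: [String: String] = [
            "Half Guard": "Bottom player controls one of the top player's legs between their own."
        ]
        return details[arguments.positionName]
            ?? "No detail available for \(arguments.positionName)."
    }
}
```

The model only invokes the tool when it decides it needs depth, so the ~700-token name layer stays the fixed cost.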


Experimental Directions

These patterns are worth exploring but untested at scale. They use only FoundationModels — no additional frameworks required.

Sharded parallel sessions. When your vocabulary corpus is too large for a single context but you need full coverage, split it across multiple sessions running concurrently. Each session holds a different shard of the names list. After all sessions return, merge results — prefer any correction over "unchanged", break ties by confidence or frequency. The on-device model's free-per-call economics make this viable in a way that would be expensive with a cloud API.

// Assumes a hypothetical overload that scopes the vocabulary per shard:
// func normalise(_ text: String, vocabulary: [String]) async throws -> String
async let positions = normalise(rawText, vocabulary: BJJEntityNames.positions)
async let techniques = normalise(rawText, vocabulary: BJJEntityNames.techniques)
async let submissions = normalise(rawText, vocabulary: BJJEntityNames.submissions)

let (p, t, s) = try await (positions, techniques, submissions)
let merged = merge(p, t, s)  // your logic for combining corrections
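One possible merge, invented for this sketch: score each shard by how many words it changed relative to the raw input, keep the most-changed shard, and fall back to the raw text when no shard corrected anything. A production version would diff word-by-word and combine corrections from different shards:

```swift
// Naive merge — picks the single shard with the most corrections.
// Ignores conflicting corrections at the same word position.
func merge(raw: String, candidates: [String]) -> String {
    // Count words where a candidate differs from the raw input
    func changedWordCount(_ s: String) -> Int {
        let a = raw.split(separator: " ")
        let b = s.split(separator: " ")
        return zip(a, b).filter { $0 != $1 }.count
    }
    guard let best = candidates.max(by: { changedWordCount($0) < changedWordCount($1) }),
          changedWordCount(best) > 0 else {
        return raw  // every shard returned the text unchanged
    }
    return best
}
```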

Adaptive context budgeting. Before injecting data, measure how much headroom you have with tokenUsage(for:), then fill to a target percentage (e.g. 60% of the window, reserving 40% for prompt + response). Rank your entities by relevance and inject greedily until you hit the budget. This turns context injection from a static decision into a runtime one.

// `instructions` and `rankedEntities` are assumed to be in scope
let instrTokens = try await model.tokenUsage(for: instructions).tokenCount
let window = await model.contextSize
let budget = Int(Double(window) * 0.6) - instrTokens  // 60% target, minus instructions

var injected: [String] = []
var used = 0
for entity in rankedEntities {
    let cost = estimateTokens(entity.name)  // rough heuristic: ~1.3 tokens per word
    guard used + cost <= budget else { break }
    injected.append(entity.name)
    used += cost
}

Transcript as structured cache. Rather than rehydrating a conversation, use a saved Transcript as a compressed knowledge cache — pre-generate a transcript that contains a curated Q&A exchange about your domain (e.g. "what is a Kimura?" → model's answer), then resume from that transcript for every live session. The model starts with pre-baked domain knowledge already in its context, without spending live call tokens to establish it.
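A hedged sketch of the transcript-cache idea. It assumes Transcript can be encoded for persistence and that LanguageModelSession accepts a saved transcript at init; verify both against the SDK before relying on them, and note the priming questions here are examples only:

```swift
@available(iOS 26, *)
enum DomainPrimer {
    // One-time: run a curated Q&A to bake domain knowledge into a transcript
    static func buildCache() async throws -> Data {
        let session = LanguageModelSession {
            "You are a BJJ terminology assistant."
        }
        _ = try await session.respond(to: Prompt { "What is a Kimura?" })
        _ = try await session.respond(to: Prompt { "What is the De La Riva guard?" })
        // Persist the transcript so every future session can start from it
        return try JSONEncoder().encode(session.transcript)
    }

    // Every live session: resume with the pre-baked context already in place
    static func liveSession(from cache: Data) throws -> LanguageModelSession {
        let transcript = try JSONDecoder().decode(Transcript.self, from: cache)
        return LanguageModelSession(transcript: transcript)
    }
}
```

The priming cost is paid once at build time rather than on every live call, at the price of those transcript tokens permanently occupying part of the context window.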

All three patterns are speculative — they depend on how the model handles parallel resource contention, whether adaptive sizing materially improves output quality, and whether transcript rehydration preserves semantic coherence. The #Playground macro is the fastest way to validate any of them before committing to an implementation.


Resources

Official Apple Documentation

WWDC 2025 Sessions:

  • Session 286: Meet the Foundation Models framework
  • Session 301: Deep dive into the Foundation Models framework
  • Session 259: Code-along: Bring on-device AI to your app using the Foundation Models framework

Framework Updates:

  • February 2026: Improved instruction-following, tokenUsage(for:), contextSize, #Playground macro

Key Types at a Glance

Type                              Purpose
SystemLanguageModel               Entry point — access the model, check availability
LanguageModelSession              Manages a single conversation thread with the model
Instructions                      System-level behaviour definition for a session
Prompt                            User input to the model
Response<Content>                 Wrapper around typed model output — use .content
ResponseStream<Content>           Async sequence of partial responses for streaming
GenerationOptions                 Controls temperature, sampling, max tokens
GenerationGuide<T>                Constraint on @Guide properties (min/max/regex)
GeneratedContent                  Untyped structured output — escape hatch
Transcript                        Linear history of a multi-turn session
Tool                              Protocol for functions the model can call during generation
SystemLanguageModel.TokenUsage    Token count for a prompt, instructions, or transcript