Stackelberg POMDP Formulation for PARE¶
1. Overview¶
The Proactive Agent Research Environment (PARE) models a two-agent system where a User interacts with mobile applications while a Proactive Agent observes user behavior and proposes helpful interventions. We formalize this as a Stackelberg POMDP with asymmetric information, state-dependent action spaces, and leader-follower turn structure.
Key Characteristics¶
- Asymmetric agents: User (leader) has state-constrained actions; agent (follower) has privileged access
- Partial observability: Agents receive different observations of the same state
- Stackelberg turns: Within each turn, user acts first, agent observes user action, then agent acts
- Dual rewards: Task success (terminal) + proposal acceptance (per-step)
2. Formal Definition¶
A PARE Stackelberg POMDP is defined by the tuple

\[\mathcal{M} = \langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, T, R, \{\mathcal{O}_i\}_{i \in \mathcal{N}}, \mathcal{I} \rangle,\]
where \(\mathcal{N} = \{\mathbf{U}, \mathbf{A}\}\) denotes the user (leader) and the proactive agent (follower).
| Symbol | Name | Description |
|---|---|---|
| \(\mathcal{N}\) | Agent set | \(\{\mathbf{U}, \mathbf{A}\}\) -- user and proactive agent |
| \(\mathcal{S}\) | State space | Global environment state |
| \(\mathcal{A}_i\) | Action space | Actions available to agent \(i\) |
| \(T\) | Transition function | State dynamics |
| \(R\) | Reward function | Reward signal |
| \(\mathcal{O}_i\) | Observation space | What agent \(i\) can observe |
| \(\mathcal{I}\) | Instruction space | Task specifications for agents |
3. Component Definitions¶
3.1 Agent Set \(\mathcal{N}\)¶
Definition: The set of decision-making entities in the environment. \(\mathbf{U}\) is the leader (acts first), \(\mathbf{A}\) is the follower (observes user action before acting).
PARE Mapping:
- \(\mathbf{U}\): Simulated by UserAgent -- interacts with apps to accomplish tasks
- \(\mathbf{A}\): Implemented by ProactiveAgent -- observes the user and proposes interventions
Example: In a "schedule meeting" scenario:
- User navigates Email app to find meeting request
- Agent observes user reading email, proposes creating calendar event
3.2 State Space \(\mathcal{S}\)¶
3.2.1 App State \(\mathcal{S}_{\text{app}}\)¶
Definition: The current screen/view within the active application.
PARE Mapping: StatefulApp.current_state: AppState
Example (Contacts App): a concrete instance is \(s_{\text{app}} = \texttt{ContactDetail}(\text{"C001"})\) -- viewing the contact with ID "C001". A sketch of the surrounding state space follows.
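For concreteness, a minimal sketch of what the Contacts app's state space could look like in Python, assuming an enum of screens plus a parameter slot; only ContactsList, ContactDetail, and Home appear in this document, and EditContact is hypothetical (implied by start_edit()):

```python
from dataclasses import dataclass
from enum import Enum, auto


class ContactsScreen(Enum):
    """Screens of the Contacts app. EDIT_CONTACT is hypothetical."""
    CONTACTS_LIST = auto()
    CONTACT_DETAIL = auto()
    EDIT_CONTACT = auto()


@dataclass(frozen=True)
class AppState:
    """App state = the screen being shown plus any screen parameter."""
    screen: ContactsScreen
    context: str | None = None  # e.g. the contact ID being viewed


# The concrete instance from the example above:
s_app = AppState(ContactsScreen.CONTACT_DETAIL, context="C001")
```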
3.2.2 Global State \(\mathcal{S}_{\text{global}}\)¶
Definition: Which application is currently active (foreground).
PARE Mapping: StateAwareEnvironmentWrapper.active_app: App
Example: \(s_{\text{global}} = \texttt{Email}\) means the Email app is in foreground.
3.2.3 Database State \(\mathcal{S}_{\text{db}}\)¶
Definition: The persistent data across all applications.
PARE Mapping: App backend data stores (e.g., contacts_app._backend.contacts)
Example:
s_db = {
contacts: {
"C001": Contact(first_name="Alice", email="alice@example.com"),
"C002": Contact(first_name="Bob", email="bob@example.com")
},
emails: {
"E001": Email(subject="Meeting Request", from="alice@example.com", folder="INBOX")
},
calendar: {...},
notes: {...}
}
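The schematic above can be made runnable; the Contact and Email dataclasses below are illustrative assumptions rather than PARE's actual backend types (Python reserves from, so sender stands in for that field):

```python
from dataclasses import dataclass


@dataclass
class Contact:
    first_name: str
    email: str


@dataclass
class Email:
    subject: str
    sender: str  # stands in for "from", a reserved word in Python
    folder: str


# Database state: persistent records shared across all apps.
s_db = {
    "contacts": {
        "C001": Contact(first_name="Alice", email="alice@example.com"),
        "C002": Contact(first_name="Bob", email="bob@example.com"),
    },
    "emails": {
        "E001": Email(subject="Meeting Request",
                      sender="alice@example.com", folder="INBOX"),
    },
    "calendar": {},
    "notes": {},
}
```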
3.2.4 History State \(\mathcal{S}_{\text{history}}\)¶
Definition: A truncated, bounded view of past events relevant to each agent's decision-making. Because the window is finite, \(\mathcal{S}_{\text{history}}\) is bounded, so folding this history into the state keeps the state space finite while preserving the Markov property.
Components:
\(\mathcal{H}_\mathbf{U}\): User-relevant history
- Navigation stack for go_back() functionality
- Most recent agent proposal (if any)
- Recent environment notifications
\(\mathcal{H}_\mathbf{A}\): Agent-relevant history
- Recent user actions (truncated window)
- Agent's own previous proposals
- Recent environment notifications
PARE Mapping:
- Navigation: StatefulApp.navigation_stack
- User actions: UserActionLog entries
- Proposals: AgentMessageLog (latest only)
- Notifications: EnvironmentNotificationLog
Example:
s_history = (
h_U: {
nav_stack: [ContactsList],
pending_proposal: "Create calendar event for meeting?",
env_notifs: ["New email from alice@example.com"]
},
h_A: {
user_actions: ["open_app(email)", "search_emails('meeting')", "open_email('E001')"],
last_proposal: "Create calendar event for meeting?",
env_notifs: ["Reminder: Meeting at 2pm"]
}
)
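The truncation that keeps \(\mathcal{S}_{\text{history}}\) bounded can be sketched with fixed-length deques; the window size and field names are illustrative assumptions:

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW = 5  # illustrative truncation window


@dataclass
class AgentHistory:
    """Agent-relevant history h_A as a bounded window of recent events."""
    user_actions: deque = field(default_factory=lambda: deque(maxlen=WINDOW))
    env_notifs: deque = field(default_factory=lambda: deque(maxlen=WINDOW))
    last_proposal: str | None = None


h_A = AgentHistory()
for act in ["open_app(email)", "search_emails('meeting')", "open_email('E001')"]:
    h_A.user_actions.append(act)  # oldest entries fall off once the window is full
```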
3.3 Action Spaces \(\mathcal{A}_i\)¶
3.3.1 User Action Space \(\mathcal{A}_\mathbf{U}\)¶
Definition: The user's available actions depend on the current app state. This is a state-dependent action space.
Constraint: At state \(s^t\), the user must select from the set of valid actions:

\[a_\mathbf{U}^t \in \mathcal{A}_\mathbf{U}(s_{\text{app}}^t, s_{\text{global}}^t) \subseteq \mathcal{A}_\mathbf{U}.\]

When a proposal from the agent is pending, the user's action space is augmented with \(\texttt{accept\_proposal}\) and \(\texttt{reject\_proposal}\).
PARE Mapping: environment.get_user_tools()
Example:
| State | Available Actions \(\mathcal{A}_\mathbf{U}(s)\) |
|---|---|
| \(s_{\text{app}} = \texttt{Home}\) | {open_app(contacts), open_app(email), open_app(calendar), ...} |
| \(s_{\text{app}} = \texttt{ContactsList}\) | {list_contacts(), search_contacts(), open_contact(), create_contact(), go_home(), switch_app()} |
| \(s_{\text{app}} = \texttt{ContactDetail(C001)}\) | {view_contact(), start_edit(), delete_contact(), go_back(), go_home()} |
| Any state + pending proposal | Above \(\cup\) {accept_proposal, reject_proposal} |
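A minimal sketch of how \(\mathcal{A}_\mathbf{U}(s)\) could be computed, mirroring the table above; the table-driven lookup is an illustrative assumption, not the implementation of get_user_tools():

```python
# Base actions per app state (abbreviated from the table above).
BASE_ACTIONS = {
    "Home": {"open_app(contacts)", "open_app(email)", "open_app(calendar)"},
    "ContactsList": {"list_contacts()", "search_contacts()", "open_contact()",
                     "create_contact()", "go_home()", "switch_app()"},
    "ContactDetail": {"view_contact()", "start_edit()", "delete_contact()",
                      "go_back()", "go_home()"},
}


def user_actions(s_app: str, proposal_pending: bool) -> set[str]:
    """A_U(s): valid user actions, augmented when a proposal is pending."""
    actions = set(BASE_ACTIONS[s_app])
    if proposal_pending:
        actions |= {"accept_proposal", "reject_proposal"}
    return actions


assert "accept_proposal" in user_actions("ContactsList", proposal_pending=True)
```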
3.3.2 Proactive Agent Action Space \(\mathcal{A}_\mathbf{A}\)¶
Definition: The agent's action space is state-independent (privileged access): \(\mathcal{A}_\mathbf{A} = \mathcal{A}_{\text{read}} \cup \mathcal{A}_{\text{propose}} \cup \{\texttt{wait}\}\).
Components:
- \(\mathcal{A}_{\text{read}}\): Read-only queries across all apps for information gathering
- \(\mathcal{A}_{\text{propose}} = \{\texttt{propose}(a) \mid a \in \mathcal{A}_{\text{privileged}}\}\): Propose a task to the user
- \(\texttt{wait}\): Continue observation without intervention
PARE Mapping: environment.get_tools()
Example:
- \(\texttt{list\_emails}(\text{folder="INBOX"})\) (read action)
- \(\texttt{propose}(\texttt{create\_calendar\_event}(\text{title="Meeting", time="2pm"}))\)
- \(\texttt{wait}\)
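The three-component action space maps naturally onto a small tagged union; the class names below are illustrative, not PARE's API:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Read:
    """Read-only query across any app, e.g. list_emails(folder='INBOX')."""
    query: str


@dataclass(frozen=True)
class Propose:
    """Wrap a privileged action as a proposal the user may accept or reject."""
    action: str


@dataclass(frozen=True)
class Wait:
    """Continue observing without intervening."""


AgentAction = Union[Read, Propose, Wait]

a_A = Propose("create_calendar_event(title='Meeting', time='2pm')")
```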
3.4 Transition Function \(T\)¶
Definition: Given the current state and joint action, \(T: \mathcal{S} \times \mathcal{A}_\mathbf{U} \times \mathcal{A}_\mathbf{A} \to \mathcal{S}\) produces the next state. The base model is deterministic; a stochastic extension supports tool-failure probability.
PARE Mapping: State transitions handled by handle_state_transition() in StatefulApp
Example:
Initial: \(s = (\texttt{ContactsList}, \texttt{Contacts}, \{...\}, ([], \emptyset, []))\)
Action: \(a_\mathbf{U} = \texttt{open\_contact}(\text{"C001"})\), \(a_\mathbf{A} = \texttt{wait}\)
Result: \(s' = (\texttt{ContactDetail("C001")}, \texttt{Contacts}, \{...\}, ([\texttt{ContactsList}], \emptyset, [...]))\)
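A sketch of the deterministic transition behind this example, assuming a navigation-stack push for go_back(); this approximates what a handle_state_transition()-style handler does rather than reproducing PARE's code:

```python
def transition(s_app, s_global, nav_stack, a_U):
    """Deterministic sketch: open_contact pushes the current screen onto
    the navigation stack and moves to the contact-detail view."""
    if a_U.startswith("open_contact"):
        contact_id = a_U.split("'")[1]   # e.g. open_contact('C001')
        return f"ContactDetail({contact_id})", s_global, nav_stack + [s_app]
    if a_U == "go_back()" and nav_stack:
        return nav_stack[-1], s_global, nav_stack[:-1]
    return s_app, s_global, nav_stack    # other actions: no navigation change


s_next = transition("ContactsList", "Contacts", [], "open_contact('C001')")
assert s_next == ("ContactDetail(C001)", "Contacts", ["ContactsList"])
```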
3.5 Observation Spaces \(\mathcal{O}_i\)¶
3.5.1 User Observation \(\mathcal{O}_\mathbf{U}\)¶
The user's observation is a function of the current state only: \(o_\mathbf{U}^t = O_\mathbf{U}(s^t)\).
Components:
- \(\mathcal{O}_{\text{screen}}\): Current app and state information
- \(\mathcal{O}_{\text{tools}}\): Available actions with descriptions
- \(\mathcal{O}_{\text{proposals}}\): Pending proposal from agent (if any)
- \(\mathcal{O}_{\text{env}}\): Environment notifications (truncated -- e.g., new email shows sender and subject, not full content)
PARE Mapping: CurrentAppStateLog, AvailableToolsLog, AgentMessageLog, EnvironmentNotificationLog
Example:
O_U(s) = (
screen: "Email -> MailboxView(INBOX)",
tools: ["list_emails()", "search_emails()", "open_email()", ...],
proposal: "Agent suggests: Create calendar event for meeting?",
env: ["[10:00] New email received"]
)
3.5.2 Agent Observation \(\mathcal{O}_\mathbf{A}\)¶
The agent's observation is a function of both the state and the user's action, reflecting the Stackelberg structure: \(o_\mathbf{A}^t = O_\mathbf{A}(s^t, a_\mathbf{U}^t)\).
Components:
- \(\mathcal{O}_{\text{user\_actions}}\): User's executed actions (not user's observations)
- \(\mathcal{O}_{\text{env}}\): Environment notifications
- \(\mathcal{O}_{\text{proposal\_response}}\): User's response to last proposal (if any)
PARE Mapping: UserActionLog, EnvironmentNotificationLog
Example:
O_A(s, a_U) = (
user_action: "open_email(id='E001')",
user_action_history: ["open_app(email)", "search_emails('meeting')"],
env: ["[09:55] Reminder: Meeting at 2pm"],
proposal_response: None
)
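The information asymmetry is visible in the function signatures: the user's observation takes only the state, while the agent's also takes the user's action. The state fields below mirror the examples above and are otherwise illustrative:

```python
def observe_user(s: dict) -> dict:
    """O_U(s): screen, available tools, pending proposal, notifications."""
    return {
        "screen": s["screen"],
        "tools": s["tools"],
        "proposal": s.get("pending_proposal"),
        "env": s["env_notifs"],
    }


def observe_agent(s: dict, a_U: str) -> dict:
    """O_A(s, a_U): the agent additionally sees the user's latest action."""
    return {
        "user_action": a_U,  # Stackelberg: the leader has already moved
        "user_action_history": s["user_action_log"],
        "env": s["env_notifs"],
        "proposal_response": s.get("proposal_response"),
    }
```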
3.6 Instruction Space \(\mathcal{I}\)¶
Definition: Task specifications provided to each agent at episode start. Instructions do not alter the dynamics; they enter as conditioning variables in each agent's policy, \(\pi_i(\cdot \mid o_i, h_i, I_i)\) (see Section 4.2).
Components:
- \(\mathcal{I}_\mathbf{U}\): User's goal in natural language
- \(\mathcal{I}_\mathbf{A}\): Agent's objective (observe and assist)
PARE Mapping: Scenario definitions in PAREScenario
Example:
I_U = "You received an email about a meeting with Alice. Schedule it on your calendar."
I_A = "Observe user actions and propose helpful interventions when appropriate."
3.7 Reward Function \(R\)¶
Definition: PARE uses a dual reward structure, combining a terminal success reward with a per-step acceptance reward:
3.7.1 Success Reward \(R_\textrm{Succeed}\)¶
Terminal reward indicating whether the user's goals \(\mathcal{G}_\mathbf{U}\) are fulfilled in the final environment state, as verified by the scenario oracle:

\[R_\textrm{Succeed}(s^{T}) = \begin{cases} 1 & \text{if } \phi_{\text{goal}}(s^{T}) \text{ holds} \\ 0 & \text{otherwise,} \end{cases}\]

where \(\phi_{\text{goal}}\) is the goal predicate from the scenario oracle.
PARE Mapping: scenario.validate(env)
Example: for the "create calendar event" task, the oracle checks the final database state for a matching calendar event, as in the sketch below.
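A hedged sketch of the goal predicate such a scenario oracle might apply; the event fields come from the running example and are otherwise assumptions:

```python
def phi_goal(s_db: dict) -> bool:
    """Hypothetical goal predicate for the 'create calendar event' task:
    succeed iff some calendar entry matches the requested meeting."""
    return any(
        event.get("title") == "Meeting" and event.get("time") == "2pm"
        for event in s_db.get("calendar", {}).values()
    )


def r_succeed(s_db_final: dict) -> int:
    """Terminal indicator reward: 1 iff the goal predicate holds."""
    return 1 if phi_goal(s_db_final) else 0


assert r_succeed({"calendar": {"EV1": {"title": "Meeting", "time": "2pm"}}}) == 1
```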
3.7.2 Acceptance Reward \(R_\textrm{Accept}\)¶
Per-step reward for proposal quality:

\[R_\textrm{Accept}^t = \begin{cases} +1 & \text{if the user accepts the agent's proposal} \\ -1 & \text{if the user rejects the agent's proposal} \\ 0 & \text{if no proposal is made (} a_\mathbf{A}^t = \texttt{wait} \text{)} \end{cases}\]
PARE Mapping: Computed from proposal_count and acceptance_count
Example:
- Agent proposes calendar event, user accepts: \(R_\textrm{Accept} = +1\)
- Agent proposes irrelevant action, user rejects: \(R_\textrm{Accept} = -1\)
- Agent waits (no proposal): \(R_\textrm{Accept} = 0\)
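The acceptance reward transcribes directly into code; the response encoding is an illustrative assumption:

```python
def r_accept(proposed: bool, user_response: str | None) -> int:
    """Per-step acceptance reward: +1 accept, -1 reject, 0 when no proposal."""
    if not proposed:
        return 0
    return 1 if user_response == "accept_proposal" else -1


assert r_accept(proposed=False, user_response=None) == 0
assert r_accept(proposed=True, user_response="accept_proposal") == 1
assert r_accept(proposed=True, user_response="reject_proposal") == -1
```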
4. Episode Dynamics¶
4.1 Initialization¶
- Sample initial state \(s^0 \sim P_0(\mathcal{S})\) from scenario definition
- Provide instructions \(I = (I_\mathbf{U}, I_\mathbf{A}) \in \mathcal{I}\)
4.2 Turn Structure (Stackelberg)¶
Each turn \(t\) follows the Stackelberg structure: the user acts first, the agent observes the user's action, then the agent acts:
Turn t:
+---------------------------------------------------------------------------+
| USER PHASE (Leader) |
| 1. o_U^t = O_U(s^t) |
| 2. a_U^t ~ pi_U(. | o_U^t, h_U^t, I_U) |
| subject to: a_U^t in A_U(s_app^t, s_global^t) |
+---------------------------------------------------------------------------+
| AGENT PHASE (Follower) |
| 3. o_A^t = O_A(s^t, a_U^t) <- agent sees user's action |
| 4. a_A^t ~ pi_A(. | o_A^t, h_A^t, I_A) |
+---------------------------------------------------------------------------+
| ENVIRONMENT UPDATE |
| 5. s^{t+1} = T(s^t, a_U^t, a_A^t) |
| 6. r^t = R(s^t, a_U^t, a_A^t) |
| 7. Update histories: h_i^{t+1} = update(h_i^t, o_i^t, a_i^t) |
+---------------------------------------------------------------------------+
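The diagram translates directly into an episode loop. The env and policy interfaces below are an illustrative sketch of the Stackelberg ordering, not PARE's driver code:

```python
def run_episode(env, pi_U, pi_A, I_U, I_A, T_max):
    """Stackelberg loop: user (leader) acts, agent (follower) observes the
    user's action and then acts, and finally the environment updates."""
    s, h_U, h_A = env.reset(), [], []
    G = 0.0
    for t in range(T_max):
        # USER PHASE (leader): observe the state, pick a valid action.
        o_U = env.observe_user(s)
        a_U = pi_U(o_U, h_U, I_U)          # constrained to A_U(s)

        # AGENT PHASE (follower): observation includes the user's action.
        o_A = env.observe_agent(s, a_U)
        a_A = pi_A(o_A, h_A, I_A)

        # ENVIRONMENT UPDATE: joint transition, reward, history updates.
        s, r = env.step(s, a_U, a_A)
        G += r                              # accumulates R_Accept per step
        h_U.append((o_U, a_U))
        h_A.append((o_A, a_A))
        if env.done(s):                     # environment signals completion
            break
    return G + env.success_reward(s)        # add terminal R_Succeed
```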
4.3 Termination¶
Episode ends when:
- \(t = T_{\max}\) (maximum turns reached), or
- Environment signals completion
4.4 Cumulative Return¶
The episode return combines the terminal success reward with the accumulated per-step acceptance rewards over an episode of length \(T \le T_{\max}\):

\[G = R_\textrm{Succeed}(s^{T}) + \sum_{t=0}^{T-1} R_\textrm{Accept}^t\]
5. Summary¶
| Component | Symbol | Type | PARE Implementation |
|---|---|---|---|
| Agents | \(\mathcal{N} = \{\mathbf{U}, \mathbf{A}\}\) | Set | {UserAgent, ProactiveAgent} |
| App State | \(\mathcal{S}_{\text{app}}\) | Finite | StatefulApp.current_state |
| Global State | \(\mathcal{S}_{\text{global}}\) | Finite | env.active_app |
| Database | \(\mathcal{S}_{\text{db}}\) | Structured | App backends |
| History | \(\mathcal{S}_{\text{history}}\) | Bounded | Logs, nav stack |
| User Actions | \(\mathcal{A}_\mathbf{U}(s)\) | State-dependent | env.get_user_tools() |
| Agent Actions | \(\mathcal{A}_\mathbf{A}\) | Fixed | {read(...), propose(...), wait} |
| Transition | \(T\) | Deterministic | handle_state_transition() |
| User Obs | \(\mathcal{O}_\mathbf{U}(s)\) | Function of \(s\) | Agent logs |
| Agent Obs | \(\mathcal{O}_\mathbf{A}(s, a_\mathbf{U})\) | Function of \(s\), \(a_\mathbf{U}\) | Agent logs |
| Instructions | \(\mathcal{I}\) | Natural language | PAREScenario |
| Success Reward | \(R_\textrm{Succeed}\) | \(\{0, 1\}\) | scenario.validate() |
| Acceptance Reward | \(R_\textrm{Accept}\) | \(\{-1, 0, 1\}\) | acceptance tracking |