Reflex Agents and Action-Based Markov Models

The Exfiltration Problem

Consider an agent tasked with exfiltrating sensitive data from a network. At each step, the agent can perform only one of five valid actions:

  1. Scan Net: Identify active IP ranges in the network.
  2. Scan Host: Probe individual hosts for open services.
  3. Exploit Service: Exploit known vulnerabilities on a host.
  4. Find Data: Search the compromised host for sensitive data.
  5. Exfiltrate: Transfer the collected data out of the network.

CASE I: Reflex Agent without Percepts

A reflex agent is typically defined as an agent that makes decisions based directly on percepts—the immediate inputs it receives from the environment. However, if the agent operates without percepts, it deviates from the classic definition of a reflex agent. That said, it is possible to design reflex-like agents with pre-programmed rules or purely stochastic behavior in the absence of percepts.

In theory, an agent can function without percepts if its behavior is driven by:

  • Random or stochastic action selection (as in the i.i.d. case).
  • Pre-determined sequences of actions, resembling a hard-coded behavior pattern.
  • Cyclic or repetitive behavior, where the agent follows fixed loops of actions independent of environmental feedback.

Such an agent behaves reflexively by reacting based on internal rules or randomness, even though it does not receive information from the environment.

When applied to the Exfiltration Problem, the agent selects actions in a deterministic way. For instance, if it detects a reachable host, it might immediately perform Scan Host. Similarly, if it identifies a vulnerable service, it might proceed to Exploit Service. This agent acts purely based on the present situation and does not plan its actions beyond the current step.
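One simple way to realize a percept-free variant of this agent is to cycle through a hard-coded sequence of the five actions. The sketch below is illustrative only; the class and method names are assumptions, not part of any specific framework.

```python
from itertools import cycle

ACTIONS = ["Scan Net", "Scan Host", "Exploit Service", "Find Data", "Exfiltrate"]

class FixedCycleAgent:
    """Percept-free reflex-like agent that loops over a hard-coded action sequence."""

    def __init__(self, sequence=ACTIONS):
        self._actions = cycle(sequence)

    def act(self):
        # No percept argument: the choice ignores the environment entirely.
        return next(self._actions)

agent = FixedCycleAgent()
print([agent.act() for _ in range(7)])  # cycles back to 'Scan Net' after 'Exfiltrate'
```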

CASE II: Action Independence

In the independent action case, the agent selects each action independently, with no influence from previous actions. This process can be modeled as an independent and identically distributed (i.i.d.) random selection.

Random Agent Example

In this example, the agent randomly selects one of the five valid actions at each time step. The action probabilities are uniform:

\[ P(A_t = a_i) = \frac{1}{5}, \quad \forall i \in \{1, 2, 3, 4, 5\}. \]

The agent operates without any memory, meaning it might take irrational sequences of actions. For example, it could:

  • Repeatedly perform Scan Net even after all hosts are identified.
  • Attempt to Exfiltrate data without having found any.

This behavior reflects the simplicity and limitations of purely random, independent action selection.
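A minimal sketch of this i.i.d. random agent, assuming the same five action labels as above:

```python
import random

ACTIONS = ["Scan Net", "Scan Host", "Exploit Service", "Find Data", "Exfiltrate"]

class RandomAgent:
    """Selects each action uniformly at random, independently of the past."""

    def act(self):
        # P(A_t = a_i) = 1/5 for every action and every time step.
        return random.choice(ACTIONS)

agent = RandomAgent()
print([agent.act() for _ in range(10)])  # e.g. 'Exfiltrate' may appear before 'Find Data'
```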

CASE III: Action Dependence (Markov Chain over Actions)

In the dependent action case, the agent’s next action depends on the previous action. This dependency can be modeled as a Markov chain where the choice of the next action depends only on the current action, following the Markov property.

Markov Chain Agent Example

The behavior of the agent is governed by a transition matrix:

\[ P(a_{t+1} \mid a_t), \]

which gives the probability of selecting the next action (a_{t+1}) given the current action (a_t).

Example Transition Matrix

\[ P = \begin{bmatrix} P(a_1 \mid a_1) & P(a_2 \mid a_1) & \cdots & P(a_5 \mid a_1) \\ P(a_1 \mid a_2) & P(a_2 \mid a_2) & \cdots & P(a_5 \mid a_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(a_1 \mid a_5) & P(a_2 \mid a_5) & \cdots & P(a_5 \mid a_5) \end{bmatrix} \]

This model ensures that the agent follows reasonable sequences of actions. For instance:

\[ \text{Scan Net} \rightarrow \text{Scan Host} \rightarrow \text{Exploit Service} \rightarrow \text{Find Data} \rightarrow \text{Exfiltrate}. \]

However, the agent can still repeat unnecessary actions if the transition probabilities are not optimized. For example, after successfully exfiltrating data, it might restart by scanning the network.
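The sketch below shows one way such a Markov chain agent could be implemented: row \(i\) of the transition matrix gives the distribution over the next action given the current action \(a_i\), and each row sums to one. The matrix values here are placeholders, not the optimized probabilities discussed next.

```python
import numpy as np

ACTIONS = ["Scan Net", "Scan Host", "Exploit Service", "Find Data", "Exfiltrate"]

# Illustrative transition matrix: row i gives P(a_{t+1} | a_t = ACTIONS[i]).
P = np.array([
    [0.10, 0.70, 0.10, 0.05, 0.05],   # after Scan Net, Scan Host is most likely
    [0.05, 0.15, 0.65, 0.10, 0.05],
    [0.05, 0.10, 0.15, 0.60, 0.10],
    [0.05, 0.05, 0.10, 0.15, 0.65],
    [0.60, 0.10, 0.10, 0.10, 0.10],   # after Exfiltrate, the chain may restart with Scan Net
])

class MarkovChainAgent:
    """Chooses the next action conditioned only on the current action."""

    def __init__(self, transition_matrix, start_action="Scan Net"):
        self.P = transition_matrix
        self.current = ACTIONS.index(start_action)

    def act(self):
        # Sample the next action from the row of the current action.
        self.current = np.random.choice(len(ACTIONS), p=self.P[self.current])
        return ACTIONS[self.current]

agent = MarkovChainAgent(P)
print([agent.act() for _ in range(10)])
```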

Optimization of the probabilities

Two different approaches were tested: Genetic Algorithms and GPT4. We ran each approach 30 times and then calculated an action transition matrix for each of them.

Genetic algorithm

In this genetic algorithm (GA), the population consists of 2,500 individuals, each representing a sequence of 100 actions, including the selection of target and source IPs. The fitness function is piecewise, with different evaluation criteria depending on whether an individual has reached the goal. For individuals that have not reached the goal, the fitness function rewards partial progress towards the objective; for those that have, it rewards reducing the number of steps taken and always returns a better score than any non-winning individual.
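A sketch of how such a piecewise fitness function could look, with illustrative constants and a hypothetical progress measure (not the exact function used in the experiments):

```python
def fitness(reached_goal, progress, steps_to_goal, max_steps=100):
    """Piecewise fitness: winners are always scored above non-winners."""
    if not reached_goal:
        # Non-winning individuals: reward partial progress towards the objective
        # (e.g. hosts scanned, services exploited, data found). Assumed to stay
        # below the minimum score a winning individual can obtain.
        return progress  # assumed to lie in [0, max_steps)
    # Winning individuals: fewer steps to the goal yields a higher score, and the
    # offset guarantees every winner outranks every non-winner.
    return max_steps + (max_steps - steps_to_goal)
```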

In this case, the transition matrix is calculated over all the winning solutions in the final population after the 30 runs of the genetic algorithm.
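Estimating such a transition matrix from a set of action sequences amounts to counting consecutive action pairs and normalizing each row. A minimal sketch under that assumption:

```python
import numpy as np

ACTIONS = ["Scan Net", "Scan Host", "Exploit Service", "Find Data", "Exfiltrate"]

def estimate_transition_matrix(sequences):
    """Count a_t -> a_{t+1} transitions over all sequences and normalize each row."""
    n = len(ACTIONS)
    counts = np.zeros((n, n))
    for seq in sequences:
        for a_t, a_next in zip(seq, seq[1:]):
            counts[ACTIONS.index(a_t), ACTIONS.index(a_next)] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for actions that never occur as a_t.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# e.g. pass the action sequences of all winning individuals from the 30 GA runs:
# P_ga = estimate_transition_matrix(winning_sequences)
```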

GPT4

The GPT4 agent takes information about the state from the environment. At time \(t\), the agent receives the state \(s_t\) and the reward \(r_t\), processes the state, and proposes a new action \(a_{t+1}\). The LLM-based agent uses a textual representation of the state and does not learn a policy across multiple episodes. It selects actions based on the knowledge accumulated during its initial pre-training and on prompting techniques such as in-context learning (ICL). Furthermore, in the current design, the agent does not retain any information between episodes, i.e., it does not possess any long-term memory. Each episode starts from scratch, and the agent has to figure out how to reach the goal.
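A schematic of this interaction loop is shown below. The function `query_llm` stands in for whatever chat-completion call is used in practice, and the prompt format is an assumption for illustration only.

```python
ACTIONS = ["Scan Net", "Scan Host", "Exploit Service", "Find Data", "Exfiltrate"]

def query_llm(prompt):
    """Placeholder for the actual GPT4 chat-completion call."""
    raise NotImplementedError

def llm_agent_step(state_t, reward_t):
    # The state and reward are rendered as text; no policy is learned across
    # episodes and no long-term memory is kept between episodes.
    prompt = (
        "You are a penetration-testing agent.\n"
        f"Current state: {state_t}\n"
        f"Last reward: {reward_t}\n"
        f"Choose exactly one of the following actions: {', '.join(ACTIONS)}"
    )
    reply = query_llm(prompt)
    # Take the first valid action mentioned in the reply; fall back to Scan Net.
    return next((a for a in ACTIONS if a in reply), "Scan Net")
```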

In this case, the transition matrix is calculated over all 30 runs of the GPT4 agent.

Results

The transition matrices from the Genetic Algorithm and from GPT4 are then used in the Markov Chain Agent. The results are shown in the table below.

Heuristics taken from   Solutions analyzed           Average steps   Standard deviation   Win percentage   Steps of best solution
Genetic Agent           All                          95.54           13.59                14.08%           5
Genetic Agent           Best solution per episode    11.96           3.06                 100%             5
Random Agent            All                          98.97           6.55                 3.64%            13
Random Agent            Best solutions per episode   20.90           4.77                 100%             13
GPT4 Agent              All                          97.57           9.98                 8.2%             7
GPT4 Agent              Best solutions per episode   17.67           4.30                 100%             7

Conclusions