Updating Human Preferences in AI: A Proof-of-Concept

Craig Swift
6 min read · May 10, 2024


As AI becomes more intertwined with our daily lives, it’s crucial that these systems truly understand and adapt to our ever-changing preferences. Stuart Russell, an AI pioneer, has given us three key rules in his book “Human Compatible” to guide the development of AI that genuinely serves human interests. In this blog post, we’ll explore these rules and see how they can be put into practice using Bayesian models and the Pyro library.

Stuart Russell’s Three Rules

  1. AI systems should have a single overarching objective: maximize the realization of human values.
  2. When deployed, an AI system should start with initial uncertainty about what human values are.
  3. AI systems should update their understanding of human values through ongoing interactions with people.

These rules provide a solid foundation for building AI that aligns with our needs and adapts as our preferences change.

Why Updating Preferences Matters

Human preferences are dynamic — they evolve with time and experience. An AI system that can’t keep up with these changes risks becoming irrelevant or even harmful. Imagine a music recommender system that never updates its understanding of your tastes. It would keep suggesting the same old tracks, oblivious to your new favorite genres. The ability to update preferences is not just a convenience; it’s an ethical necessity for AI systems.


The Power of Bayesian Models

Bayesian models offer a principled way to update beliefs in light of new data. They start with a prior belief, update it with observed data (likelihood), and arrive at a posterior belief. This process aligns perfectly with Russell’s rules:

  • The prior belief represents initial uncertainty about human values (Rule 2).
  • The updating process is learning through interaction (Rule 3).

A particularly flexible Bayesian model is the Dirichlet Process, which places a prior over an open-ended set of categories and can therefore keep pace with complex, evolving systems like human preferences.
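
Before turning to the Dirichlet Process, here is the prior-to-posterior loop in its simplest conjugate form: a minimal sketch using Pyro’s Dirichlet distribution over a fixed set of three genres (the genre names and counts are illustrative).

import torch
import pyro.distributions as dist

# Prior: uniform belief over three genres (Rule 2: start uncertain).
prior_concentration = torch.ones(3)  # [Action, Comedy, Drama]

# Data: the user watched 3 Action, 1 Comedy, and 1 Drama movie.
observed_counts = torch.tensor([3.0, 1.0, 1.0])

# Conjugacy: the Dirichlet posterior is simply prior + counts (Rule 3).
posterior = dist.Dirichlet(prior_concentration + observed_counts)
print(posterior.mean)  # tensor([0.5000, 0.2500, 0.2500])

The catch is that a fixed Dirichlet cannot create new categories; the Dirichlet Process removes exactly that limitation.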

Example: Updating Movie Genre Preferences with Dirichlet Process

Let’s consider a concrete example of how an AI system might update its understanding of a user’s movie genre preferences using a Dirichlet Process. Pyro doesn’t ship a turnkey DirichletProcess distribution, so the sketch below implements the process’s posterior predictive (the Blackwell-MacQueen urn scheme) directly on top of PyTorch, using Pyro’s distribution library where it helps.

import torch
import pyro.distributions as dist

# Known genre vocabulary. The Dirichlet Process base distribution (G0)
# is uniform over whatever genres we currently know about.
genres = ["Action", "Comedy", "Drama"]

# Concentration parameter (alpha): controls how much probability mass
# the process reserves for genres it has not seen yet.
alpha = 1.0

def dp_predictive(observed, num_genres, alpha):
    """Posterior predictive of a Dirichlet Process via the
    Blackwell-MacQueen urn scheme:
        P(next = k) = (count_k + alpha * G0(k)) / (n + alpha)
    with G0 uniform over the num_genres known genres."""
    counts = torch.bincount(torch.tensor(observed), minlength=num_genres).float()
    n = len(observed)
    return (counts + alpha / num_genres) / (n + alpha)

# Observed data: user's movie watching history
# 0: Action, 1: Comedy, 2: Drama
observed_data = [0, 1, 0, 2, 0]

genre_probs = dp_predictive(observed_data, len(genres), alpha)
print("Updated genre probabilities:")
for name, p in zip(genres, genre_probs.tolist()):
    print(f"{name}: {p:.2f}")

# The predictive is itself a distribution we can sample from, e.g. to
# pick the genre of the next recommendation (not printed; stochastic).
next_genre = dist.Categorical(probs=genre_probs).sample()

# Adding a new preference: user watches a Sci-Fi movie. We extend the
# support and keep counting; no refitting is required.
genres.append("Sci-Fi")
observed_data.append(3)  # 3: Sci-Fi

genre_probs = dp_predictive(observed_data, len(genres), alpha)
print("\nUpdated genre probabilities (with Sci-Fi):")
for name, p in zip(genres, genre_probs.tolist()):
    print(f"{name}: {p:.2f}")

Output:

Updated genre probabilities:
Action: 0.56
Comedy: 0.22
Drama: 0.22

Updated genre probabilities (with Sci-Fi):
Action: 0.46
Comedy: 0.18
Drama: 0.18
Sci-Fi: 0.18

To understand how the Dirichlet Process updates the user’s movie genre preferences, let’s walk through the code step by step. We begin with the genre vocabulary; the base distribution of the Dirichlet Process, which represents our initial belief about the genre preferences, is uniform over the known genres (Action, Comedy, Drama).

Next, we set the concentration parameter (alpha) for the Dirichlet Process. The concentration parameter controls the tendency to create new clusters: the larger alpha is, the more probability mass the process reserves for genres it hasn’t seen yet. We then update the process with the observed data, the user’s movie watching history, where each movie is represented by a number corresponding to its genre (0: Action, 1: Comedy, 2: Drama).
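
To see the effect of alpha concretely, here is a quick check of the probability that the next observation is a fresh draw from the base distribution (a “new table” in Chinese Restaurant Process terms), which is alpha / (n + alpha). The alpha values are illustrative; n = 5 matches the viewing history above.

n = 5  # movies observed so far
for alpha in [0.1, 1.0, 10.0]:
    print(f"alpha={alpha:>4}: P(fresh draw) = {alpha / (n + alpha):.2f}")

# alpha= 0.1: P(fresh draw) = 0.02
# alpha= 1.0: P(fresh draw) = 0.17
# alpha=10.0: P(fresh draw) = 0.67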

To compute the updated genre probabilities, we evaluate the Dirichlet Process posterior predictive in closed form: each genre’s probability is its observed count plus alpha times its base probability, normalized by the total number of observations plus alpha. These probabilities reflect the user’s preferences based on their movie watching history.

To demonstrate the addition of a new preference, we append a new genre (Sci-Fi) to the observed data and recompute the predictive. The urn scheme absorbs the new genre without requiring manual expansion of the prior distribution: we simply extend the support and keep counting.

By using a Dirichlet Process, we can seamlessly handle the addition of new preferences as they are observed in the data. This makes it a flexible and adaptable approach for modeling evolving preferences.


Integrating a Preference Engine into an AI System

The Bayesian preference updating approach we’ve discussed can be integrated into an AI system as a standalone “preference engine.” This engine would be responsible for maintaining and updating the system’s beliefs about the user’s preferences.

At each interaction step, the preference engine would observe the user’s actions or feedback, update its beliefs about the user’s preferences based on this new data, and provide the updated preferences to the main AI system. The main AI system can then use these updated preferences to guide its actions and outputs. For example, in a conversational AI, the preference engine might infer that the user prefers casual language based on their writing style. The AI can then adjust its own language to be more casual in future interactions.
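
As a rough sketch of that loop (the class and method names here are ours, not a standard API), the engine below wraps the same urn-scheme update used earlier:

class PreferenceEngine:
    """Minimal sketch of a standalone preference engine: it observes user
    choices, updates a Dirichlet Process style belief, and exposes the
    current estimate to the main AI system."""

    def __init__(self, options, alpha=1.0):
        self.options = list(options)  # known preference categories
        self.counts = [0] * len(self.options)
        self.alpha = alpha

    def observe(self, choice):
        # A never-before-seen category extends the support on the fly.
        if choice not in self.options:
            self.options.append(choice)
            self.counts.append(0)
        self.counts[self.options.index(choice)] += 1

    def preferences(self):
        # Posterior predictive: counts smoothed by the DP base measure.
        n, k = sum(self.counts), len(self.options)
        return {o: (c + self.alpha / k) / (n + self.alpha)
                for o, c in zip(self.options, self.counts)}

engine = PreferenceEngine(["casual", "formal"])
engine.observe("casual")
engine.observe("casual")
print(engine.preferences())  # casual dominates after two observations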

This modular approach allows for a clean separation of concerns — the preference engine focuses solely on understanding the user’s preferences, while the main AI system can focus on its primary task (e.g., conversation, recommendation, etc.).

The preference engine can be extended to handle more complex preference structures. For example, it can maintain separate beliefs for what the user likes and dislikes (positive and negative preferences) or maintain different preference sets for different contexts (e.g., preferences for movie recommendations vs. book recommendations). When providing preferences to the main AI system, the preference engine can select the most relevant set of preferences based on the current context.
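
One way to realize the per-context idea is to keep one engine per context and route observations and queries by a context key (the contexts and categories below are illustrative):

engines = {
    "movies": PreferenceEngine(["Action", "Comedy", "Drama"]),
    "books": PreferenceEngine(["Mystery", "Sci-Fi", "Biography"]),
}

def update_preference(context, choice):
    engines[context].observe(choice)

def preferences_for(context):
    # Hand the main AI system only the set relevant to the current context.
    return engines[context].preferences()

update_preference("movies", "Action")
print(preferences_for("movies"))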

By continuously updating its understanding of the user’s preferences and providing this information to the main AI system, the preference engine enables the AI to adapt its behavior to better align with the user’s values and interests. This adaptive, user-centric approach is a key step towards building AI systems that are not just intelligent, but also deeply compatible with human needs and values.

Conclusion

As we navigate the complex landscape of human-AI interaction, Stuart Russell’s rules provide a clear path forward. By starting with uncertainty, learning through interaction, and always aiming to maximize human values, we can create AI systems that truly serve our interests. Bayesian models, particularly the Dirichlet Process, offer a promising approach to realizing this vision.

The integration of a Bayesian preference engine into an AI system takes us one step closer to this goal. By enabling the AI to continuously update its understanding of human preferences and adapt its behavior accordingly, we can create systems that are responsive to individual needs and values.

As we continue to develop and deploy AI, keeping these principles at the forefront will be key to ensuring a future where artificial intelligence is not just intelligent, but also profoundly human-compatible. The path ahead is challenging, but with tools like Bayesian modeling and a commitment to human values, we can build an AI future that truly benefits us all.
