đź‘» Souls without bodies, phantom types shenanigans đź‘»

Rédigé par Simon Marechal - 26/04/2024 - dans Outils - Téléchargement

In this article, we will present strange data types that only exist in the realm of types, called phantom types. We will also briefly introduce GADTs, and how to emulate some of their safety guarantees in languages where they are not available. This simple technique can go a long way towards making APIs safer and more expressive.

What can go wrong with EDSLs ?


Let's say we want to write an embedded domain specific language (EDSL) in Rust. This embedded language, without surprise, has an expression type. We would like to have at least the following features for our expression type:

  • A literal expression, that can be an unsigned integer, signed integer, or boolean,
  • a variable,
  • the sum of two expressions,
  • a "lower than" comparison between two expressions.


If we were to model that naively, we could write in Rust:

enum Literal {
    I(i64),
    U(u64),
    B(bool),
}

enum Expr {
    Lit(Literal),
    Var(&'static str),
    Add(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

This is a straightforward implementation, but there is a problem with it. All literals are of the same type, Literal, and all expressions are of the same type, Expr. This means that we could write nonsensical expressions (at least in our language, this might make sense in JavaScript :) that way, such as:

Add(
  Box::new(Lit(Literal::I(12))),
  Box::new(Lit(Literal::B(true))),
)

In a perfect world, we would restrict sums for expressions that represent integers of the same type. It would be a shame to allow this, as one of the main point of using an embedded DSL is to reuse the host language facilities, such as type checking.

Solving the problem with GADTs

GADT stands for generalized algebraic data types, and are something that you can use in some languages of the ML family, such as Haskell. If you know Rust, it is a bit like enums, except each variant can have its own specific type. So, in Haskell, the previous naive definition can be written as:

data Expr a where
  Lit :: a -> Expr a
  Var :: Variable a -> Expr a
  Add :: Expr a -> Expr a -> Expr a
  Lt :: Expr a -> Expr a -> Expr Bool


It might be a bit difficult to read for people unfamiliar with Haskell, but the main difference with the previous Rust version is that we have an additional type parameter, a. It means that the Expr type carries an extra bit of information which is the expression type. This type represents the kind of literal this expression evaluates into.

This is a lot more type-safe than the Rust version because:

  • literal expressions are typed with the literal type,
  • as variables are typed too, the type of the expression is identical to the type of the variable,
  • when adding two expressions, they must be of the same type, and the result is also of the same type,
  • when comparing two expressions, they must be of the same type, and the result is a boolean expression.

We can't really write this in Rust, but in a Rust-like language, it could look like:

enum Expr<A> {
  Lit(A),
  Var(Variable<A>),
}
enum Expr<A: Numerical> {
  Add(Expr<A>, Expr<A>)
}
enum Expr<bool> {
  Lit<X: Ord>(Expr<X>, Expr<X>)
}

More type safety with more code

One way to increase type safety would be to take the previous "Rust-like" expression and monomorphize it. For example, we could have:

enum UIntExpr {
    Lit(u64),
    Var(UIntVariable),
    Add(Box<UIntExpr>, Box<UIntExpr>),
}

enum IntExpr {
    Lit(i64),
    Var(IntVariable),
    Add(Box<IntExpr>, Box<IntExpr>),
}

enum BoolExpr {
    Lit(bool),
    Var(BoolVariable),
    LtU(UIntExpr, UIntExpr),
    LtI(IntExpr, IntExpr),
}

This is indeed much safer, as we can no longer write nonsensical expressions. It is also not that hard to understand. However, this comes at the cost of a lot of code duplication. Not only the type definitions will be duplicated, but all the code that manipulates them will also be. In particular, we will need three interpreter functions, one for each kind of expression. The code of these interpreters will also have a lot of duplication.

While the downside is quite bad in our simple example, in real world situation it will be a lot worse, so much that this solution only really makes sense when few types are allowed.

Introducing phantom types

Phantom types are types associated to a data structure that does not hold elements of that type. The reason they are named this way is that only the spirit (the type) of the data is there, but not its body (value). In a language like Haskell, phantom types are nothing special, for example a typed variables can be declared as:

data Variable a = Variable String

In the variable body, there is only a string, which is the variable name. But in its soul (type), there is an a type variable that does not correspond to any value in the variable definition. This type represents the type of the variable. This is useful so that we can "mark" some variables as representing integer variables, for example.

In Rust, the same can be done, but we need to use a special zero-sized data type, PhantomData:

pub struct Variable<A> {
    pub name: &'static str,
    tp: PhantomData<A>,
}

The tp member is only here to hold the A type, but has no existence at runtime. In other words, it will not be represented in memory. We can now create a builder for this data structure that can create variables of any type:

impl<A> Variable<A> {
    fn new(name: &'static str) -> Self {
        Self {
            name,
            tp: PhantomData,
        }
    }
}

While we can create a variable of any type, the main idea here is that once it has been created, its type cannot change. We can then write code that works on any type of variable, but have them all separate so as to avoid, for example, using a boolean variable where an integer was expected.

Using phantom types to solve our problem

We can now use this tool to solve our problem by creating an internal representation that is not type safe, with an outer shell that we type using a phantom type:

pub enum InnerExpr {
    Lit(Literal),
    Var(&'static str),
    Add(Box<InnerExpr>, Box<InnerExpr>),
    Lt(Box<InnerExpr>, Box<InnerExpr>),
}

pub struct Expr<A> {
    inner: InnerExpr,
    tp: PhantomData<A>,
}

We take care of not exposing the content of Expr<A>, so that it is not possible to build an Expr<A> from another module. We can then expose type safe builders:

impl<A> Expr<A> {
    /// this method is private, and will not be exposed to code outside this module
    fn from_inner(inner: InnerExpr) -> Self {
        Self { inner, tp: PhantomData }
    }
    /// expressions built from variables of type A will also be of type A
    pub fn var(v: Variable<A>) -> Self {
        Expr::from_inner(InnerExpr::Var(v.name))
    }
    /// compared expressions must be of the same type, and the resulting expression is of type bool
    pub fn less_than(self, rhs: Self) -> Expr<bool> {
        Expr::from_inner(InnerExpr::Lt(Box::new(self.inner), Box::new(rhs.inner)))
    }
    /// Added expressions and the resulting expressions are all of the same type
    pub fn add(self, rhs: Self) -> Self {
        Expr::from_inner(InnerExpr::Add(Box::new(self.inner), Box::new(rhs.inner)))
    }
}
// builders for literal values
impl Expr<bool> { pub fn bool(b: bool) -> Self { Expr::from_inner(InnerExpr::Lit(Literal::B(b))) } }
impl Expr<i64> { pub fn int(i: i64) -> Self { Expr::from_inner(InnerExpr::Lit(Literal::I(i))) } }
impl Expr<u64> { pub fn uint(i: u64) -> Self { Expr::from_inner(InnerExpr::Lit(Literal::U(i))) } }

We can now build expressions like:

fn mkexpr(var: Variable<u64>) {
  let a = Expr::uint(12);
  let b = Expr::int(4);
  let c = Expr::var(var);
  let _ = a.add(b); // does not type check, because a and b are of different types
  let d = a.add(c); // does type check!
}

Discussion

By using phantom types, and only exposing invariant-preserving constructors, we managed to increase the type safety of our expression type without increasing the amount of code necessary to process the expression types. This is not as satisfying as having real GADTs, as we need to create two types, write all the builder functions, and make sure that invariants are satisfied instead of relying on the compiler.

However, it strikes a nice balance between type safety and amount of code when compared to the naive, unsafe, version and the safe but verbose solution of declining each structures with every type it can take.

This solution has been illustrated in Rust, but it will work in any language with parametric types where phantom types can be implemented.