Writing a (toy) symbolic interpreter, and solving challenges, part 3

Rédigé par Simon Marechal - 28/07/2021 - dans Outils - Téléchargement
In this installment, we turn the concrete interpreter into a symbolic interpreter. How exciting!

This is a series, please check part 1 and part 2 if you have not done already or fast forward to part 4!

The game plan

We currently have a concrete interpreter for a subset of WASM. In this episode, we will start by making the variables (potentially) symbolic, only to find out there are parts of the interpreter that do not work as expected! After the problems are fixed, we will think about how to run all execution paths in parallel.

At every step, we are going to proceed in a very Haskelly way, by thinking about the type of the function or abstraction we will need, pretending it is possible to implement it (potentially using place holders). Only once we figure out the exact type or API we need will this be implemented.

So do not worry if things that have never been defined are being used! We will defined them properly only later, when we know exactly what they should do.

Going symbolic

All values are currently concrete. As a reminder, a concrete integer is something like 5, whereas a symbolic integer would be an arbitrary expression that, if computed, would produce an integer. For example, a symbolic integer could be "(x + 3) - y", and depend on the value of two variables, x and y. If we knew these values, we could compute the result, but if we do not, for example because they represent undetermined user input, we must keep them in this form.

We would like to make the memory and stack values potentially symbolic, but without choosing a symbolic representation yet. Right now, the memory is represented as a map from integers to bytes, and values are:

data VNum
  = VI32 Word32
  | VI64 Word64
  | VF32 Float
  | VF64 Double

These are concrete values. Let us imagine we have a datatype that represent symbolic expressions, and call it Symbolic x, where the x is the type of the resulting of the expression. If we were to create a good symbolic representation, values would be stored as:

data VNum
  = VI32 (Symbolic Word32)
  | VI64 (Symbolic Word64)
  | VF32 (Symbolic Float)
  | VF64 (Symbolic Double)

Note that the symbolic values are still typed. However, doing this would only allow us to use our interpreter for this Symbolic representation, and we would have to duplicate the code for every other implementation (concrete, symbolic, SMT, ...).

The solution is to keep the underlying representation parametric:

data VNum f
  = VI32 (f Word32)
  | VI64 (f Word64)
  | VF32 (f Float)
  | VF64 (f Double)

Here f is a type constructor1. By working like that, we can reuse the same code for many representations. In the case of the Symbolic representation, we would use VNum Symbolic. For now, let us pretend we will be able to write a good Symbolic data type, and start refactoring the interpreter code based on that new VNum representation. Indeed, the code does not compile anymore!

Resolving values

By mechanically updating the code, we run into a problem, with the parts of the interpreter that deal with memory access and conditional branching:

  If (Inline Nothing) tb fb -> do
       res <- toBool <$> popV
       use (wframe . finstrs) >>= push . Cont
       wframe . finstrs
         .= if res then tb else fb

This code will not type check. Previously, popV would return a plain VNum, and toBool was just:

toBool :: VNum -> Bool
toBool v =
  case v of
    VI32 x -> x /= 0
    VI64 x -> x /= 0
    VF32 x -> x /= 0
    VF64 x -> x /= 0

But now, for example in the case of VI32, instead of a plain Word32 that we can easily compare, we get an f Word32, about which we know nothing! But we really need to know if it is equal to 0 or not, so that we know if we must run the "true block" or the "false block" of the if condition.

The problem is identical for memory accesses, as we do not want to handle symbolic memory accesses, but need concretes addresses.

We thus need a function that can turn an f Bool into a Bool we can use. A good solution in Haskell, is to not worry about how we might achieve this goal, but what should the solution look like. Ideally, we would need a function that looks like:

resolveBool :: f Bool -> Bool

However, this would not work in the general case, as Haskell being a pure language, it would mean we can always extract a boolean value without any side effect. However, in our use case, we already know that we must find a way to say "let's try it with a true value, and then let's try another path with the false value". The correct solution is to assume this will be possible in a given monad, and use:

resolveBool :: f Bool -> m Bool

Where f is the symbolic representation, and m the monad our interpreter runs in. We will think later about how to achieve the desired effects. However, this function can't possible be written for any f and any m. In order to indicate that this only makes sense for a set of f/m pairs, we will create a type-class (remember, those are like traits in Rust, and interfaces in Java):

class RMonad f m where
  -- | resolve a boolean value
  resolveBool :: f Bool -> m Bool

  -- | tentatively resolve an arbitrary value
  resolve :: f a -> m (Maybe a)

The R in RMonad is for "resolve". It allows us to do this:

-- now, toBool has a type of this shape:
-- toBool :: VNum f -> f Bool

decodeInstr :: (MonadError String m, MonadState (WState f) m, MonadReader Module m, RMonad f m) => Instruction Natural -> m ()
decodeInstr i =
  case i of
    If (Inline Nothing) tb fb -> do
      fres <- toBool <$> popV
      res <- resolveBool fres
      use (wframe . finstrs) >>= push . Cont
      wframe . finstrs
        .= if res then tb else fb
    ...

Note that we added the RMonad constraint, which means we can use resolveBool and resolve in the function body. However, once we fix the code, we run into another problem!

An expression type

The following snippet has several problems now:

   IRelOp BS32 opr -> do
      b <- pop32
      a <- pop32
      res <- case opr of
        IEq -> pure $ a == b
        INe -> pure $ a /= b
        ILeU -> pure $ a <= b
        _ -> throwError "Unsupported comparison"
      push $ Value $ VI32 $ if res then 1 else 0

The comparison operations were defined for expressions of type Word32, but are not for arbitrary (f Word32). The solution is again type classes. We can define a type class that holds all the operations we might need:

class Symb f where
  inject :: a -> f a
  (.+:) :: (Ord a, Num a) => f a -> f a -> f a
  (.-:) :: (Ord a, Num a) => f a -> f a -> f a
  (.*:) :: (Ord a, Num a) => f a -> f a -> f a
  (.^:) :: (Ord a, Num a, Bits a) => f a -> f a -> f a
  (.&:) :: (Ord a, Num a, Bits a) => f a -> f a -> f a
  (.|:) :: (Ord a, Num a, Bits a) => f a -> f a -> f a
  (.>>:) :: Bits a => f a -> f Word8 -> f a
  (.<<:) :: Bits a => f a -> f Word8 -> f a
  rotl :: Bits a => f a -> f Word8 -> f a
  rotr :: Bits a => f a -> f Word8 -> f a
  (.==:) :: Eq a => f a -> f a -> f Bool
  (./=:) :: Eq a => f a -> f a -> f Bool
  (.<=:) :: Ord a => f a -> f a -> f Bool
  u32tou8 :: f Word32 -> f Word8
  u8tou32 :: f Word8 -> f Word32
  i32tou32 :: f Int32 -> f Word32
  u32toi32 :: f Word32 -> f Int32
  oneif :: (Ord a, Num a) => f Bool -> f a

That way, we can rewrite the code like:

decodeInstr :: (MonadError String m, MonadState (WState f) m, MonadReader Module m, Symb f, RMonad f m) => Instruction Natural -> m ()
decodeInstr i =
  case i of
    IRelOp BS32 opr -> do
      b <- pop32
      a <- pop32
      res <- case opr of
        IEq -> pure $ a .==: b
        INe -> pure $ a ./=: b
        ILeU -> pure $ a .<=: b
        _ -> throwError "Unsupported comparison"
      push $ Value $ VI32 $ oneif res
    ...

There are several items of interest in the implementation choice:

  • we added a Symb constraint to our function;
  • the oneif (returns 1 is the inner expression is true, 0 otherwise) function is not really required, as we can always "resolve" a boolean value with the resolveBool function we defined just above, and push on the stack either 1 or 0 depending on the result. This will however make a huge difference when actually running the code, as it will limit the amount of branches;
  • the inject functions let us push concrete values into the symbolic type, which is useful for constants in our program;
  • all conversions between integer types are made explicit, so as to avoid surprises when, for example, sign extension would be required.

Running a concrete interpreter with taint tracking

As the changes to the decodeInstr function have been minimal, we would like to check it still works by making it run the concrete interpreter. However, we are also going to add some taint tracking to make things more interesting! Here is the corresponding data type:

data Tainted a = Tainted
  { _content :: a, -- actual value
    _taint :: S.Set String -- set of taints that are applied to this value
  }

Of course, this is, again, a toy. The interpreter is general enough to support multiple, unintended, execution models, but as they are unintended, there are many optimizations that can't be applied. For example, the result of the multiplication of a tainted value with 0 is still tainted, or a memory access that is indexed by a tainted value will not be tainted (unless the accessed memory is tainted).

The relevant code is very much straightforward, and can be followed here.

Running an actual symbolic interpreter

Now for the piece de resistance, how to make our interpreter run parallel paths? The first part is creating a data structure that can hold symbolic values. It can be seen here, and it is not particularly interesting. It is a direct encoding of the Symb interface, with an operator precedence aware pretty printer, a very simple interpreter and simplifier.

Now that it is out of the way, how can we run different parallel paths, with our interpreter? As a recap, our goal here is that we run the code linearly, until we hit a situation where the control flow changes based on a boolean value that can't be resolved to a concrete boolean value. When this happens, we will run both paths, noting down the chosen value for each path. We will describe the process one could follow to reach that goal. If you are not interested in the decision process details, you can safely skip to the next section.

Designing the monad

The interpret function has the following signature:

interpret
  :: (MonadError String m, MonadState (WState f) m, MonadReader Module m, ...)
  => m [VNum f]

The key insight is that we can choose our monad m, which means we can choose the kind of computation the function does, without modifying it. We plan to use a generic m, that can work for other architectures, and that has the following features:

  • given the function dependencies, the monad m must have a modifiable state, a read-only structure, and be able to throw string errors ;
  • it must be able to go through several paths of execution, while remembering which path has been taken ;
  • on the technical side, it must be a Monad, which means it also must be a Functor and an Applicative.

There are several ways to do that, with different compromises with regards to complexity of implementation, performance, type inference, and so on. For this article, we will do something very basic, that is not recommended in production (for Haskellers, one should exploit some free monad library to do that in practice), by directly encoding the interfaces we want our m to exhibit, by creating a type with a constructor for each function we will need to implement.

For example, in order to have a modifiable state of type s, we simply look at the interface we need to implement:

class Monad m => MonadState s m | m -> s where
  get :: m s
  put :: s -> m ()

Our data type would be:

data M s a where
  Get :: M s s
  Put :: s -> M s ()

Now if we want to add reader:

class Monad m => MonadReader r m | m -> r where
  ask :: m r
  local :: (r -> r) -> m a -> m a

Our type becomes:

data M s r a where
  Get :: M s r s
  Put :: s -> M s r ()
  Ask :: M s r r
  Local :: (r -> r) -> M s r a -> M s r a

By repeating this process for all the constraints, we end up with this data type. The only part that took a bit of reflection was the last constructor, for running though several execution path:

Multi :: (Enum a, Bounded a, Eq a, Show a) => f a -> MP f r s a

The semantics of this constructor is that we expose a symbolic value of type f a, where f is the symbolic type, and a the underlying type, and run an execution path for each possible value a could take (this is possible because the type variable a represents something that is enumerable and bounded). Of course, while this is pleasantly generic, this will only work properly with types with very few values, such as booleans.

Next, one can write all the required instances, and finally check that it is possible to run our interpreter this way, by making sure the following type checks:

interpret :: (Symb f, Evaluable f) => MP f Module (WState f) [VNum f]

Designing the interpreter

Now that we have a monad with all the required characteristics, we want to write an interpreter. We know that we need the read-only value and an initial state as arguments, so our interpreter should have a signature that looks like:

evalMP :: r -> s -> MP f r s a -> ????

Now the output should have the following characteristics:

  • it should return a list of items, where each item represents a distinct path;
  • for each path, it should return all the decisions that have been taken. These decisions can be represented as a symbolic value too;
  • for each path, it should either return an error, or the result along with the updated state.

The type should be really close to this:

evalMP :: r -> s -> MP f r s a -> [([f Bool], Either String (s, a))]

Once this is decided, writing the evaluator is very mechanic, with only slight changes to this signature. The main difficulty resides in interpreting the Multi constructor:

Multi var ->
  [ ((var .==: inject a) : constraints, Right (s, a))
  | a <- [minBound .. maxBound]
  ]

We basically return a list, where each item corresponds to a possible value for var. In this item, the actual selected value is returned, and a constraint is appended. In the case of booleans, with no initial constraints, we would have:

[ ( [var .==: inject False], Right (s, False) )
, ( [var .==: inject True],  Right (s, True) )
]

Running the symbolic interpreter

For this phase, I slightly modified the program so that it uses our symbolic interpreter, and so that each ? character is converted to a symbolic value. Let's test it with concrete values:

$ stack run /path/to/keygenme.wasm username password
----
constraints
result
0

So, there are no constraints, and the result is 0, which means a failure. Now let's try to inject a symbolic value:

$ stack run /path/to/keygenme.wasm username 'password?'
----
constraints
u32(a18) == 0
result
0
----
constraints
u32(a18) != 0
result
0

The last character is a symbolic value that has been involved only once in branching decisions, as there are two paths. It is represented as the a18 variable. The program puts the username and password into the interpreter memory, as a C-like string (terminated by a 0 byte). The number after the a is the memory address. The memory layout is that username starts at 0, and password starts at length username + 1.

The first path represents the case where our symbolic value is equal to 0, and the second when it is different from 0. Sadly, in both cases, the result is 0 :(

What if we try with more symbolic variables? In that cases, a lot of paths are traversed:

$ stack run /path/to/keygenme.wasm username '????????'
...
----
constraints
u32(a16) != 0
u32(a15) == 0
u32(a14) == 0
u32(a13) == 0
u32(a12) == 0
u32(a11) == 0
u32(a10) == 0
oneif(u32(a10) == 0) | oneif((u32(a10) - 57) != 0) != 0
result
oneif((u32(a10) - 57) == 0)
...

This case clearly returns 0, as the variable a10 must be equal to 0, which means that u32(a10) - 57 will not be equal to 0. We are really close to solving the first part of this challenge, as will be demonstrated in the next episode!

Conclusion

We managed to run the challenge with symbolic variables, but unfortunately, the results are huge and hard to interpret. However, all the hard work is now finished, and the solution is in sight. In the next episode, we will write a better simplifier to solve part 1, and wire an SMT solver to solve part 2 of the challenge. Stay tuned!

  • 1. That means it is indexed by another type. For example, in most typed languages, lists are indexed by the type of their elements. Haskell is a bit more powerful than most languages here, as you can abstract over the type of the constructor. For example, if we write a function that has a parameter f Int, it can be a list of integers, a promise that returns an integer, etc.