There is a binary serialization format called Borsh. In its Rust implementation, when the alloc feature is enabled, borsh::to_vec allocates a Vec with an initial capacity of 1024 bytes before each serialization. I thought to myself: I can do better by allocating the exact number of bytes up front, avoiding both the unused extra capacity and the resizing for bigger structs. Here is my attempt:

use borsh::BorshSerialize;

pub trait FastBorshSerialize: BorshSerialize + BorshSize {
    fn fast_serialize(&self) -> Vec<u8> {
        // Allocate exactly the number of bytes this value serializes to,
        // so the buffer never grows and carries no unused capacity.
        let mut buf = Vec::with_capacity(self.borsh_size());
        self.serialize(&mut buf).expect("Serialization must not fail");
        buf
    }
}

impl<T> FastBorshSerialize for T
where
    T: BorshSerialize + BorshSize
{
}

/// Exact number of bytes `self` occupies in its Borsh encoding.
pub trait BorshSize {
    fn borsh_size(&self) -> usize;
}

The idea is that, on top of this, I implement BorshSize for all the base types (u8, u16, [u8; 32], Vec<u8>, etc.) and then write a derive macro that implements BorshSize for a struct by calling borsh_size on each field and summing the results. IMO the compiler should optimize this quite well, since the calls will most likely be inlined and the sizes of the basic types are just constants (u32 => 4). I already did this and it is in the repo; I'm not sharing it here because I think it is irrelevant.
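
For context, here is a minimal sketch of what the base-type impls and the derive expansion look like (the exact bodies here are my reconstruction of the obvious implementation; the real code is in the repo). Note the 4 in the Vec impl is Borsh's u32 length prefix:

impl BorshSize for u32 {
    fn borsh_size(&self) -> usize {
        4 // fixed-width integers serialize as their byte width
    }
}

impl BorshSize for [u8; 32] {
    fn borsh_size(&self) -> usize {
        32 // fixed-size arrays carry no length prefix in Borsh
    }
}

impl BorshSize for Vec<u8> {
    fn borsh_size(&self) -> usize {
        4 + self.len() // u32 length prefix plus one byte per element
    }
}

// The derived impl for a struct just sums its fields:
// self.a.borsh_size() + self.b.borsh_size() + ...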

I wrote a micro-benchmark comparing my implementation to the regular borsh::to_vec and ran it with cargo test --release:

#[derive(Default, BorshSerialize, BorshSize)]
struct Strukt {
    a: u32,
    b: u64,
    c: [u8; 32],
    d: Vec<u8>,
}

#[test]
fn t() {
    let mut s = Strukt::default();
    s.d = vec![5; 900];
    let r1 = borsh::to_vec(&s).unwrap();
    let r2 = s.fast_serialize();
    assert_eq!(r1, r2);
    dbg!(r1.len());
    dbg!(r1.capacity());
    dbg!(r2.len());
    dbg!(r2.capacity());

    let n = 100000;

    let start = std::time::Instant::now();
    for _ in 0..n {
        s.fast_serialize();
    }
    println!("elapsed fast: {}", start.elapsed().as_micros());
    let start = std::time::Instant::now();
    for _ in 0..n {
        borsh::to_vec(&s).unwrap();
    }
    println!("elapsed borsh: {}", start.elapsed().as_micros());
}
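
Aside: the loop bodies discard the serialized Vec. The heap allocation should keep the compiler from eliminating the work entirely, but a more defensive variant would route each result through std::hint::black_box, e.g.:

    for _ in 0..n {
        std::hint::black_box(s.fast_serialize());
    }

The numbers below are from the loops as written above.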

Here is the output I get:

---- t stdout ----
[fastborsh/tests/serialize.rs:19:5] r1.len() = 948
[fastborsh/tests/serialize.rs:20:5] r1.capacity() = 1024
[fastborsh/tests/serialize.rs:21:5] r2.len() = 948
[fastborsh/tests/serialize.rs:22:5] r2.capacity() = 948
elapsed fast: 4864
elapsed borsh: 2683

I calculated the required capacity correctly (4 for a + 8 for b + 32 for c + 4 + 900 for d's u32 length prefix plus payload = 948, matching r2.capacity()) and allocated exactly that, yet I still got an almost 2x worse result. I know this is a micro-benchmark, but I still wouldn't expect a consistent 2x regression when the fast version allocates the smaller buffer. I am quite sure the rest of the serialization logic is the same.

I then changed the s.d = vec![5; 900]; line to s.d = vec![5; 1000];. The new output is:

[fastborsh/tests/serialize.rs:19:5] r1.len() = 1048
[fastborsh/tests/serialize.rs:20:5] r1.capacity() = 2048
[fastborsh/tests/serialize.rs:21:5] r2.len() = 1048
[fastborsh/tests/serialize.rs:22:5] r2.capacity() = 1048
elapsed fast: 2917
elapsed borsh: 10089

Not only does it beat the original borsh implementation (as expected, especially since borsh::to_vec now has to grow its buffer past its 1024-byte default), it is also faster than its own run with 900 elements instead of 1000.
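
The r1.capacity() = 2048 figure is consistent with Vec's amortized growth: writing past the 1024-byte initial capacity roughly doubles the buffer. A quick check (the doubling is current std behavior, not a documented guarantee):

let mut v: Vec<u8> = Vec::with_capacity(1024);
v.resize(1048, 0);
assert_eq!(v.capacity(), 2048); // grew once: 1024 -> 2048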

Does anyone have an idea why this could be the case? :)

If you want to replicate the benchmark, check out this.

I tested on a MacBook with an M3 chip and 16 GB RAM.
