I'm trying to optimize some code that performs modular multiplication so that it benefits from SIMD auto-vectorization. That is, I don't want to use any libraries; the compiler should do the job. Here's the smallest verifiable example I could get:

#[inline(always)]
fn mod_mul64(
    a: u64,
    b: u64,
    modulus: u64,
) -> u64 {
    ((a as u128 * b as u128) % modulus as u128) as u64
}

pub fn mul(a: &mut [u64], b: &[u64], modulo: u64) {
    for _ in (0..1000).step_by(4) {
        a[0] = mod_mul64(b[0], a[7], modulo);
        a[1] = mod_mul64(b[1], a[6], modulo);
        a[2] = mod_mul64(b[2], a[5], modulo);
        a[3] = mod_mul64(b[3], a[4], modulo);
        a[4] = mod_mul64(b[4], a[3], modulo);
        a[5] = mod_mul64(b[5], a[2], modulo);
        a[6] = mod_mul64(b[6], a[1], modulo);
        a[7] = mod_mul64(b[7], a[0], modulo);
    }
}

#[allow(unused)]
pub fn main() {
    let a: &mut [u64] = todo!();
    let b: &[u64] = todo!();
    let modulo = todo!();
    mul(a, b, modulo);
    println!("a: {:?}", a);
}

As seen on https://godbolt.org/z/h8zfadz3d, even with optimizations turned on and the target CPU set to native, there are no SIMD instructions in the output. I would expect those to start with v, for vector.

I understand that this mod_mul64 implementation may not be SIMD-friendly. What would be an easy way to modify it so that the compiler vectorizes it automatically?
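To make it concrete what I suspect is not SIMD-friendly, here is a sketch that spells out the two steps hidden inside mod_mul64. The name mod_mul64_steps is made up for illustration and is not part of my real code. My assumption, which I have not verified beyond the Godbolt output, is that both steps are scalar-only on x86-64: the widening 64x64 to 128-bit multiply, and the 128-bit remainder, which I believe is typically lowered to a call to a runtime helper such as __umodti3.

```rust
/// Hypothetical helper: same result as `mod_mul64`, with the two steps
/// that I suspect block vectorization written out separately.
fn mod_mul64_steps(a: u64, b: u64, modulus: u64) -> u64 {
    // Step 1: widening multiply. The full product of two u64 values
    // needs up to 128 bits.
    let wide: u128 = (a as u128) * (b as u128);

    // Split the product only to show that it occupies two 64-bit halves.
    let hi = (wide >> 64) as u64;
    let lo = wide as u64;

    // Step 2: 128-bit remainder. Assumption: on x86-64 this is usually a
    // call into a runtime routine rather than a single instruction.
    let rem = (((hi as u128) << 64) | (lo as u128)) % (modulus as u128);

    // The remainder is strictly less than `modulus`, so it fits in a u64.
    rem as u64
}

fn main() {
    // Inputs chosen so that the product overflows 64 bits.
    let (a, b, m) = (u64::MAX, u64::MAX - 1, 1_000_000_007u64);
    println!("{}", mod_mul64_steps(a, b, m));
}
```

If that reading is right, any rewrite would have to avoid both the 128-bit product and the 128-bit remainder, for example by restricting the operands to a narrower range.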
