I'm trying to optimize some code that performs modular multiplication so that it benefits from SIMD auto-vectorization. That is, I don't want to use any libraries; the compiler should do the job. Here's the smallest verifiable example I could get:
#[inline(always)]
fn mod_mul64(
    a: u64,
    b: u64,
    modulus: u64,
) -> u64 {
    ((a as u128 * b as u128) % modulus as u128) as u64
}

pub fn mul(a: &mut [u64], b: &[u64], modulo: u64) {
    for _ in (0..1000).step_by(4) {
        a[0] = mod_mul64(b[0], a[7], modulo);
        a[1] = mod_mul64(b[1], a[6], modulo);
        a[2] = mod_mul64(b[2], a[5], modulo);
        a[3] = mod_mul64(b[3], a[4], modulo);
        a[4] = mod_mul64(b[4], a[3], modulo);
        a[5] = mod_mul64(b[5], a[2], modulo);
        a[6] = mod_mul64(b[6], a[1], modulo);
        a[7] = mod_mul64(b[7], a[0], modulo);
    }
}

#[allow(unused)]
pub fn main() {
    let a: &mut [u64] = todo!();
    let b: &[u64] = todo!();
    let modulo = todo!();
    mul(a, b, modulo);
    println!("a: {:?}", a);
}
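The todo!() placeholders are only there so the snippet compiles on Godbolt; at run time they panic. For anyone who wants to execute the routine locally, here is a minimal sketch with concrete inputs. The values of a, b and modulo, and the helper name run_example, are arbitrary assumptions of mine and are not part of the Godbolt example.

```rust
// Same routine as above: widen to u128, multiply, then reduce.
#[inline(always)]
fn mod_mul64(a: u64, b: u64, modulus: u64) -> u64 {
    ((a as u128 * b as u128) % modulus as u128) as u64
}

// Same loop as above. Note that both slices need at least 8 elements,
// otherwise the indexing panics.
pub fn mul(a: &mut [u64], b: &[u64], modulo: u64) {
    for _ in (0..1000).step_by(4) {
        a[0] = mod_mul64(b[0], a[7], modulo);
        a[1] = mod_mul64(b[1], a[6], modulo);
        a[2] = mod_mul64(b[2], a[5], modulo);
        a[3] = mod_mul64(b[3], a[4], modulo);
        a[4] = mod_mul64(b[4], a[3], modulo);
        a[5] = mod_mul64(b[5], a[2], modulo);
        a[6] = mod_mul64(b[6], a[1], modulo);
        a[7] = mod_mul64(b[7], a[0], modulo);
    }
}

// Hypothetical driver: the inputs below are assumed values chosen only
// so that the code runs; any nonzero modulus would do.
pub fn run_example() -> Vec<u64> {
    let mut a: Vec<u64> = vec![1, 2, 3, 4, 5, 6, 7, 8];
    let b: Vec<u64> = vec![11, 13, 17, 19, 23, 29, 31, 37];
    let modulo: u64 = 1_000_000_007;
    mul(&mut a, &b, modulo);
    a
}
```

Calling println!("a: {:?}", run_example()) from main prints the eight reduced values, each of which is smaller than the modulus.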
As seen on https://godbolt.org/z/h8zfadz3d, even with optimizations turned on and the target CPU set to native, there are no SIMD instructions in the output; those should start with v for vector.
I understand that this mod_mul64 implementation may not be SIMD-friendly. What would be an easy way to modify it so that it gets vectorized automatically?