ARM: fix 32x32->64 multiply. The former version split the inputs into 16-bit halves, used four mul instructions as 16x16->32 multiplies, and reassembled the pieces. (It also got it wrong, clobbering the condition codes without declaring so.) Instead, just use umull, a 32x32->64 multiply that the ARM documentation says is present in all versions (and that works on my test host, reassuringly).
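
For reference, a portable C sketch of the two approaches described above. This is not the patched code: the function names mul32x32_split and mul32x32_direct are hypothetical, and the real change is in ARM assembly rather than C. The first function mirrors the old 16-bit decomposition; the second is the plain widening multiply, which a compiler targeting ARM would typically (though this is an assumption, not something checked here) lower to a single umull.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the patched source. */

/* Old approach: four 16x16->32 partial products, then reassemble. */
static uint64_t mul32x32_split(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;

    uint32_t ll = al * bl;   /* contributes at bit 0  */
    uint32_t lh = al * bh;   /* contributes at bit 16 */
    uint32_t hl = ah * bl;   /* contributes at bit 16 */
    uint32_t hh = ah * bh;   /* contributes at bit 32 */

    /* The middle terms can carry into the high word, so sum in 64 bits. */
    return ((uint64_t)hh << 32)
         + ((uint64_t)lh << 16)
         + ((uint64_t)hl << 16)
         + (uint64_t)ll;
}

/* New approach: one widening multiply, the operation umull performs. */
static uint64_t mul32x32_direct(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

Both functions should return the same value for every input pair; the split version is only worth keeping on a target with no widening multiply.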