ARM: fix 32x32->64 multiply.

The previous version split each input into 16-bit halves, used four mul
instructions as 16x16->32 multiplies, and reassembled the partial
products.  (It was also wrong: it corrupted the condition codes without
declaring that it did so.)  Instead, just use umull, a 32x32->64
multiply which the ARM documentation says is present in all versions
(and which works on my test host, reassuringly).
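
For illustration only, here is a rough C sketch of the two approaches
described above.  The function names are hypothetical and are not the
identifiers used in the real code; the first function expresses the old
16-bit-split scheme in portable C rather than with mul instructions,
and the inline asm in the second is an assumption about how umull might
be invoked from GCC-style compilers, with a portable fallback elsewhere.

```c
#include <stdint.h>

/* Hypothetical names throughout; a sketch, not the actual patch. */

/* Old scheme: four 16x16->32 partial products, then reassemble. */
static void mul32_split16(uint32_t a, uint32_t b,
                          uint32_t *hi, uint32_t *lo)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;

    uint32_t ll = al * bl;   /* contributes to bits  0..31 */
    uint32_t lh = al * bh;   /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;   /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;   /* contributes to bits 32..63 */

    /* Sum everything landing in bits 16..31; the excess above 16 bits
     * is the carry into the high word. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);

    *lo = (ll & 0xFFFF) | (mid << 16);
    *hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* New scheme: one 32x32->64 multiply. */
static void mul32_wide(uint32_t a, uint32_t b,
                       uint32_t *hi, uint32_t *lo)
{
#if defined(__GNUC__) && defined(__arm__) && \
    (!defined(__thumb__) || defined(__thumb2__))
    uint32_t l, h;
    /* Plain umull (no 's' suffix) leaves the condition codes alone,
     * so no "cc" clobber is needed.  Early-clobber outputs keep the
     * destination registers distinct from the inputs. */
    __asm__("umull %0, %1, %2, %3"
            : "=&r"(l), "=&r"(h)
            : "r"(a), "r"(b));
    *lo = l;
    *hi = h;
#else
    /* Portable fallback for non-ARM hosts. */
    uint64_t p = (uint64_t)a * b;
    *lo = (uint32_t)p;
    *hi = (uint32_t)(p >> 32);
#endif
}
```

Both functions should return the same (hi, lo) pair for any inputs; the
second simply gets there in one instruction on ARM.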