alexrp’s blog

ramblings usually related to software

Recent Improvements to Mono on ARM

As part of my ‘make Mono awesome on ARM’ quest, I’ve made some improvements to Mono recently, which may be of interest to anyone using Mono on ARM boards or embedded devices.

All of these features have landed in Git, though not all of them are present in the latest release. If you want them now, you will need to build Mono from the master branch.

Build System Sanitation

Mono’s build system has traditionally done some fairly insane things when the target is set to ARM.

We used to detect the target FPU by attempting to execute a program using VFP instructions in the configure process. While this works fine on native ARM hardware, it doesn’t work at all when cross-compiling. It is also not what most people want because it completely ignores what FPU the GCC toolchain is set up to use (so it could easily screw things up for Linux distro package maintainers for instance).

We now detect the target’s FPU by using preprocessor logic that goes somewhat like this:

#if defined(__ARM_PCS_VFP)
/* -mfloat-abi=hard */
#elif !defined(__SOFTFP__)
/* -mfloat-abi=softfp */
#else
/* -mfloat-abi=soft */
#endif

It looks a bit convoluted, but it’s necessary due to the way GCC behaves with the various FPU configurations. Also, we have a special case for iOS because the GCC shipped with it doesn’t follow the logic that the regular GCC does. On iOS, we always assume -mfloat-abi=softfp because all iOS devices have a VFP unit.

(For an explanation of the floating point ABIs, see the section on dynamic VFP further into the article.)
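To make the preprocessor logic above concrete, here is a minimal sketch that turns it into a function you can compile with the toolchain you care about. The function name target_float_abi and the "not-arm" result are my own additions for illustration and are not part of Mono's configure script.

```c
/*
 * Hypothetical helper (not from Mono): reports which floating point ABI
 * the compiler that built this file is targeting, using the same macro
 * checks as the snippet above.
 */
const char *target_float_abi(void)
{
#if !defined(__arm__)
    /* The macros below are only meaningful for 32-bit ARM targets. */
    return "not-arm";
#elif defined(__ARM_PCS_VFP)
    /* -mfloat-abi=hard: floating point values are passed in VFP registers. */
    return "hard";
#elif !defined(__SOFTFP__)
    /* -mfloat-abi=softfp: soft calling convention, VFP instructions allowed. */
    return "softfp";
#else
    /* -mfloat-abi=soft: everything is done in software. */
    return "soft";
#endif
}
```

Because the answer comes from the compiler's predefined macros rather than from running a test program, this kind of check behaves the same when cross-compiling, as long as it is built with the target compiler rather than the host one.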

Another problem we had when detecting the FPU is that we simply executed plain gcc to do so. This is very wrong because the compiler executable name can be different – especially when cross-compiling – and so we could end up querying the host compiler instead of the target compiler. This is now fixed.

In a similar vein to the old FPU detection, we used to detect ARM v6+ by executing a small program that uses an ARM v6 mcr instruction. Again, this is very hostile to cross-compilation and also completely disregards the toolchain configuration. We now check the various __ARM_ARCH_...__ preprocessor symbols defined by the compiler. For example, __ARM_ARCH_6ZK__ means ARM v6, while __ARM_ARCH_7A__ means ARM v7, and so on (there are many variations).
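A rough sketch of what such a check can look like is shown below. The function name target_arm_version is hypothetical, and the macro list is deliberately incomplete (as noted, there are many variations), so treat it as an illustration of the approach rather than the exact set Mono tests.

```c
/*
 * Hypothetical helper (not from Mono): maps the compiler's __ARM_ARCH_...__
 * macros to a major architecture version. Returns 0 when none of the
 * macros listed here is defined (for example on a non-ARM target).
 */
int target_arm_version(void)
{
#if defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || \
    defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) || \
    defined(__ARM_ARCH_7S__)
    return 7;
#elif defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || \
      defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || \
      defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__)
    return 6;
#elif defined(__ARM_ARCH_5__) || defined(__ARM_ARCH_5T__) || \
      defined(__ARM_ARCH_5E__) || defined(__ARM_ARCH_5TE__) || \
      defined(__ARM_ARCH_5TEJ__)
    return 5;
#elif defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__)
    return 4;
#else
    return 0;
#endif
}
```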

Finally, various explicit -DHAVE_ARMV6 and -DARM_FPU_... preprocessor arguments that we passed depending on the target triple have been removed in favor of the sanitized detection of target FPU and ARM version.

Improved Hardware Feature Detection

It used to be that Mono would do a couple of checks via /proc/cpuinfo to get an idea of what ARM version is in use, and that’s about it. And this was only done on Linux.

First of all, using /proc/cpuinfo to detect hardware features is always problematic because it doesn’t work under QEMU. That file is generated by the host kernel, so if you’re running Mono in an ARM chroot by using QEMU and the host kernel is, say, an x86 kernel, the values Mono will find will seem like complete garbage.

So, instead, Mono now uses the Linux auxiliary vector to detect hardware features. The kernel exposes it to each process as the /proc/self/auxv file. Don’t try to cat it (it’s binary) - use LD_SHOW_AUXV=1 /bin/true instead. Using the auxiliary vector is better because QEMU gets a chance to decide what information is provided. If you execute the aforementioned command on an x86 machine, you’ll see something like this:

alexrp@Zor ~ $ LD_SHOW_AUXV=1 /bin/true
...
AT_HWCAP:        bfebfbff
...
AT_PLATFORM:     x86_64

Now if we switch to an ARM chroot using QEMU, we’ll get this:

root@Zor ~ # LD_SHOW_AUXV=1 /bin/true
...
AT_HWCAP:    swp half thumb fastmult fpa vfp thumbee neon
...

Two things are worth noting:

  1. No AT_PLATFORM entry is present.
  2. The AT_HWCAP entry has ARM-specific info.

Ideally, QEMU would have provided an AT_PLATFORM entry saying something like v7 (as would be the case on real ARM hardware), but not providing it at all also works in that Mono will just be conservative and only generate ARM v4 code.

The AT_HWCAP entry is also provided by QEMU and usually is consistent with what hardware features QEMU is capable of emulating, so we can safely look at it and see that it supports e.g. VFP and therefore generate VFP code.
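The two decisions described above (fall back to ARM v4 when AT_PLATFORM is missing, and trust the VFP bit in AT_HWCAP) can be sketched roughly as follows. This is not Mono's code: it uses glibc's getauxval instead of parsing /proc/self/auxv by hand, the function and macro names are hypothetical, and the assumption that AT_PLATFORM looks like "v7l" on real hardware is mine.

```c
#include <stddef.h>

#if defined(__linux__)
#include <sys/auxv.h> /* getauxval, AT_HWCAP, AT_PLATFORM */

/* Raw values from the auxiliary vector; getauxval returns 0 if absent. */
unsigned long read_hwcap(void)
{
    return getauxval(AT_HWCAP);
}

const char *read_platform(void)
{
    return (const char *) getauxval(AT_PLATFORM);
}
#endif

/* Bit positions Linux uses for 32-bit ARM HWCAP flags (local names). */
#define SKETCH_HWCAP_THUMB (1UL << 2)
#define SKETCH_HWCAP_VFP   (1UL << 6)
#define SKETCH_HWCAP_NEON  (1UL << 12)

/* Returns 1 if the HWCAP bits advertise a VFP unit, 0 otherwise. */
int hwcap_has_vfp(unsigned long hwcap)
{
    return (hwcap & SKETCH_HWCAP_VFP) != 0;
}

/*
 * Parses an AT_PLATFORM string such as "v7l". A missing or unrecognised
 * value yields 4, mirroring the conservative behaviour described above.
 */
int arm_version_from_platform(const char *platform)
{
    if (platform == NULL || platform[0] != 'v')
        return 4;
    if (platform[1] >= '4' && platform[1] <= '7')
        return platform[1] - '0';
    return 4;
}
```

On an ARM Linux system, hwcap_has_vfp (read_hwcap ()) and arm_version_from_platform (read_platform ()) would combine the pieces. Keeping the parsing separate from the reading makes the logic easy to exercise on any machine.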

Another thing that’s been improved is that Mono now detects the ARM version on iOS too. This means that a Mono binary compiled for ARM v6 that executes on v7 or v7s hardware will use v7 and v7s instructions. Similarly, a v7 binary will use v7s instructions if executed on a v7s device. This has always been the case on Linux and Android - we just do it on iOS too, now.

Finally, a new MONO_VERBOSE_HWCAP environment variable has been added to Mono that makes it print the hardware features it has detected:

zor@rpi ~/mono $ MONO_VERBOSE_HWCAP=1 runtime/mono-wrapper -V
mono_hwcap_arm_is_v5 = 1
mono_hwcap_arm_is_v6 = 1
mono_hwcap_arm_is_v7 = 0
mono_hwcap_arm_is_v7s = 0
mono_hwcap_arm_has_vfp = 1
mono_hwcap_arm_has_thumb = 1
mono_hwcap_arm_has_thumb2 = 0
Mono JIT compiler version 3.3.0 (master/e0f965f Fri Jul 19 10:42:35 CEST 2013)
Copyright (C) 2002-2012 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
        TLS:           __thread
        SIGSEGV:       normal
        Notifications: epoll
        Architecture:  armel,vfp+fallback
        Disabled:      none
        Misc:          softdebug
        LLVM:          supported, not enabled.
        GC:            sgen

This is useful for finding out whether Mono is making full use of your hardware.
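For the curious, a diagnostic switch like this takes very little code. The sketch below is only an illustration of the idea: the struct, the function name, and the choice to treat any set value as "on" are assumptions, and only the printed names are taken from the output above.

```c
#include <stdio.h>

/* Hypothetical container for the detected features. */
struct hwcap_flags {
    int is_v5, is_v6, is_v7, is_v7s;
    int has_vfp, has_thumb, has_thumb2;
};

/*
 * Prints the flags when verbose_env is non-NULL. The caller is expected to
 * pass getenv ("MONO_VERBOSE_HWCAP"). Returns the number of lines printed.
 */
int dump_hwcap(const char *verbose_env, const struct hwcap_flags *f, FILE *out)
{
    if (verbose_env == NULL)
        return 0;

    fprintf(out, "mono_hwcap_arm_is_v5 = %d\n", f->is_v5);
    fprintf(out, "mono_hwcap_arm_is_v6 = %d\n", f->is_v6);
    fprintf(out, "mono_hwcap_arm_is_v7 = %d\n", f->is_v7);
    fprintf(out, "mono_hwcap_arm_is_v7s = %d\n", f->is_v7s);
    fprintf(out, "mono_hwcap_arm_has_vfp = %d\n", f->has_vfp);
    fprintf(out, "mono_hwcap_arm_has_thumb = %d\n", f->has_thumb);
    fprintf(out, "mono_hwcap_arm_has_thumb2 = %d\n", f->has_thumb2);
    return 7;
}
```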

Dynamic Use of Vector Floating Point

On ARM, there are two software floating point ABIs: soft and softfp. The former ABI uses full software emulation for all floating point operations, and passes floating point values in core registers and on the stack. The latter adheres to the soft ABI but uses Vector Floating Point (VFP) instructions to perform floating point computations, resulting in a significant performance boost.

Previously, if Mono was compiled for the soft ABI, it would always use software floating point, even if the hardware it was executed on actually had a VFP unit.

You might wonder why that’s a problem in practice - surely people will just do the right thing and compile Mono for the ABI they need, right? It turns out that it’s not that simple. For example, Debian’s armel distribution ships with a GCC that is configured for the soft ABI, and all of its packages are compiled for that ABI. Most people don’t think about this, and end up with a Mono that performs worse than it has to.

Another example is Android. Here at Xamarin, we ship two versions of Mono compiled for ARM in the Xamarin.Android product: one for ARM v5 using the soft ABI and one for ARM v7 using the softfp ABI. The v5 build is used on v5 and v6 devices, and even if those devices have a VFP unit, it won’t be used. Only Android for ARM v7 guarantees that a VFP unit is present.

This is now fixed. The improved feature detection described above, combined with an overhaul of the way the JIT treats software floating point targets, means that we can now detect whether the hardware provides a VFP unit and use it if it does, even if Mono is compiled for the soft ABI.

The way this actually works is that Mono now always assumes that the hardware has VFP, even if compiled for the soft ABI. During initialization, Mono checks if the host hardware actually has a VFP unit, and if not, falls back to the software floating point code paths. That is to say, software floating point is now a ‘second-class citizen’ that is only compiled in as a fallback mechanism in case no VFP unit could be found. Note, however, that Mono compiled for softfp will not contain a fallback (it would make no sense, since Mono itself wouldn’t run on soft targets).
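The decision made during initialization can be summarized in a few lines. This is a simplified sketch with hypothetical names (fp_mode, select_fp_mode). The real JIT selects code generation paths rather than returning an enum.

```c
/* Hypothetical result of the startup check. */
enum fp_mode {
    FP_MODE_VFP,           /* emit VFP instructions (the default assumption) */
    FP_MODE_SOFT_FALLBACK, /* no VFP found: use the software code paths */
    FP_MODE_UNSUPPORTED    /* softfp build on VFP-less hardware: no fallback */
};

/*
 * built_for_soft_abi: non-zero if the runtime was compiled for the soft ABI
 *                     (and therefore contains the software fallback).
 * hw_has_vfp:         non-zero if feature detection found a VFP unit.
 */
enum fp_mode select_fp_mode(int built_for_soft_abi, int hw_has_vfp)
{
    if (hw_has_vfp)
        return FP_MODE_VFP;

    /* Only soft ABI builds carry the software floating point paths. */
    if (built_for_soft_abi)
        return FP_MODE_SOFT_FALLBACK;

    /* A softfp binary could not run on such hardware in the first place. */
    return FP_MODE_UNSUPPORTED;
}
```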

Initially assuming that VFP is present may seem like a somewhat odd way to do things, but as it turns out, the vast majority of ARM devices have VFP, so it’s actually a fairly sane default - like how we assume all x86 devices in this day and age have an FPU even though it’s not actually guaranteed on i386 machines per the Intel manual.

Atomics and Architecture Version

ARM is a bit of a mess as far as SMP support goes. The first architecture version to have proper SMP support was ARM v6. ARM v7 added some new convenient instructions like dmb to issue a memory barrier (which was only possible via the somewhat convoluted mcr instruction in v6). However, v4 and v5 did not have any support for SMP at all. What this means in practical terms is that, in the past, if you compiled Mono for ARM v4 or v5 and then executed it on an SMP system with multiple cores, everything would blow up because Mono wouldn’t know how to do atomic operations since it’s compiled for an architecture version that just doesn’t have them.

Generally, the problem here is that it’s very common to compile Mono for an old version of the ARM architecture and execute it on much newer hardware. For example, as I mentioned earlier, the GCC included in Debian’s armel distribution targets ARM v4 by default. Another example is the Android NDK, which targets v5 by default. In practice, most ARM hardware today is v6 or v7. There are a couple of Android devices out there that are actually v5 (believe it or not), but Mono will work fine on those.

So the solution to this problem is to do something similar to what we do for VFP: Detect the actual architecture version we’re running on and use newer SMP instructions if available.

Previously, Mono’s atomics looked something like this:

gint32 InterlockedAdd (gint32 *addr, gint32 val)
{
#if defined(HAVE_ARMV6) || defined(HAVE_ARMV7)
    /* v6/v7 code... */
#else
    /* v4/v5 code... */
#endif
}

Since HAVE_ARMV6 and HAVE_ARMV7 are compile-time things, this would result in the v4/v5 code being used if the compiler was configured for anything below ARM v6.

What we want is this:

gint32 InterlockedAdd (gint32 *addr, gint32 val)
{
#if defined(HAVE_ARMV6) || defined(HAVE_ARMV7)
    /* v6/v7 code... */
#else
    if (mono_hwcap_arm_is_v6 || mono_hwcap_arm_is_v7) {
        /* v6/v7 code... */
    } else {
        /* v4/v5 code... */
    }
#endif
}

(We still have the #if there because we don’t want a pointless branch on ARM v6 and v7 where we always have SMP instructions available.)

We could have easily implemented this in Mono, but as it turns out, GCC has a bunch of convenient intrinsics that already do that for us! Those are the __sync_* functions, which are documented in the GCC manual’s section on built-in functions for atomic memory access.

Although it isn’t clearly documented, GCC makes sure that the code it emits for those functions works on both the target ARM version and all newer ARM versions. So on ARM v6 and up, GCC will just generate the obvious code using the native instructions available in that architecture version. For any older architecture version, it delegates to various helpers in the kernel such as __kuser_cmpxchg and __kuser_memory_barrier which sit at 0xffff0fc0 and 0xffff0fa0 respectively. These functions are compiled for the actual ARM version the hardware is running, since they are part of the kernel itself, so they can do the right thing for the architecture version that is actually in use. The kernel maps them into every process at those fixed addresses (similar in spirit to a vDSO), so they are fairly efficient to call.

For the above reasons, Mono uses GCC’s __sync_* intrinsics for atomics on most targets today (x86, ARM, PowerPC, MIPS).
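As a minimal sketch of what that looks like, the function from earlier collapses to a single intrinsic call. The lower-case names and the use of int32_t in place of gint32 are mine, and I am assuming the function returns the updated value (as .NET’s Interlocked.Add does). This is not a copy of Mono’s atomic.h.

```c
#include <stdint.h>

/* Atomically adds val to *addr and returns the new value. */
int32_t interlocked_add(int32_t *addr, int32_t val)
{
    return __sync_add_and_fetch(addr, val);
}

/*
 * Atomically stores exch into *addr if *addr equals comparand.
 * Returns the value *addr held before the operation.
 */
int32_t interlocked_compare_exchange(int32_t *addr, int32_t exch,
                                     int32_t comparand)
{
    return __sync_val_compare_and_swap(addr, comparand, exch);
}
```

There is no #if and no runtime branch here: picking the right instructions (or kernel helper) for the hardware is left entirely to the compiler.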

Conclusion

The TL;DR of all of the above is:

  • configure.in now respects toolchain configuration with regard to target architecture version and target FPU, and invokes the correct compiler executable to detect this information.
  • Hardware feature detection is now done via the Linux auxiliary vector instead of /proc/cpuinfo, so Mono works under QEMU. Also, ARM version detection is now done on iOS too. A new MONO_VERBOSE_HWCAP environment variable has been added to print hardware feature information.
  • The JIT will now actively make use of a VFP unit when one is present, even when Mono is compiled for the soft ABI. This results in significantly better floating point performance on systems that don’t yet use the hard ABI.
  • Mono will now work properly on SMP-capable ARM systems even when compiled for non-SMP architecture versions such as ARM v4 and v5.

Just build from the Git master branch to get all of the above.

But what about hard float support?

We (Xamarin) are aware that hard float support in Mono is very important for platforms like the Raspberry Pi and, generally, all new ARM boards.

Hard float support is coming. In fact, I’m working on it as I publish this article, so it shouldn’t take long before it lands in master.