alexrp’s blog

ramblings usually related to software

Flect: The Three Pointer Types

I recently spent some time finalizing the design of the pointer types in Flect. The resulting pointer system borrows some concepts and syntax from Rust but is very different semantically.

Introduction

Most programming languages that have some form of heap memory management use one of the following approaches to differentiate between data on the heap and data that is stored in place (i.e. on the stack):

  1. Types are declared to either be by value or by reference. Types that are passed by reference have more capabilities than types that are passed by value. Value types can be boxed to get reference semantics. This is C# and
  2. Only primitive types that are built into the language are passed by value. All other types are passed by reference. Primitive types can be boxed to get reference semantics. This is Java and Scala.
  3. All types are on the heap. The question of whether a type is passed by value or by reference is not actually relevant due to immutability. This is Erlang and Elixir.
  4. All data is passed by value by default. It must be explicitly placed in the heap and must be passed around with pointers to gain reference semantics. This is C and C++.

All of these clearly have advantages and disadvantages. They all make various assumptions about how programmers are going to write code and manage their data in a particular language.

The first approach is sensible in a language with a strong focus on object-oriented programming while also allowing lightweight types for things like dates, time spans, vectors, tuples, etc. It has the weakness that a type declared to be passed by reference can never get by-value semantics and vice versa. In other words, it assumes fairly naive memory management. It is also worth noting that value types aren’t first-class (in the sense that they don’t get full OO capabilities).

(C# does allow passing things as ref and out parameters, but this does not mean first-class reference semantics for value types.)

The second approach is really just a variation of the first one with the difference that the programmer cannot declare custom value types. The assumption here is that object orientation will be used heavily and primitive types are just passed by value because anything else would be too slow. This approach needs a very good garbage collector to work in practice.

(Having said that, there have been talks about turning even the primitive types into objects in a future version of Java. Make of that what you will…)

The third approach is actually very simple, conceptually. Since things that have been created can never be changed, it doesn’t ‘matter’ how data is passed around; all we can do is examine it, either way. This makes it very easy to think about most problems, but makes some problems (think graph algorithms) harder to solve. This approach also requires a very clever garbage collector.

(The fact that data is immutable can be a significant advantage to a GC - it can avoid needlessly scanning heap data multiple times, for instance.)

The last approach is quite possibly the most flexible. The programmer has full control over how things are passed around. This, however, is also the most error-prone approach because raw pointers can trivially be mixed up with GC pointers.

We can do better. In Flect, we can get the best of all worlds. Mostly.

Pointer Categories

First of all, we have three distinct pointer categories:

  • @ pointers: These are safe, managed pointers that point to GC boxes. They rely on the garbage collector, but can never be used in a way that results in a segmentation fault (outside of unsafe regions, anyway).
  • * pointers: These are plain old C-style pointers and are as unsafe as they sound. They can be passed around and manipulated (think pointer arithmetic) in safe code, but can only be dereferenced in unsafe regions.
  • & pointers: These are referred to as generalized pointers. A generalized pointer can point to both managed and unsafe memory. They carry mostly the same guarantees as @ pointers and are meant to be the bridge between safe and unsafe memory.

The first two categories are fairly straightforward. For instance, @int would be an integer stored on the GC heap. *int would be a pointer to an integer somewhere in external memory.

The interesting pointer category is &. Unlike what the symbol may suggest, it is not equivalent to a reference in C++. Other than the fact that you cannot directly create & pointers (you must go through @ or *) it is a first-class pointer category. & pointers exist because most code can be written without making any assumptions about where data lives.

For example, we might write some code that computes the dot product of two three-dimensional vectors:

pub struct Vec3 {
    pub x : f32;
    pub y : f32;
    pub z : f32;
}

pub fn vec3_dot(v1 : @Vec3, v2 : @Vec3) -> f32 {
    v1.x * v2.x + v1.y * v2.y + v1.z * v2.z;
}

pub fn test() -> f32 {
    let v1 = @Vec3 { x = 1.0, y = 2.0, z = 3.0 };
    let v2 = @Vec3 { x = 3.0, y = 2.0, z = 1.0 };
    vec3_dot(v1, v2);
}

Clearly, vec3_dot shouldn’t need to care about where it gets its data from, as long as it gets some data. So we can make it work both on @ and * pointers by changing its declaration to:

# ...
pub fn vec3_dot(v1 : &Vec3, v2 : &Vec3) -> f32 {
# ...

Note that data in Flect has by-value semantics if not wrapped in a pointer category or passed by ref. So if you have no reason to put something on the heap, you don’t have to.
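
As a rough C analogue of the generalized signature above: a plain C pointer parameter is similarly indifferent to where its data lives. This is only a sketch. `malloc` stands in for the GC heap here (an assumption, since C has no managed boxes), and `example_mixed_sources` is a made-up name for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    float x;
    float y;
    float z;
} Vec3;

/* Works on any Vec3, no matter how the caller allocated it. */
float vec3_dot(const Vec3 *v1, const Vec3 *v2) {
    return v1->x * v2->x + v1->y * v2->y + v1->z * v2->z;
}

/* Hypothetical usage: one operand lives on the stack, the other on the heap. */
float example_mixed_sources(void) {
    Vec3 on_stack = { 1.0f, 2.0f, 3.0f };
    Vec3 *on_heap = malloc(sizeof *on_heap);
    assert(on_heap != NULL);
    on_heap->x = 3.0f;
    on_heap->y = 2.0f;
    on_heap->z = 1.0f;

    float result = vec3_dot(&on_stack, on_heap);
    free(on_heap);
    return result;
}
```

The difference in Flect is that the callee additionally gets most of the safety guarantees of a managed pointer, which a bare C pointer does not provide.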

Safety Guarantees

As mentioned previously, @ pointers are always safe and never require an unsafe region to be used. * pointers can be safely passed around and manipulated but require an unsafe context to be dereferenced.

So what does an & pointer guarantee?

The truth is, it guarantees slightly less safety than an @ pointer. An @ pointer is always guaranteed to point to valid memory (unless tinkered with in an unsafe block, but if you do that, you’re asking for trouble). But since * pointers can easily point to invalid memory, so can & pointers. Now, the language makes a reasonable effort to ensure that a cast from e.g. *int to &int is safe by inserting a null check. This will catch the most common errors but still allows invalid memory to slip through.
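
To make the null check concrete, here is a minimal sketch in C of what a checked cast from a raw pointer to a generalized pointer might look like. The type `general_int_ptr` and the function `raw_to_general` are hypothetical names for illustration and not part of any Flect runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of &int: an address that has passed a null check. */
typedef struct {
    int *addr;
} general_int_ptr;

/* Models a cast from *int to &int.
   Returns 1 and fills *out if raw is non-null, 0 otherwise.
   A dangling but non-null pointer still passes this check. */
int raw_to_general(int *raw, general_int_ptr *out) {
    if (raw == NULL) {
        return 0;
    }
    out->addr = raw;
    return 1;
}

/* Usage: a valid address is accepted, NULL is rejected. */
void raw_to_general_example(void) {
    int value = 7;
    general_int_ptr g;
    assert(raw_to_general(&value, &g) == 1);
    assert(*g.addr == 7);
    assert(raw_to_general(NULL, &g) == 0);
}
```

The check only rules out null. Nothing in it can tell a live allocation from freed memory, which is exactly the gap described above.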

Still, & pointers are considered safe. This may seem odd, given that accessing one can result in a segmentation fault. There are two reasons for classifying them as safe anyway:

  • Practicality. If & pointers were unsafe, it’s probably reasonable to assume that either ~98% of Flect code would be marked unsafe or programmers would not use & pointers at all. Worse yet, they might label the language too impractical to work with. I know I would.
  • If you mess around with unsafe blocks, you’re mostly on your own. As said, the compiler will try to help by inserting null checks when casting * to &, but the assumption is that if an unsafe block executes or any * pointer is used during the lifetime of a Flect process, memory integrity of the entire process is jeopardized anyway.

In short, a trade-off between practicality and perfect safety has been made.

That being said, tools like AddressSanitizer can be used to instrument Flect code in order to find invalid memory accesses.

Conversions

A number of conversions are allowed between the various pointer categories:

  • @T to *T: Requires unsafe context. Results in a pointer that points into the managed header of the source pointer. Can easily blow things up if not used with care.
  • @T to &T: Safe conversion. Performed implicitly wherever an @T is fed to an &T destination. Results in a pointer offset by the managed header size.
  • *T to @T: Requires unsafe context. Assumes that the source pointer points to the start of a managed header. This is the inverse of @T to *T and is just as dangerous. Null check is inserted.
  • *T to &T: Mostly safe conversion (see previous section). Assumes that the source pointer points to valid memory. Null check is inserted.
  • &T to @T: Requires unsafe context. Assumes that the source pointer points to the first byte after the managed header. Subtracts the managed header size from the pointer.
  • &T to *T: Requires unsafe context. Assumes that the source pointer was originally a raw pointer.

The last two conversions in particular are dangerous and in the general case should not be used. Code that operates on & pointers cannot really know whether the pointer was originally an @ pointer or a * pointer.

The conversions between @ and * are meant to allow hacking on ABI-level things, should this be necessary. It may seem insane to allow this, but Flect is a systems language, so this kind of stuff needs to be possible.
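
The header arithmetic behind the @T to &T and &T to @T conversions can be sketched in C. The header layout below is an assumption made up for this example (the actual managed header is not described in this post), as are all the names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical managed header; the real fields and size are assumptions. */
typedef struct {
    uint32_t type_id; /* stand-in for a reference to RTTI */
    uint32_t gc_bits; /* stand-in for collector bookkeeping */
} managed_header;

/* A GC box: the header is followed directly by the payload. */
typedef struct {
    managed_header header;
    float payload[3];
} vec3_box;

/* @T to &T: step over the header so the result points at the payload. */
void *managed_to_general(void *managed) {
    return (unsigned char *)managed + sizeof(managed_header);
}

/* &T to @T: step back over the header. Only meaningful if the pointer
   really originated from a GC box, which the callee cannot verify. */
void *general_to_managed(void *general) {
    return (unsigned char *)general - sizeof(managed_header);
}

/* Usage: the two conversions are inverses of each other for a real box. */
void conversion_example(void) {
    vec3_box box = { { 1u, 0u }, { 1.0f, 2.0f, 3.0f } };
    void *general = managed_to_general(&box);
    assert(general == (void *)box.payload);
    assert(general_to_managed(general) == (void *)&box);
}
```

Applying `general_to_managed` to a pointer that started life as a raw pointer would yield an address eight bytes before unrelated memory, which is why that direction requires unsafe context.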

Garbage Collection

One important aspect of the way Flect’s pointer types work is that, even though managed and raw pointers can be mixed (as generalized pointers), the GC can still make reasonably precise decisions about reachability.

Since Flect emits RTTI, the GC can trivially figure out what fields in a heap structure contain possible GC pointers. It can decide how to scan a field based on its pointer category:

  • @ pointer: A field of this pointer category is guaranteed to point directly at the managed header of the relevant object. It can be scanned precisely, and the GC can freely dereference the pointer (it is guaranteed to never be null).
  • * pointer: A field of this pointer category either contains a pointer allocated via some non-GC memory manager, a null value, or it contains a managed pointer that has been converted to a raw pointer. It can be scanned precisely (but must go through a null check and an object lookup in case the field contains a null or non-GC pointer).
  • & pointer: A field of this pointer category either contains a pointer to some sort of non-GC memory, or it contains a pointer to the interior of a managed box (offset exactly by the managed header size). The GC must do a lookup of ptr - header to scan it.

Note that the rule for * pointers does not allow interior pointers to exist in the heap. This means that addresses obtained via the & operator (of e.g. a structure field) must be kept on the stack to keep the containing object alive if no other strong references to it exist. This is one reason that this operator requires unsafe context.

Thread stacks and registers are scanned fully conservatively to allow advanced pointer arithmetic scenarios.
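
The per-category rules for heap fields can be modeled as a small classification function. This is a sketch in C under stated assumptions: `HEADER_SIZE`, `scan_result`, and `classify_field` are hypothetical, and the bounds guard in the & case is my own addition to keep the arithmetic from wrapping.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed header size; the real value is not given in this post. */
#define HEADER_SIZE ((uintptr_t)8)

typedef enum { PTR_MANAGED, PTR_RAW, PTR_GENERAL } ptr_category;

typedef struct {
    uintptr_t candidate; /* address to mark or look up; 0 means skip */
    int needs_lookup;    /* 1 if the object table must confirm a GC box */
} scan_result;

/* Decides how a heap field should be scanned based on its category. */
scan_result classify_field(ptr_category cat, uintptr_t value) {
    scan_result r = { 0, 0 };
    switch (cat) {
    case PTR_MANAGED:
        /* Points directly at a managed header and is never null. */
        r.candidate = value;
        break;
    case PTR_RAW:
        /* May be null or point to non-GC memory, so check and look up. */
        if (value != 0) {
            r.candidate = value;
            r.needs_lookup = 1;
        }
        break;
    case PTR_GENERAL:
        /* Either non-GC memory or a box interior offset by the header. */
        if (value >= HEADER_SIZE) {
            r.candidate = value - HEADER_SIZE;
            r.needs_lookup = 1;
        }
        break;
    }
    return r;
}

/* Usage: a generalized pointer is traced back to its header address. */
void classify_example(void) {
    scan_result g = classify_field(PTR_GENERAL, (uintptr_t)0x1008);
    assert(g.candidate == (uintptr_t)0x1000);
    assert(g.needs_lookup == 1);
}
```

Only the @ case can skip the object lookup, which is the slight extra cost the other two categories impose on the collector.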

Type Classes

Flect’s type class system doesn’t actually use pointer types to provide method dispatch. Instead, the this value (of type self) is simply passed with ref, which has semantics similar to ref in C#. Thus, Flect’s pointer types have no impact on abstraction capabilities.

Example:

pub struct MyStruct {
    pub x : i32;
}

pub trait Stringifiable {
    fn to_str(ref this : self) -> str;
}

pub impl Stringifiable for MyStruct {
    fn to_str(ref this : MyStruct) -> str {
        this.x.to_str();
    }
}

(I’ll talk more about type classes in a later post.)

Conclusion

The TL;DR is that Flect has two primary pointer categories - one that represents GC-managed memory, and one that represents raw/external memory. A generalized pointer category which can be obtained from both of those exists to make writing generic code easier.

The different categories have increasing levels of safety; * is the least safe, & is moderately safe, and @ is perfectly safe. The language requires unsafe context for most unsafe constructs.

The system complicates garbage collection slightly, but I think it’s an acceptable trade-off to make mixing managed and raw memory practical.

Type classes are entirely unaffected by this design, which means that programmers can use whatever pointer category makes the most sense for a task without worrying about losing type classes.

Hopefully, this system is flexible enough in practice and doesn’t make too many assumptions about where programmers want to keep their data.