I recently spent some time finalizing the design of the pointer types in Flect. The resulting pointer system borrows some concepts and syntax from Rust but is very different semantically.
Most programming languages that have some form of heap memory management use one of the following approaches to differentiate data on the heap and data that is in-place (i.e. on the stack):
- Types are declared to either be by value or by reference. Types that are
passed by reference have more capabilities than types that are passed by
value. Value types can be boxed to get reference semantics. This is C# and
- Only primitive types that are built into the language are passed by value. All other types are passed by reference. Primitive types can be boxed to get reference semantics. This is Java and Scala.
- All types are on the heap. The question of whether a type is passed by value or by reference is not actually relevant due to immutability. This is Erlang and Elixir.
- All data is passed by value by default. It must be explicitly placed in the heap and must be passed around with pointers to gain reference semantics. This is C and C++.
All of these clearly have advantages and disadvantages. They all make various assumptions about how programmers are going to write code and manage their data in a particular language.
The first approach is sensible in a language with high focus on object-oriented programming while also allowing lightweight types for things like dates, time spans, vectors, tuples, etc. It has the weakness that a type declared to be passed by reference can never get by-value semantics and vice-versa. In other words, it assumes fairly naive memory management. It is also worth noting that value types aren’t first-class (in the sense that they don’t get full OO capabilities).
(C# does allow passing things as ref and out parameters, but this does not mean first-class reference semantics for value types.)
The second approach is really just a variation of the first one with the difference that the programmer cannot declare custom value types. The assumption here is that object orientation will be used heavily and primitive types are just passed by value because anything else would be too slow. This approach needs a very good garbage collector to work in practice.
(Having said that, there have been talks about turning even the primitive types into objects in a future version of Java. Make of that what you will…)
The third approach is actually very simple, conceptually. Since things that have been created can never be changed, it doesn’t ‘matter’ how data is passed around; all we can do is examine it, either way. This makes it very easy to think about most problems, but makes some problems (think graph algorithms) harder to solve. This approach also requires a very clever garbage collector.
(The fact that data is immutable can be a significant advantage to a GC - it can avoid needlessly scanning heap data multiple times, for instance.)
The last approach is quite possibly the most flexible. The programmer has full control over how things are passed around. This, however, is also the most error-prone approach because raw pointers can trivially be mixed up with GC pointers.
We can do better. In Flect, we can get the best of all worlds. Mostly.
First of all, we have three distinct pointer categories:
- @ pointers: These are safe, managed pointers that point to GC boxes. They rely on the garbage collector, but can never be used in a way that results in a segmentation fault (outside of unsafe code).
- * pointers: These are plain old C-style pointers and are as unsafe as they sound. They can be passed around and manipulated (think pointer arithmetic) in safe code, but can only be dereferenced in an unsafe context.
- & pointers: These are referred to as generalized pointers. A generalized pointer can point to both managed and unsafe memory. They carry mostly the same guarantees as @ pointers and are meant to be the bridge between safe and unsafe memory.
The first two categories are fairly straightforward. For instance, @int would be an integer stored on the GC heap. *int would be a pointer to an integer somewhere in external memory.
The interesting pointer category is &. Unlike what the symbol may suggest, it is not equivalent to a reference in C++. Apart from a restriction that forces some operations to go through * instead, it is a first-class pointer category.
& pointers exist because most code can be
written without making any assumptions about where data lives.
For example, we might write a function, vec3_dot, that computes the dot product of two three-dimensional vectors.
vec3_dot shouldn’t need to care about where it gets its data from, as long as it gets some data. So we can make it work on both @ and * pointers by declaring its parameters as & pointers.
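As a rough analogue in C (this is not Flect syntax), the idea that vec3_dot should not care where its data lives might be sketched as follows. The vec3 type, its field names, and the dot_stack_and_heap helper are assumptions made up for illustration; plain C pointers stand in for & pointers here and give none of the safety guarantees discussed in this post.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical 3D vector type; the field names are assumptions. */
typedef struct {
    double x, y, z;
} vec3;

/* Takes plain pointers, so it does not care where the vectors live. */
static double vec3_dot(const vec3 *a, const vec3 *b)
{
    assert(a != NULL && b != NULL);
    return a->x * b->x + a->y * b->y + a->z * b->z;
}

/* Calls vec3_dot with one stack-allocated and one heap-allocated vector. */
static double dot_stack_and_heap(void)
{
    vec3 on_stack = {1.0, 2.0, 3.0};
    vec3 *on_heap = malloc(sizeof *on_heap);
    if (on_heap == NULL)
        return 0.0; /* allocation failed; nothing to compute */
    *on_heap = (vec3){4.0, 5.0, 6.0};

    double result = vec3_dot(&on_stack, on_heap);
    free(on_heap);
    return result;
}
```

The same function body serves both callers; only the call sites know whether a vector sits on the stack or on the heap.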
Note that data in Flect has by-value semantics if not wrapped in a pointer
category or passed by
ref. So if you have no reason to put something on the
heap, you don’t have to.
As mentioned previously,
@ pointers are always safe and never require an
unsafe region to be used.
* pointers can be safely passed around and
manipulated but require an
unsafe context to be dereferenced.
So what does an
& pointer guarantee?
The truth is, it guarantees slightly less safety than an @ pointer. An @ pointer is always guaranteed to point to valid memory (unless tinkered with in an unsafe block, but if you do that, you’re asking for trouble). But since * pointers can easily point to invalid memory, so can & pointers. Now, the language makes a reasonable effort to ensure that a cast from e.g. *int to &int is safe by inserting a null check. This will catch the most common errors but still allows invalid memory to slip through.
& pointers are considered safe. This must seem odd when considering
that accessing one can result in a segmentation fault. There are two reasons
why they are safe:
- Practicality. If & pointers were unsafe, it’s probably reasonable to assume that either ~98% of Flect code would be marked unsafe or programmers would not use & pointers at all. Worse yet, they might label the language too impractical to work with. I know I would.
- If you mess around with unsafe blocks, you’re mostly on your own. As said, the compiler will try to help by inserting null checks when casting * to &, but the assumption is that if an unsafe block executes or any * pointer is used during the lifetime of a Flect process, memory integrity of the entire process is jeopardized anyway.
In short, a trade-off between practicality and perfect safety has been made.
That being said, tools like AddressSanitizer can be used to instrument Flect code in order to find invalid memory accesses.
A number of conversions are allowed between the various pointer categories:
- @T to *T: Requires an unsafe context. Results in a pointer that points into the managed header of the source pointer. Can easily blow things up if not used with care.
- @T to &T: Safe conversion. Performed implicitly wherever an @T is fed to an &T destination. Results in a pointer offset by the managed header size.
- *T to @T: Requires an unsafe context. Assumes that the source pointer points to the start of a managed header. This is the inverse of @T to *T and is just as dangerous. Null check is inserted.
- *T to &T: Mostly safe conversion (see previous section). Assumes that the source pointer points to valid memory. Null check is inserted.
- &T to @T: Requires an unsafe context. Assumes that the source pointer points to the first byte after the managed header. Subtracts the managed header size from the pointer.
- &T to *T: Requires an unsafe context. Assumes that the source pointer was originally a raw pointer.
The last two conversions in particular are dangerous and in the general case should not be used. Code that operates on & pointers cannot really know whether the pointer was originally an @ pointer or a * pointer.
The conversions between @ and * are meant to allow hacking on ABI-level things, should this be necessary. It may seem insane to allow this, but Flect is a systems language, so this kind of stuff needs to be possible.
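The offset arithmetic behind these conversions can be sketched in C. This is only an illustration under assumptions: the gc_header layout, box_alloc (which uses malloc as a stand-in for a real GC allocator), and all function names are hypothetical, and the post does not specify the actual managed header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical managed header; the real layout is an assumption. */
typedef struct {
    size_t size;     /* payload size in bytes */
    uintptr_t mark;  /* placeholder for GC bookkeeping */
} gc_header;

/* Stand-in for a GC allocation: header followed directly by the payload. */
static gc_header *box_alloc(size_t payload_size)
{
    gc_header *box = malloc(sizeof(gc_header) + payload_size);
    if (box != NULL) {
        box->size = payload_size;
        box->mark = 0;
    }
    return box;
}

/* Managed to generalized: skip over the header to reach the payload. */
static void *managed_to_general(gc_header *box)
{
    assert(box != NULL);
    return (unsigned char *)box + sizeof(gc_header);
}

/* Generalized to managed: subtract the header size. Only valid if the
 * pointer really came from a managed box, which nothing here can verify. */
static gc_header *general_to_managed(void *p)
{
    assert(p != NULL);
    return (gc_header *)((unsigned char *)p - sizeof(gc_header));
}

/* Raw to generalized: the null check catches null, not dangling pointers.
 * Returns 1 and stores the result in *out on success, 0 on null input. */
static int raw_to_general(void *raw, void **out)
{
    if (raw == NULL)
        return 0;
    *out = raw;
    return 1;
}
```

Note how general_to_managed has no way of telling a payload pointer from an arbitrary raw address, which is exactly why that direction is dangerous.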
One important aspect of the way Flect’s pointer types work is that, even though managed and raw pointers can be mixed (as generalized pointers), the GC can still make reasonably precise decisions about reachability.
Since Flect emits RTTI, the GC can trivially figure out what fields in a heap structure contain possible GC pointers. It can decide how to scan a field based on its pointer category:
- @ pointer: A field of this pointer category is guaranteed to point directly at the managed header of the relevant object. It can be scanned precisely, and the GC can freely dereference the pointer (it is guaranteed to never be null).
- * pointer: A field of this pointer category either contains a pointer allocated via some non-GC memory manager, a null value, or it contains a managed pointer that has been converted to a raw pointer. It can be scanned precisely (but must go through a null check and an object lookup in case the field contains a null or non-GC pointer).
- & pointer: A field of this pointer category either contains a pointer to some sort of non-GC memory, or it contains a pointer to the interior of a managed box (offset exactly by the managed header size). The GC must do a lookup of ptr - header to scan it.
Note that the rule for
* pointers does not allow interior pointers to exist
in the heap. This means that addresses obtained via the
& operator (of e.g.
a structure field) must be kept on the stack to keep the containing object
alive if no other strong references to it exist. This is one reason that this
Thread stacks and registers are scanned fully conservatively to allow advanced pointer arithmetic scenarios.
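The per-category rules for heap fields might look roughly like this when modelled in C. Everything here is a toy under stated assumptions: addresses are plain integers, toy_heap is a hypothetical table of known box header addresses, HEADER_SIZE is an arbitrary value, and the guard against small & values is a defensive addition that the post does not describe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { PTR_MANAGED, PTR_RAW, PTR_GENERAL } ptr_category;

#define HEADER_SIZE 16u /* assumed managed header size */
#define MAX_BOXES 8

/* Toy heap: a table of header addresses of live managed boxes. */
typedef struct {
    uintptr_t boxes[MAX_BOXES];
    size_t count;
} toy_heap;

/* Object lookup: returns 1 if addr is the header address of a known box. */
static int heap_contains(const toy_heap *heap, uintptr_t addr)
{
    for (size_t i = 0; i < heap->count; i++)
        if (heap->boxes[i] == addr)
            return 1;
    return 0;
}

/* Returns the header address the field keeps alive, or 0 if none. */
static uintptr_t scan_field(const toy_heap *heap, ptr_category cat,
                            uintptr_t value)
{
    switch (cat) {
    case PTR_MANAGED:
        /* Points at a header and is never null: no lookup needed. */
        assert(value != 0);
        return value;
    case PTR_RAW:
        /* Null check, then a lookup in case it is a non-GC pointer. */
        if (value == 0)
            return 0;
        return heap_contains(heap, value) ? value : 0;
    case PTR_GENERAL:
        /* Look up ptr - header; guard against underflow first. */
        if (value < HEADER_SIZE)
            return 0;
        return heap_contains(heap, value - HEADER_SIZE)
                   ? value - HEADER_SIZE
                   : 0;
    }
    return 0;
}
```

Only the @ case avoids the lookup, which is the extra GC cost that mixing managed and raw memory brings with it.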
Flect’s type class system doesn’t actually use pointer types to provide method
dispatch. Instead, the
this value (of type
self) is simply passed with
ref which has similar semantics to the
ref in C#. Thus, Flect’s pointer
types have no impact on abstraction capabilities.
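A loose C analogue of dispatch that receives the value by reference, independent of where the value is stored, could look like this. It is a sketch only: shape_class, rect, and area_of are hypothetical names, a struct of function pointers stands in for a type class, and none of this reflects how Flect actually implements type classes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical value type used for illustration. */
typedef struct {
    int width, height;
} rect;

/* Stand-in for a type class: a dictionary of functions that receive the
 * value by reference. */
typedef struct {
    int (*area)(const void *self);
} shape_class;

/* Instance of the "class" for rect. */
static int rect_area(const void *self)
{
    const rect *r = self;
    return r->width * r->height;
}

static const shape_class rect_shape = { rect_area };

/* Dispatches through the dictionary; works for any storage location. */
static int area_of(const shape_class *cls, const void *self)
{
    assert(cls != NULL && self != NULL);
    return cls->area(self);
}
```

Calling area_of(&rect_shape, &some_rect) works the same way whether some_rect is a local variable or lives in allocated memory.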
(I’ll talk more about type classes in a later post.)
The TL;DR is that Flect has two primary pointer categories - one that represents GC-managed memory, and one that represents raw/external memory. A generalized pointer category which can be obtained from both of those exists to make writing generic code easier.
The different categories have increasing levels of safety; * is the least safe, & is moderately safe, and @ is perfectly safe. The language requires an unsafe context for most unsafe constructs.
The system complicates garbage collection slightly, but I think it’s an acceptable trade-off to make mixing managed and raw memory practical.
Type classes are entirely unaffected by this design, which means that programmers can use whatever pointer category makes the most sense for a task without worrying about losing type classes.
Hopefully, this system is flexible enough in practice and doesn’t make too many assumptions about where programmers want to keep their data.