<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://polymonster.co.uk/feed.xml" rel="self" type="application/atom+xml" /><link href="https://polymonster.co.uk/" rel="alternate" type="text/html" /><updated>2026-03-22T11:58:31+00:00</updated><id>https://polymonster.co.uk/feed.xml</id><title type="html">Alex Dixon</title><entry><title type="html">‘AI coding tools are powerful but we mustn’t let our own skills atrophy’</title><link href="https://polymonster.co.uk/blog/ai-atrophy" rel="alternate" type="text/html" title="‘AI coding tools are powerful but we mustn’t let our own skills atrophy’" /><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/ai-atrophy</id><content type="html" xml:base="https://polymonster.co.uk/blog/ai-atrophy"><![CDATA[<p>It started with the “don’t get left behind” brigade; somehow they managed to convince me that I was now missing out on developing critical new skills. People who hadn’t put in the hours to learn to code in the first place were now at the top of the field, leaving behind the rest of us who have put in multiple decades and tens of thousands of hours of effort to learn our craft. They want their PRs merged upstream to get credit on GitHub for code they didn’t write or even understand, flooding the pull request system to breaking point. I tried to ignore it, but the chatter was incessant and I had to take a look for myself.</p>

<p>I didn’t want the AI coding tools to be good. After all, I have dedicated a large part of my life to coding. It is much more than a job; it’s a hobby, a love, part of who I am. But coding is dead, it’s going to be over in 6 months. I’ve had to hear this daily for the last 3 years, and it is exhausting and inescapable; everywhere I look I see it. “The models will get better” they say, “this is the worst it will ever be”… and so it goes on and on and on. Waking up every day to articles about how your job is going to be replaced is not good for your mental health. When will we ever hear the end of the constant hype train?</p>

<p>I don’t want the AI hyperscaler tech bros to succeed. With the enshittification of technology steadily underway, I don’t trust a single one of them. Now that they have plagiarised the entire sum of human knowledge, they want to kick down the ladder, sue the competition, close it all off: we were here first. These advancements should be exciting, but knowing capitalism will certainly try its best to eventually squeeze every last penny out of it makes it hard for me to appreciate the moment we are living in.</p>

<p>That being said, these tools are impressive. Sometimes they amaze me, sometimes they frustrate me, and I dislike the landscape surrounding it all at the same time. The dichotomy is hard to explain. I didn’t want to get hooked on using them, but after just one session where my friend first showed me Claude Code, I was dreaming about it; the thought of using AI infiltrated my mind. I was somehow addicted already.</p>

<p>I work with ML researchers, and have dipped my toe in there myself a little. A while ago a colleague explained LLMs to me in a way that stuck: they’re very good at interpolation. Words are represented as vectors in a high-dimensional space, and the model learns patterns between them. Based on this insight I started with tasks that I thought would be simple and likely to succeed: adding new record store scrapers to my <a href="https://github.com/polymonster/diig">music app</a>; filling out some missing parts of a <a href="https://github.com/polymonster/hotline">graphics engine</a> backend; and clearing some long-overdue technical debt. Even though I expected these tasks to be trivial, I was still impressed with how Claude worked, how it asked questions to clarify details, and how it seemed to understand exactly what I wanted.</p>

<p>I was sucked in even further. I paid for a subscription despite saying I hated big tech and would not pay them a penny… What a hypocrite.</p>

<p>It’s hard to measure productivity, but I can say that AI has given me a motivational boost to get back into projects. Sometimes just knowing tech debt exists in a project is enough to slow me down. Asking AI to clean that up while I do the interesting stuff feels like a weight lifted off my shoulders. Coming back to a code base after a while, things can feel unfamiliar, and AI is great at easing you back in, explaining the current state of something that was a work in progress. I was really amazed at the issues Claude found when I asked it to review some of my code bases. It found subtle bugs just from looking at the code; I fixed them and together we added tests to catch those cases. The code review was so useful for my Rust <a href="https://github.com/polymonster/maths_rs">maths library</a> that I decided to publish it as version <a href="https://crates.io/crates/maths-rs">1.0.0</a>. These aspects of AI coding augment my skills, make light work of the boring stuff and also help me make sure every detail is covered meticulously.</p>

<p>The ease of development is powerful, but it can also be a double-edged sword. Since it is so easy to ask for a fix or a new feature, you can quickly end up with a lot of features and a lot of code. More code is not better; more code is bad: it is more to maintain and it increases the complexity of any future work. Being able to pile on features without thought dilutes our ability to discern the most impactful, meaningful changes. Good software engineers tend to naturally optimize in this regard, because it means less work. Less work is good, and laziness can make for smarter decisions.</p>

<p>The veneer can easily peel away when working on complex and abstract problems. I started to get into a rut with difficult tasks. Claude was struggling; I didn’t like the code it generated, it was taking a lot of time just to read it all, and I became detached from what I was trying to achieve. I started to realise that using an LLM to code completely changes our relationship to the code. This changing relationship has become prominent in a series I have been livestreaming on YouTube entitled <a href="https://www.youtube.com/watch?v=qLcyWXWqqNU">Sloppy Gamedev</a>, where the aim is to try to make a game using AI. I originally attempted to purely vibe code it, but during the first session I realised Claude wasn’t going to be able to make a game on vibes alone. The first attempt was pretty terrible: lots of hardcoded values, so it was neither extendable nor reusable, and lots of bugs.</p>

<p>So you could say this is a skill issue: I need better prompts or better context, and if you had all of that information ahead of time, maybe a purely vibe-coded game would be possible. Claude needs a lot of detail, and I also need to figure out and understand what those details should be. With things like gamedev a lot of the problems require iteration, and this is where I think things start to break down. What I need is a more collaborative relationship between my code, the LLM’s code, and our shared understanding of the architecture we need. I found this difficult at first because I did not like the LLM-generated code; I did not want to edit it myself and felt alienated.</p>

<p>Sometimes the code it generates is just very anti-human. One example: I noticed inline dot products and magnitude calculations, with the code repeated verbatim each time, expanding the scalar maths rather than using the maths library functions. Things that have always been implicit now need to be explicit, and currently, for me at least, it’s very difficult to forecast everything ahead of time. We can tweak things in plan mode, but after a long time spent refining a single plan I am itching to accept it, to see the parts that work, and then iterate again. This is how I naturally work: write a small burst of code, test it, run it, tweak it, and continue. Claude generates a lot of code quickly, and accepting code that partially works just to see it in action leads to immediate tech debt.</p>

<p>Working on a crowd simulation, agents were getting stuck on corners. I asked Claude multiple times to fix it; it piled on more code, each time claiming to have fixed it. I found myself trapped in the prompt loop death spiral. I had to force myself to sit down at the computer with no help allowed and just figure it out myself. I spent a good few hours just drawing some debug geometry and fiddling around with the problem. This is where the realisation really hit me about what I was missing. The process itself is a crucial part of development; it’s not about the lines of code in the end, it’s about the intuition you build to get there. In this session not only did I improve the agents getting stuck on corners, I also gained key insights into how to further improve it and how to parameterize and control the improvements. The code I wrote to do this was actually not great, it was messy and it was thrown away straight after, but that’s OK and that is part of the process. Somehow I had lost this ability when prompting Claude; it had changed me: frozen, unable to understand the code, only able to prompt again and again. It took restraint not to just reach out and ask an LLM to do the thing for me, but in pushing past that barrier I was rewarded.</p>

<p>Since then I have been better able to guide Claude with a newfound understanding of the problem. The important detail here is building intuition and knowledge, and this worries me, since I feel an element of skill atrophy when it’s so easy to just ask for help. I have already put in a lot of time learning the hard way; what chance do the newcomers have when people say there is no point in learning to code anymore? How are we able to steer the LLMs if we don’t understand the problems we need to solve? Claude’s attempt at gamedev was quite poor on its own; with guidance and collaboration it was much better, so for that reason I think learning to code, and learning to understand LLM-generated code, will always be an important skill.</p>

<p>If you liked this, check out my <a href="https://www.youtube.com/@polymonster">YouTube</a> where I’m messing around with AI, and my mostly handwritten <a href="https://github.com/polymonster">repos</a> :)</p>]]></content><author><name></name></author><summary type="html"><![CDATA[‘A candid first-person account of trying AI coding tools, weighing their genuine productivity gains against concerns about skill atrophy, trust, and the relentless AI hype cycle.’]]></summary></entry><entry><title type="html">Porting diig from iOS to Android in less than 2 weeks</title><link href="https://polymonster.co.uk/blog/porting-diig-to-android" rel="alternate" type="text/html" title="Porting diig from iOS to Android in less than 2 weeks" /><published>2026-01-28T00:00:00+00:00</published><updated>2026-01-28T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/porting-diig-to-android</id><content type="html" xml:base="https://polymonster.co.uk/blog/porting-diig-to-android"><![CDATA[<p>I recently decided to port my music app ‘diig’ to Android since I had some requests from friends and other potential Android users. The app was originally designed and built for iOS using all of my own code and no external frameworks. The whole thing took around 2 weeks to port, not full days of work, just a few hours here and there: 2 weeks to port an entire app, with 99% feature parity with iOS.</p>

<p>I began work while on the train travelling back to visit my parents and decided to ‘raw dog’ some relaxing ‘casual coding’ on the way. This part of the process was about as relaxing as Super Hans’ infamous ‘relaxing bit of crack’ line in Peep Show. The code base for <a href="https://github.com/polymonster/diig">diig</a> is already multi-platform since the backend uses my game engine <a href="https://github.com/polymonster/pmtech">pmtech</a>. The engine already supported a number of platforms, but not Android; however, since Android is built on Linux, I already had a lot of functionality I could reuse from my Linux backend. I have a well-tested OpenGL/WebGL/GLES rendering backend and, for the time being, FMOD for cross-platform audio. I knew that code-wise there were only a few gaps that needed filling in.</p>

<h2 id="premake">Premake</h2>

<p>Premake made getting started very easy. I think it is a very underrated and overlooked tool; I don’t know how CMake became the de facto gold standard for project generators. Premake makes project configuration easy with Lua scripts, which I find more flexible than CMake.</p>

<p>The Lua config setup allows you to specify project-level, compiler, and linker settings. With variables you can generically handle multiple platforms and configurations easily. My existing configs already had code paths for Win32, macOS, iOS and Linux with multiple rendering backends like Direct3D11, OpenGL and Metal. So a good portion of setting up Android was just plumbing through another combination of platform and config settings.</p>
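<p>As a rough sketch of what this looks like (a simplified premake5-style config, not my actual scripts; the define and link names here are made up for illustration), settings can be scoped per platform with filters:</p>

```lua
-- simplified premake5-style sketch; names are hypothetical
workspace "engine"
    configurations { "Debug", "Release" }
    platforms { "Win32", "macOS", "iOS", "Linux", "Android" }

project "app"
    kind "WindowedApp"
    language "C++"

    -- settings scoped to a single platform
    filter "platforms:Android"
        defines { "PLATFORM_ANDROID", "RENDERER_OPENGL" }
        links { "GLESv3", "log", "android" }

    filter "platforms:Win32"
        defines { "PLATFORM_WIN32", "RENDERER_D3D11" }
        links { "d3d11" }

    -- reset the filter so later settings apply everywhere
    filter {}
```

<p>Adding a new platform is then mostly a matter of adding another <code class="language-plaintext highlighter-rouge">filter</code> block and threading the new combination through any shared variables.</p>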

<p>Premake outputs CMake and Gradle build files. Any useful setting that is required in the Gradle files needs to be passed through from premake, so I had to add some new settings and propagate information set via premake into the Gradle or CMake files. This is a necessary step; it makes the setup a little more painful, but it serves you much better over time, because it ensures your projects can be generated the same way on other machines.</p>

<p>The Lua configs for multi-platform configuration were pretty easy to extend to add Android, although the code does feel a little spaghetti-like since platforms and features have been tacked on over a long period of time now. I decided to just add another bunch of stuff onto the Jenga tower and not get sidetracked refactoring. I would like to rewrite the scripts and could do a neater job, but it doesn’t really add much value to what I am trying to achieve here.</p>

<h2 id="android-sdk">Android SDK</h2>

<p>The real pain of the whole process was the Android platform itself; it’s just quite fiddly to get into a good state. The build system uses Gradle, CMake and ninja. There is the SDK, the NDK (Native Development Kit) for C/C++, and the JDK for Java. All of these dependencies have their own versions, and breaking changes happen frequently.</p>

<p>I thought I would first quickly plumb through a platform path for Android in my premake configs and get straight to fixing up compile errors. This was wishful thinking! I had to spend the best part of a 3-hour train journey fighting with all the various parts of the Android build system to massage the dependencies into a place where they worked together, as well as modifying and removing deprecated functionality, since I encountered breaking changes from a necessary SDK update.</p>

<p>I was doing this on a train with mobile hotspot internet on my phone and ended up with a 50GB+ install of various dependencies. This is one problem that will not go away. It works as of now, but as time goes on this process repeats as versions of the various dependencies update. It’s not just a case of updating an SDK and then fixing the deprecated parts: you have multiple components that you have to individually manage and ensure are compatible with each other. You suffer for a day to fix it all up once every 6 months, or whenever you come back to the project after a while away.</p>

<h2 id="android-studio">Android Studio</h2>

<p>Android Studio itself is not the nicest IDE or debugger, but it is handy to have a graphical debugger and not just debug from the command line, so I suffer through the issues. I do find the interface very noisy: there are a lot of pop-ups and dialogs, squiggles, indents, inline hints, and long error messages. As you work you can see the IntelliSense update and elements subtly shift and move in the UI. It infuriates me that the shortcut keys for debug stepping and continue are different to Visual Studio and VS Code, and that all of the shortcuts feel different and non-standard; this adds another layer of friction. I could spend time configuring it, but when you want to get stuck into some work in a short time period you don’t want to waste it configuring keyboard shortcuts.</p>

<p>Error messages in Android Studio also seem way more verbose and annoying than in other IDEs. For example, when you get a C++ compiler error, the end of the message has tons of unnecessary verbal spew about Java exceptions, and you have to scroll back through the log to find the real errors. This makes things especially stressful when you get confusing build errors you are unfamiliar with. Logcat is also particularly stressful: it outputs so much information that you have to sift through to find your own errors buried under a mountain of irrelevant info, and if you filter it you worry you are missing some critical extra detail.</p>

<h2 id="entry-point--program-structure">Entry Point / Program Structure</h2>

<p>For the entry point and core interaction with the SDK, Android uses Java or Kotlin code. Any C++ code needs to be compiled into a shared library. You can use an NDK-only approach, but I am familiar with the Java setup having used it in the past, and it does make some things easier, since the NDK and the C versions of the API are badly documented and there are more Java examples.</p>

<p>Android requires an Activity which represents the application flow. You implement methods such as <code class="language-plaintext highlighter-rouge">onCreate</code> or <code class="language-plaintext highlighter-rouge">onPause</code> and <code class="language-plaintext highlighter-rouge">onResume</code>. These are invoked by the OS when you start your app. I also implement a wrapper of the <code class="language-plaintext highlighter-rouge">SurfaceView</code> that handles the creation of an EGL context and OpenGL surface for rendering.</p>

<p>The engine consists of two C++ static libraries which are linked into the <code class="language-plaintext highlighter-rouge">diig.so</code> shared library. Android is a bit more awkward than other platforms because ordinarily you would build an executable that links the two C++ libs; in this case the “executable” is Java and we load the C++ code dynamically at launch.</p>

<h2 id="compiler-errors">Compiler Errors</h2>

<p>The next step on the journey to porting is to work through any compiler errors. This part of the process is actually where I start to feel more comfortable: mostly this is inside C++ files, and most of the errors are expected or in my own code, which gives me agency to fix them.</p>

<p>I have had to do a lot of porting for work so this part comes quite naturally. First the project will need tweaking a bit, making sure include paths are set and all the right files have been added to compilation. Then I tend to <code class="language-plaintext highlighter-rouge">ifdef</code> out problematic code that I don’t currently need and look at it later, to focus on a small subset of the code base. I try to use <code class="language-plaintext highlighter-rouge">ifdefs</code> for platform-specific functionality sparingly and instead split things into a file per platform. Some <code class="language-plaintext highlighter-rouge">ifdefs</code> go into shared code like OpenGL or the shared POSIX implementation. But once the project was set up, I didn’t have a great deal of legitimate compiler issues because of how much existing code was reused.</p>

<h2 id="linker-errors">Linker Errors</h2>

<p>Linker errors will occur for missing symbols that do not yet have an implementation for the Android platform. Most of these were to be expected and are an easy fix: I just make a function stub for each missing symbol, that is, an empty function that returns a default value where applicable.</p>

<p>There were a few tricky linker issues to solve involving the audio system. FMOD has its own native libs and they need to be copied into a subdirectory of the Android studio project called <code class="language-plaintext highlighter-rouge">jniLibs</code>. To do this I added a copy step in premake, which copies the files during premake project generation. FMOD also requires some calls to <code class="language-plaintext highlighter-rouge">loadLibs</code> to load the C++ code and a call to <code class="language-plaintext highlighter-rouge">FMOD_Android_JNI_Init</code>. This took me a little time to figure out since I had cryptic error messages, but persistence always prevails and I got there in the end.</p>

<h2 id="development">Development</h2>

<p>Getting to this point took a good few days, but this was the fun part, the place I wanted to be. With the code compiling and running, it was time to slowly, one function at a time, implement the missing functionality behind the stub functions. I used these small <a href="https://github.com/polymonster/pmtech/tree/master/examples/code">examples</a> as unit tests to isolate functionality.</p>

<p>The first step was to get the <code class="language-plaintext highlighter-rouge">empty_project</code> sample working, which just logs something to the console; this required implementing the logging macro, since printf does not display in logcat. After that it was a straightforward process: render the <code class="language-plaintext highlighter-rouge">basic_triangle</code> to make sure OpenGL was working OK, move on to the <code class="language-plaintext highlighter-rouge">imgui_example</code> to make sure I could use the UI, then <code class="language-plaintext highlighter-rouge">play_sound</code> to test FMOD, which is important since this is a music app, and finally <code class="language-plaintext highlighter-rouge">input_example</code> to hook in the input and touch events. Once these samples were working I had all of the core functionality for diig and the app should “just work”.</p>

<p>In total I ended up adding 871 lines of C++ code for the <code class="language-plaintext highlighter-rouge">os</code> module, 151 lines of C++ for Android filesystem-related code, and 487 lines of Java code for the core activity. This was the bulk of the entire backend. Modifications were required in a few places for platform-specific quirks in FMOD and OpenGL. There were also 100 or so lines of Lua code for premake.</p>

<h2 id="jni">JNI</h2>

<p>To interoperate between Java and C++ code, the Java Native Interface (JNI) is used. I use JNI to pass information from the Java side, such as touch and keyboard (OSK) events, through to the C++ code that the rest of the app calls. I also have to interop in both directions. Calling C from Java is quite simple: you just need to use <code class="language-plaintext highlighter-rouge">public static native</code>. Going the other direction is a little bit more work:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">os_clear_clipboard_string</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">auto</span> <span class="n">env</span> <span class="o">=</span> <span class="n">get_jni_env</span><span class="p">();</span>
    <span class="k">if</span><span class="p">(</span><span class="n">env</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">jmethodID</span> <span class="n">method</span> <span class="o">=</span> <span class="n">env</span><span class="o">-&gt;</span><span class="n">GetMethodID</span><span class="p">(</span><span class="n">s_android_context</span><span class="p">.</span><span class="n">m_activity_class</span><span class="p">,</span> <span class="s">"clearClipboardString"</span><span class="p">,</span> <span class="s">"()V"</span><span class="p">);</span>
        <span class="n">env</span><span class="o">-&gt;</span><span class="n">CallVoidMethod</span><span class="p">(</span><span class="n">s_android_context</span><span class="p">.</span><span class="n">m_activity_object</span><span class="p">,</span> <span class="n">method</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When calling Java from C++ you have to look the method up by name, but also provide a signature string for the types. Then there is a host of functions you can call, such as <code class="language-plaintext highlighter-rouge">CallVoidMethod</code>, <code class="language-plaintext highlighter-rouge">CallBooleanMethod</code> and so on for each return type. It’s pretty simple, but also easy to make a mistake and get the signature wrong or use the wrong typed call. It doesn’t take much effort, but the “plumbing” adds up, so I try to keep the number of these wrapper functions to a minimum.</p>

<h2 id="a-loose-end">A Loose End</h2>

<p>There is one loose end still to clean up, even after writing this post. It annoyed me when I encountered it, but it is just the way things are: when the app backgrounds on Android, upon return the whole app boots again from the start. I had to do this because on background and return the EGL context (OpenGL) is lost, and that means all GPU resources need to be recreated. iOS does not have this behaviour; the OS magically sorts it out for you. I was being lazy and just didn’t get round to thinking about a strategy for it yet. I have had to do this kind of thing before for my job, and I really cannot be bothered with this sort of menial work imposing itself on a project which I wanted to be light and fun; it all too soon starts feeling like a job. I suppose, depending on how far I want to take it, I will have to get round to fixing this, but for the time being I chose to ignore it.</p>

<h2 id="google-play-store">Google Play Store</h2>

<p>The final boss was automation and delivery to the Google Play store. First I had to pay £20 to actually set up an account. The one-off fee is certainly better than Apple’s yearly £80 developer fee, but my bank blocked the transaction and I had to jump through some additional hoops to make sure I didn’t accidentally pay it twice. Then you have to do identity verification, where I had to send my driving license, passport photo, a bank statement, and the blood of my first unborn child.</p>

<p>I went about setting up a GitHub Action to automate publishing. This <a href="https://github.com/r0adkll/upload-google-play">Action</a> is helpful for handling Google Play. I hooked it up, ran it, and the upload failed. I persisted until I discovered my account was not verified yet, so I had to wait a few days for that to happen. After verification I tried again: still failing to upload. It turns out you need to first push a build manually from the Google Play console; do that and run a build… still fails. At this point the error was about the JSON key not being valid for upload.</p>

<p>This was the most frustrating part of the entire process: the Google documentation was out of date. The way you enable auth or generate a token for Google Play upload had changed and the documentation had not. The error message was not very clear, so I tried many times adding permissions to various accounts. I tried using Copilot and it gaslit me time and time again. I persisted until I finally found this <a href="https://help.radio.co/en/articles/6232140-how-to-get-your-google-play-json-key">article</a> that described the new steps necessary to generate a JSON key. And success: cloud-based build and release with the push of a tag!</p>

<p>The automation to Google Play has been rock solid and reliable since, more so than the equivalent for iOS. At some point Apple started enforcing that build machines be registered and linked to your developer account to be able to push to TestFlight. This means you can’t use a cloud GitHub Actions runner; instead I need to use my own machine as a self-hosted runner. This adds extra admin and unnecessary friction to the whole process, whereas Android just looks after itself.</p>

<h2 id="onwards">Onwards</h2>

<p>Porting is a game of persistence. It can be a slog at times, but if you keep persisting, fixing the errors one by one, starting small and building outward, you always get there in the end.</p>

<p>It can be a bit of a rollercoaster: the dopamine hits hard when some existing code “just works” or things go smoothly because of well-planned abstractions set in place years ago, and there is a feeling of relief when you manage to work around some obtuse error from dependencies you have never heard of… just to be hit by crushing anxiety at the new error that appears in its place.</p>

<p>After the initial friction, the setup has been incredibly fun to use, and all of the extra effort required to set up and configure something to be not just multi-platform but seamlessly multi-platform means I can dip in and out and do little bits of work flexibly. I can work on my PC and target Android, with a dual monitor and more desk space, or on macOS and target Android or iOS on my laptop, on the go or more casually.</p>

<p>The diig app is available in closed beta for Android or iOS. If you would like to try it out, please contact me for an invite.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[How I ported the diig iOS music app to Android in under 2 weeks using the pmtech game engine's Linux backend, Premake, and OpenGL ES, achieving 99% feature parity.]]></summary></entry><entry><title type="html">Borrow checker says “No”! An error that scares me every single time!</title><link href="https://polymonster.co.uk/blog/borow-checker-says-no" rel="alternate" type="text/html" title="Borrow checker says “No”! An error that scares me every single time!" /><published>2025-10-31T00:00:00+00:00</published><updated>2025-10-31T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/borow-checker-says-no</id><content type="html" xml:base="https://polymonster.co.uk/blog/borow-checker-says-no"><![CDATA[<p>It’s Halloween, and I have just been caught out by a spooky borrow checker error. It feels like the single most time-consuming issue to fix and it always seems to catch me unaware. The issue in particular is “cannot borrow x immutably as it is already borrowed mutably”; it manifests in different ways under different circumstances, but I find myself hitting it often when refactoring. It happened again recently, so I did some investigating and thought I would discuss it in more detail.</p>

<p>The issue last hit me when I was refactoring some code in my graphics engine <a href="https://github.com/polymonster/hotline">hotline</a>. I have been creating some content on YouTube and, after a bit of a slog to fix the issue, I recorded a video going through the scenario of how it occurred and some patterns I have adopted in the past to get around it. You can check out the video if you are that way inclined; the rest of this post will mostly echo what is in the video, but it might be a bit easier to follow the code snippets and descriptions in text.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/5X4sftCRac0?si=CY8AfYDXrs_8WlZm" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p>I have a generic graphics API, which consists of traits, called <a href="https://github.com/polymonster/hotline/blob/master/src/gfx.rs">gfx</a>. This is there to allow different platform backends to implement the trait; currently I have a fully implemented Direct3D12 backend and I recently began a macOS port using Metal.</p>

<p>The gfx backend wraps underlying graphics API primitives; in this case we are mostly concerned with <code class="language-plaintext highlighter-rouge">CmdBuf</code>, which is a command buffer. Command buffers are used to submit commands to the GPU; they do things like <code class="language-plaintext highlighter-rouge">draw_indexed_instanced</code> or <code class="language-plaintext highlighter-rouge">set_render_pipeline</code>, amongst other things. For the purposes of this blog post, what the command buffer does is not really that important, just that it does <code class="language-plaintext highlighter-rouge">do_something</code>, which at the starting point, when the code was working, is a trait method that takes an immutable self and another immutable parameter, i.e. <code class="language-plaintext highlighter-rouge">fn do_something(&amp;self, param: &amp;Param)</code>.</p>

<p>In the rest of the code base I have a higher level rendering system called <code class="language-plaintext highlighter-rouge">pmfx</code>. This is graphics engine code that is not platform specific but implements shared functionality. So where <code class="language-plaintext highlighter-rouge">gfx</code> is a low level abstraction layer, <code class="language-plaintext highlighter-rouge">pmfx</code> implements the concept of a <code class="language-plaintext highlighter-rouge">View</code>: a view of a scene that we can render from. A <code class="language-plaintext highlighter-rouge">View</code> has a camera that looks at the scene and is passed to a render function, which can build a command buffer to render the scene from that camera’s perspective. The engine is designed to be multithreaded and render functions are dispatched through <code class="language-plaintext highlighter-rouge">bevy_ecs</code> systems, so a view gets passed into a render system wrapped in an <code class="language-plaintext highlighter-rouge">Arc&lt;Mutex&lt;View&gt;&gt;</code>.</p>

<p>I made a small cutdown example of this code to be able to demonstrate the problem I encounter, so let’s start with the initial working version:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="n">Mutex</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">Cmd</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">View</span> <span class="p">{</span>
    <span class="n">cmd</span><span class="p">:</span> <span class="n">Cmd</span><span class="p">,</span>
    <span class="n">param</span><span class="p">:</span> <span class="n">Param</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">Param</span><span class="p">;</span>

<span class="k">impl</span> <span class="n">Cmd</span>
<span class="p">{</span>
    <span class="k">fn</span> <span class="nf">do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">param</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Param</span><span class="p">)</span> <span class="p">{</span>
        <span class="nd">unimplemented!</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">get_view</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">Mutex</span><span class="o">&lt;</span><span class="n">View</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
    <span class="nd">unimplemented!</span><span class="p">();</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="n">view</span><span class="py">.cmd</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I tried to simplify it as much as possible, so these snippets should compile if you copy and paste them. They won’t run, thanks to the <code class="language-plaintext highlighter-rouge">unimplemented!</code> macro (which I absolutely love using, it is so handy!), but we only care about the borrow checker anyway.</p>

<p>All we really need to think about is that a <code class="language-plaintext highlighter-rouge">Cmd</code> can <code class="language-plaintext highlighter-rouge">do_something</code> and gets passed a <code class="language-plaintext highlighter-rouge">Param</code>, which is also contained within the view. Coming from a C/C++ background, my personal preference landed on procedural C-style code with context passing, so I tend to group things together into a single struct. That made sense to me here: I wanted to group everything inside <code class="language-plaintext highlighter-rouge">View</code>, and we fetch the view from elsewhere in the engine.</p>

<p>So the code in the snippet compiles fine and I was working with this setup for some time. When I began work on macOS it turned out that the <code class="language-plaintext highlighter-rouge">do_something</code> method needed to mutate some internal state in the command buffer, in order to make the Metal graphics API behave similarly to Direct3D12. This is common graphics API plumbing.</p>

<p>The specific example in this case was that in Direct3D we call a function <code class="language-plaintext highlighter-rouge">bind_index_buffer</code> to bind an index buffer before we make a call to <code class="language-plaintext highlighter-rouge">draw_indexed</code>, but in Metal there is no equivalent to bind an index buffer. Instead you pass a pointer to your index buffer when calling the equivalent draw indexed. So to fix this, when we call <code class="language-plaintext highlighter-rouge">bind_index_buffer</code> we can store some extra state in the command buffer so we can pass it in the later call to <code class="language-plaintext highlighter-rouge">draw_indexed</code>.</p>
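The idea of stashing state at bind time and consuming it at draw time can be sketched roughly like this; note that <code>IndexBufferState</code> and its handle fields are invented for illustration and this is not hotline’s actual API:

```rust
// Hypothetical sketch: the command buffer records index-buffer state when
// bind is called, so a Metal-style backend can supply it at draw time.

#[derive(Clone, Copy, PartialEq, Debug)]
struct IndexBufferState {
    buffer_id: u64, // stand-in for a handle/pointer to the index buffer
    offset: u64,
}

#[derive(Default)]
struct CmdBuf {
    bound_index_buffer: Option<IndexBufferState>,
}

impl CmdBuf {
    // bind only records state; nothing is encoded yet. This is the write
    // that forces the method to take &mut self.
    fn bind_index_buffer(&mut self, buffer_id: u64, offset: u64) {
        self.bound_index_buffer = Some(IndexBufferState { buffer_id, offset });
    }

    // draw consumes the stashed state; a real Metal backend would pass it
    // along with the draw call here. We return it so the flow is observable.
    fn draw_indexed(&mut self, index_count: u32) -> (u64, u64, u32) {
        let ib = self.bound_index_buffer.expect("no index buffer bound");
        (ib.buffer_id, ib.offset, index_count)
    }
}
```

The Direct3D12 backend can ignore the stashed state entirely and forward both calls straight to the API.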

<p>In hindsight, any method on the command buffer trait that sets state or writes into the command buffer should take a <code class="language-plaintext highlighter-rouge">&amp;mut self</code>, because it is mutating the command buffer after all. I originally didn’t do this because I am calling through to methods on <code class="language-plaintext highlighter-rouge">ID3D12CommandList</code>, which is unsafe code and does not require any mutable references.</p>

<p>In our simplified example, in order to store state, <code class="language-plaintext highlighter-rouge">do_something</code> now needs to change to take a mutable self: <code class="language-plaintext highlighter-rouge">do_something(&amp;mut self, param: &amp;Param)</code>. It should be noted that <code class="language-plaintext highlighter-rouge">view</code> itself was already <code class="language-plaintext highlighter-rouge">mut</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">impl</span> <span class="n">Cmd</span>
<span class="p">{</span>
    <span class="k">fn</span> <span class="nf">do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">param</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Param</span><span class="p">)</span> <span class="p">{</span>
        <span class="nd">unimplemented!</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="n">view</span><span class="py">.cmd</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The borrow checker now kicks in… my heart sinks. In the real code base it was not just a single call site that needed modifying; I had hundreds of places where this error was happening. I made the decision there and then to make any methods that write to the command buffer take a mutable self and make the mutability explicit.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error[E0502]: cannot borrow `view` as immutable because it is also borrowed as mutable
  --&gt; src/main.rs:30:28
   |
30 |     view.cmd.do_something(&amp;view.param);
   |     ----     ------------  ^^^^ immutable borrow occurs here
   |     |        |
   |     |        mutable borrow later used by call
   |     mutable borrow occurs here

For more information about this error, try `rustc --explain E0502`.
error: could not compile due to 1 previous error
</code></pre></div></div>
<p>This is not the first time I have encountered this problem and I doubt it will be the last. There are a number of ways to resolve it and they aren’t too complicated. The frustrating thing is that it always seems to occur when you are in the middle of something else, not just when you decide to refactor, so you end up with a mountain of errors to solve before you can get back to the original task. I suppose you could call it a symptom of bad design or lack of experience, but when writing code things inevitably change and bend with new requirements; Rust throws these unexpected issues up for me more often than C does, and often the required refactor takes more effort as well. But that is the cost you pay: hopefully more upfront effort to get past the borrow checker means fewer nasty debugging sessions later. So let’s look at some patterns to fix the issue!</p>

<h3 id="take">Take</h3>

<p>The one I actually went for in this case was using <code class="language-plaintext highlighter-rouge">std::mem::take</code>. We take the <code class="language-plaintext highlighter-rouge">CmdBuf</code> out of the view so we no longer need to borrow the view to use <code class="language-plaintext highlighter-rouge">cmd</code>, and then when finished we return the cmd to the view. It is important to note that <code class="language-plaintext highlighter-rouge">CmdBuf</code> needs to derive <code class="language-plaintext highlighter-rouge">Default</code> in order for this to work: when we take the <code class="language-plaintext highlighter-rouge">cmd</code>, <code class="language-plaintext highlighter-rouge">view.cmd</code> will become <code class="language-plaintext highlighter-rouge">CmdBuf::default()</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Default)]</span>
<span class="k">struct</span> <span class="n">Cmd</span><span class="p">;</span>

<span class="c1">// ..</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="c1">// take cmd out of view</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">cmd</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">take</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">view</span><span class="py">.cmd</span><span class="p">);</span>

    <span class="c1">// the immutable and mutable references are now split</span>
    <span class="n">cmd</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>

    <span class="c1">// return the cmd into view</span>
    <span class="n">view</span><span class="py">.cmd</span> <span class="o">=</span> <span class="n">cmd</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This approach was the simplest I could think of at the time because any existing code using <code class="language-plaintext highlighter-rouge">view.cmd</code> doesn’t need updating; everything stays the same and we just separate the references. In this case it was easy to derive <code class="language-plaintext highlighter-rouge">Default</code> for <code class="language-plaintext highlighter-rouge">CmdBuf</code>. You do need to remember to set the <code class="language-plaintext highlighter-rouge">cmd</code> back on <code class="language-plaintext highlighter-rouge">view</code>, which is a pitfall that could cause unexpected behaviour if you forgot.</p>
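One way to make the restore harder to forget (my own suggestion, not something from the original code base; the `with_cmd` helper and field types are invented) is a small wrapper that takes the field, runs a closure, and always puts the value back:

```rust
// A helper that temporarily moves `cmd` out of the struct, calls a closure
// with the split borrows, and restores it afterwards. Note: if the closure
// panics, the restore is skipped and `cmd` is left as the default.

#[derive(Default, PartialEq, Debug)]
struct Cmd(u32);

#[derive(Default)]
struct View {
    cmd: Cmd,
    param: u32,
}

impl View {
    fn with_cmd<R>(&mut self, f: impl FnOnce(&mut Cmd, &Self) -> R) -> R {
        // take cmd out, leaving Cmd::default() behind
        let mut cmd = std::mem::take(&mut self.cmd);
        // the mutable borrow of cmd and the shared borrow of self are now split
        let result = f(&mut cmd, self);
        // restore cmd so call sites can't forget to
        self.cmd = cmd;
        result
    }
}
```

Call sites then become `view.with_cmd(|cmd, view| cmd.do_something(&view.param))`, and the take/restore pair lives in one place.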

<h3 id="edit-update">EDIT: Update</h3>

<p>I posted this article to Reddit and people kindly pointed out that the borrow can’t be split into individual fields because I was borrowing through a <code class="language-plaintext highlighter-rouge">MutexGuard</code> of <code class="language-plaintext highlighter-rouge">view</code>, so access to the fields was going through the <code class="language-plaintext highlighter-rouge">DerefMut</code> trait. This single line resolves my problem with no need for any other changes.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// now we have a mutable reference to view and not a MutexGuard</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="o">*</span><span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
</code></pre></div></div>

<p>I can make excuses, but ultimately I should have checked in more detail what <code class="language-plaintext highlighter-rouge">view</code> actually was a reference to. In my defence, this code was inside an attribute macro and rust-analyzer wasn’t giving me any type hints, which in Rust I find very useful, if not necessary. Additionally the <code class="language-plaintext highlighter-rouge">DerefMut</code> trait abstracts this behaviour, so to me it just looked like a reference to a view. I do feel foolish about this, but hopefully the sentiment of this article still rings true: a bad decision in code of the past popped up at an inopportune moment and clouded my judgement on possible solutions. The other ideas in this post have still been useful in other scenarios, but an important step is to always double check that what you are working with is what you think you are working with, and not rush into any further bad decisions.</p>
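Putting the fix together with the cut-down types from earlier gives a self-contained, runnable sketch (the `calls` counter and `locked_increment` wrapper are added here purely to make the effect observable):

```rust
// Once we hold a plain `&mut View` rather than a MutexGuard, the compiler
// can split the borrow into disjoint fields: `cmd` mutably, `param` shared.
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Cmd {
    calls: u32,
}

#[derive(Default)]
struct Param;

#[derive(Default)]
struct View {
    cmd: Cmd,
    param: Param,
}

impl Cmd {
    fn do_something(&mut self, _param: &Param) {
        self.calls += 1;
    }
}

fn locked_increment() -> u32 {
    let v = Arc::new(Mutex::new(View::default()));

    // reborrow through the guard; the temporary MutexGuard's lifetime is
    // extended to the end of the enclosing block by the let binding
    let view = &mut *v.lock().unwrap();

    // disjoint field borrows are now allowed, so this compiles
    view.cmd.do_something(&view.param);
    view.cmd.calls
}
```

The key point is that field-splitting works on real references but not through `Deref`/`DerefMut` calls, which the compiler treats as opaque.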

<h3 id="clone">Clone</h3>

<p>If you can’t easily derive <code class="language-plaintext highlighter-rouge">Default</code> on a struct, there are some other options. If the struct is clonable, or you can easily derive <code class="language-plaintext highlighter-rouge">Clone</code>, you can clone to achieve a similar effect.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Clone)]</span>
<span class="k">struct</span> <span class="n">Cmd</span><span class="p">;</span>

<span class="c1">// ..</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="c1">// clone cmd</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">cmd</span> <span class="o">=</span> <span class="n">view</span><span class="py">.cmd</span><span class="nf">.clone</span><span class="p">();</span>

    <span class="c1">// the immutable and mutable references are now split</span>
    <span class="n">cmd</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Cloning might be a heavier operation than take depending on the circumstances, but this method has the same benefit as the take version: unaffected code using <code class="language-plaintext highlighter-rouge">cmd</code> elsewhere doesn’t need to be changed. One caveat: in the snippet above nothing writes the clone back, so if <code class="language-plaintext highlighter-rouge">do_something</code> mutates state you need later, remember to assign the clone back to <code class="language-plaintext highlighter-rouge">view.cmd</code> when finished.</p>

<h3 id="refcell">RefCell</h3>

<p>Another approach would be to use <code class="language-plaintext highlighter-rouge">RefCell</code>; this allows for interior mutability, and again we do not need to worry about <code class="language-plaintext highlighter-rouge">Default</code> or <code class="language-plaintext highlighter-rouge">Clone</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">cell</span><span class="p">::</span><span class="n">RefCell</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">Cmd</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">View</span> <span class="p">{</span>
    <span class="n">cmd</span><span class="p">:</span> <span class="n">RefCell</span><span class="o">&lt;</span><span class="n">Cmd</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">param</span><span class="p">:</span> <span class="n">Param</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="c1">// borrow ref cell</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">cmd</span> <span class="o">=</span> <span class="n">view</span><span class="py">.cmd</span><span class="nf">.borrow_mut</span><span class="p">();</span>

    <span class="c1">// the immutable and mutable references are now split</span>
    <span class="n">cmd</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>
<span class="p">}</span>

</code></pre></div></div>

<h3 id="option-takeswap">Option (Take/Swap)</h3>


<p>There are more options; quite literally, <code class="language-plaintext highlighter-rouge">Option</code> can help here. If we make <code class="language-plaintext highlighter-rouge">cmd</code> an <code class="language-plaintext highlighter-rouge">Option&lt;CmdBuf&gt;</code> then <code class="language-plaintext highlighter-rouge">None</code> serves as the default and we can use the <code class="language-plaintext highlighter-rouge">std::mem::take</code> approach. Alternatively we can use <code class="language-plaintext highlighter-rouge">std::mem::swap</code> and swap with <code class="language-plaintext highlighter-rouge">None</code>; swapping works much like take, except we supply the replacement value explicitly.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">Cmd</span><span class="p">;</span>

<span class="k">struct</span> <span class="n">View</span> <span class="p">{</span>
    <span class="n">cmd</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">Cmd</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">param</span><span class="p">:</span> <span class="n">Param</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="c1">// take cmd out of view, leaving None behind</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">cmd</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">take</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">view</span><span class="py">.cmd</span><span class="p">);</span>

    <span class="c1">// the immutable and mutable references are now split</span>
    <span class="n">cmd</span><span class="nf">.as_mut</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>

    <span class="c1">// return the cmd to view</span>
    <span class="n">view</span><span class="py">.cmd</span> <span class="o">=</span> <span class="n">cmd</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Option</code> approach also requires more effort, as we now need to unwrap the option, and update any code that ever used <code class="language-plaintext highlighter-rouge">view.cmd</code> to do the same. Not ideal, but it gets around the need for <code class="language-plaintext highlighter-rouge">Default</code> or <code class="language-plaintext highlighter-rouge">Clone</code>, and if your type is already optional then this will fit easily.</p>
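The swap flavour can be sketched as follows; the `demo` wrapper and `Cmd` payload are illustrative, but the effect is identical to `take` on an `Option`, just with the replacement value spelled out:

```rust
// Swapping an Option out with an explicit None, mutating the owned value,
// then swapping it back in when finished.

#[derive(Debug, PartialEq)]
struct Cmd(u32);

fn demo() -> Option<Cmd> {
    let mut slot = Some(Cmd(1));

    // swap the Option out, leaving None in its place
    let mut cmd = None;
    std::mem::swap(&mut cmd, &mut slot);

    // slot is now None; cmd owns the value and can be mutated freely
    assert!(slot.is_none());
    if let Some(c) = &mut cmd {
        c.0 += 1;
    }

    // swap it back when done
    std::mem::swap(&mut cmd, &mut slot);
    slot
}
```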

<h3 id="interior-mutability">Interior Mutability</h3>

<p>There is one final approach that could save a lot of time, and that would be to not change the <code class="language-plaintext highlighter-rouge">do_something</code> function at all in the first place. That is to keep it as <code class="language-plaintext highlighter-rouge">do_something(&amp;self, param: &amp;Param)</code>. So how do we mutate the interior state without requiring the self to be mutable?</p>

<p>This can be done with <code class="language-plaintext highlighter-rouge">RefCell</code> in single-threaded code or <code class="language-plaintext highlighter-rouge">RwLock</code> in multithreaded code. Since we already looked at <code class="language-plaintext highlighter-rouge">RefCell</code>, I will do an example of <code class="language-plaintext highlighter-rouge">RwLock</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::{</span><span class="nb">Arc</span><span class="p">,</span> <span class="n">RwLock</span><span class="p">};</span>

<span class="k">struct</span> <span class="n">Cmd</span> <span class="p">{</span>
    <span class="n">interior</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">RwLock</span><span class="o">&lt;</span><span class="nb">u32</span><span class="o">&gt;&gt;</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Cmd</span>
<span class="p">{</span>
    <span class="k">fn</span> <span class="nf">do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">param</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Param</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// we now mutate the interior, locking and writing in a thread-safe way</span>
        <span class="k">let</span> <span class="n">interior</span> <span class="o">=</span> <span class="k">self</span><span class="py">.interior</span><span class="nf">.try_write</span><span class="p">()</span><span class="nf">.and_then</span><span class="p">(|</span><span class="k">mut</span> <span class="n">interior</span><span class="p">|</span> <span class="p">{</span>
            <span class="o">*</span><span class="n">interior</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="nf">Ok</span><span class="p">(())</span>
        <span class="p">});</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="nf">get_view</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="n">view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="c1">// code at the call site can stay the same as the original</span>
    <span class="n">view</span><span class="py">.cmd</span><span class="nf">.do_something</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.param</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I decided to make the mutability explicit in the trait based on how command buffers are used in the engine; in other places I have taken other approaches, favouring interior mutability. In this case a view can be dispatched in parallel with other views, but the engine is designed with one thread per view, so no work happens on a single view from multiple threads at the same time. Command buffers are submitted in a queue, in order, and dispatched on the GPU.</p>

<p>Here it made sense to me to avoid the locking overhead of interior mutability every time we call a method on a <code class="language-plaintext highlighter-rouge">CmdBuf</code>, and it works with the engine’s design. We lock a view at the start of a render thread, fill it with commands and then hand it back to the graphics engine for submission to the GPU. The usage is explicit; we just needed to appease the borrow checker!</p>

<p>I hope you enjoyed this article. Please check out my <a href="https://www.youtube.com/@polymonster">YouTube channel</a> for more videos, or more articles on my blog; let me know what you think, and if you have any other strategies or approaches I would love to hear about them. I would also like to hear about compiler and borrow checker errors you find particularly time consuming or frustrating to deal with.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A deep dive into Rust’s “cannot borrow as immutable because it is also borrowed as mutable” error, with patterns and strategies to resolve it during refactoring.]]></summary></entry><entry><title type="html">diig - A music discovery app for record diggers</title><link href="https://polymonster.co.uk/blog/diig" rel="alternate" type="text/html" title="diig - A music discovery app for record diggers" /><published>2025-10-12T00:00:00+00:00</published><updated>2025-10-12T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/diig</id><content type="html" xml:base="https://polymonster.co.uk/blog/diig"><![CDATA[<p>diig is a music discovery app and the beginnings of a music platform that I started working on a few years ago. In that time I have been using the app myself, and so have a few friends, but I haven’t really announced much about it, so I thought I would get some words down about the project. The name diig comes from the term crate digger, which is given to record collectors who dig through vast quantities of records to find hidden gems.</p>

<p>The idea came from a frustration with my user experience of online record stores. Online record shops have audio snippets of the records they sell so you can browse and listen before you buy; obviously if you want to buy music it is usually best to know what it sounds like first (although I have been known to buy blind, especially if a record has a really cool sleeve or artwork!). The problem with these websites is that their music players are not perfect, consistency across different stores is variable, the desktop versions usually perform much better than the mobile ones, and the general UX of listening to snippets while browsing just never felt how I wanted it to.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/orp7-q3D72I?si=REl_WjaL8ga7iE59" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p>I am mainly interested in buying physical vinyl records. The main goal of digging is to go through a lot of music as quickly as possible to find obscure things that your friends might not know. Buying in physical shops is nice, especially in a store that lets you listen before you buy; the responsiveness of dropping a needle into a groove and skipping through tracks feels really good. But in the modern world, with so much music being released all the time, online shopping is still a crucial part of record collecting for me, and I buy some things online that I won’t find anywhere else. The two can also complement one another: it’s nice to have an idea of what you like in advance of visiting a physical store.</p>

<p>For the first part of the project my focus has been on a mobile app, and I wanted to optimise that experience. So what constitutes feeling nice, and what am I trying to optimise for here? I found online record store players to be quite laggy, with a good deal of latency between pressing play on a track and actually hearing the audio. We are talking fine margins, but if you want to listen through something like 400 snippets of audio in 10 minutes, the latency adds up. I also found the UI just not great for mobile; having to click instead of swipe doesn’t help the experience and makes it feel clunky.</p>

<p>One place that I found really nice in terms of UX for browsing music snippets was Instagram. Record labels would put up new releases and you could swipe right to go through the individual tracks. The problem is that the Instagram algorithm pollutes everything and you can’t curate a purely music-only feed. So with this idea in mind, I was sure I could implement something similar and provide myself a niche app tailored perfectly to my use case.</p>

<p>Having worked on game engines, games and low level high performance systems I knew I had the skills to make the app. I also had the added boost and insights from the previous company I worked at where we made the live action branching narrative game Erica, which required low latency video and audio playback. Here I was dealing with streaming, buffering, and decoding audio and high definition video… I thought to myself just having to play some compressed mp3 snippets is going to be easy!</p>

<p>It was in fact easy; the iOS app itself was up and running in a couple of weekends. I also “rawdogged” pretty much all of this code, no LLM and no Copilot. I had a head start because I used my game engine <a href="https://github.com/polymonster/pmtech">pmtech</a> to do all the graphics, OS and low-level stuff. The app is mostly C-style C++ with some Objective-C for iOS, and I built the UI in ImGui. To get data into the app I am scraping info from my favourite record shops. The <a href="https://github.com/polymonster/diig/tree/main/scrape">scrapers</a> are written in Python and use simple ad hoc parsing code which extracts information about releases, mp3 and image links into a simple unified schema. The releases are uploaded to a Firebase database where they can be fetched from the app, and you can log in as a user to store your own likes and sync them across different devices. Scrapers run nightly on a <a href="https://github.com/polymonster/diig/actions">GitHub action</a>, and once a release has been populated on the initial scrape, its availability is the only thing that needs updating: whether it is available for preorder, in stock, or out of stock. The scrapers also track position information so that you can view things in a chart format, as the record stores usually have a chart for each genre or category. I have an automated GitHub action which can push updates of the app to <a href="https://github.com/polymonster/diig/actions/workflows/release_testflight.yml">iOS via TestFlight</a>, but the app doesn’t need updating often and the data it pulls is all stored and updated in the cloud.</p>
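<p>To make that concrete, here is a minimal sketch of the kind of ad hoc scraper parsing described above. This is not the actual diig code; the HTML attributes and schema field names are purely illustrative.</p>

```python
import re

def scrape_releases(html: str) -> list[dict]:
    """Extract releases from a store page into a unified schema.

    Each store gets its own small, ad hoc pattern like this one; the
    attribute names below are made up for illustration.
    """
    pattern = re.compile(
        r'<div class="release".*?data-title="(?P<title>[^"]+)"'
        r'.*?data-mp3="(?P<mp3>[^"]+)"'
        r'.*?data-image="(?P<image>[^"]+)"',
        re.DOTALL,
    )
    releases = []
    for m in pattern.finditer(html):
        releases.append({
            "title": m.group("title"),
            "mp3": m.group("mp3"),        # snippet audio link
            "artwork": m.group("image"),  # sleeve image link
            "availability": "in_stock",   # refreshed on later scrapes
        })
    return releases

page = ('<div class="release" data-title="Test EP" '
        'data-mp3="https://example.com/t.mp3" '
        'data-image="https://example.com/t.jpg"></div>')
print(scrape_releases(page))
```

<p>Once every store’s scraper emits this one schema, the upload and the app only ever have to deal with a single format, however messy the individual store pages are.</p>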

<p>And that is about it: the main app and scraper ecosystem has been up and running for a few years, and I have been using diig to help me browse for and discover new music. I recently added a new scraper to the project and put some videos on <a href="https://www.youtube.com/playlist?list=PLReR5EQ5ED7Oca7bp3Gv9S3vb4lYcDKZc">YouTube</a> covering that process in more detail. I plan on continuing work on this project and now have some ideas for more components of the diig platform: I would like to add Android support and also a web front end that can provide a different user experience. All of this is currently in closed beta; if you’re interested in trying it out then please contact me. If you want to contribute, the code is available on <a href="https://github.com/polymonster/diig">GitHub</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[An introduction to diig, a music discovery iOS app built from scratch for vinyl record collectors to quickly browse and preview audio from online record stores.]]></summary></entry><entry><title type="html">Maintaining CI is a pain in the…</title><link href="https://polymonster.co.uk/blog/ci-pain" rel="alternate" type="text/html" title="Maintaining CI is a pain in the…" /><published>2025-03-04T00:00:00+00:00</published><updated>2025-03-04T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/ci-pain</id><content type="html" xml:base="https://polymonster.co.uk/blog/ci-pain"><![CDATA[<p>An ongoing source of frustration is maintaining continuous integration in open source hobby projects. It’s really useful to have continuous builds, automated tests and package delivery, but it comes with maintenance. Time passes, and eventually I want to tag a build in git and let all my lovely automated CI publish a package, or return to a project I haven’t touched for a while and run its tests - and more often than not, the build fails for an unexpected reason.</p>

<p>The problem is that even if very little changes in the source code, the CI often fails for reasons out of your control. It takes a while to get back into the headspace of how the build is configured and start debugging a problem. It’s really annoying when you just want to spend time working on something new and fun and instead find yourself sweating through what was supposed to be a relaxing Saturday morning, trying to fix tests and areas of the code you didn’t intend to look at. You end up with the “fix CI” commit history of death as you push changes and wait to see the results on a cloud hosted runner.</p>

<p>There are various reasons why this happens. I’ve just gone through a frustrating ordeal updating my iOS distribution certificates, which expired recently and prevented me from publishing a new build of my iOS app <a href="https://github.com/polymonster/diig">diig</a>. The app beta expired so I stopped being able to use it; this happens every 60 days, and since I haven’t had to make any changes to the app itself for a while, every 60 days I have to push a new build. I haven’t released the app to the AppStore to make it publicly available because it’s something I’m just using personally. The 60 day limit in itself is annoying, but having to do the yearly certificate and provisioning profile update is even more so. I always forget all of the things you need, so for my future self here is the rough rundown:</p>

<p>First you need a development certificate and a distribution certificate; you can create new certificates on the Apple Developer website in the Certificates, Identifiers &amp; Profiles section. You also need a certificate signing request, which can be created through Keychain Access &gt; Certificate Assistant &gt; Request a Certificate from a Certificate Authority.</p>

<p>The certificates (.cer files) can be downloaded and then imported into the keychain and then exported as a .p12 file with a password. The password here is stored in GitHub Actions as a secret. The .p12 files can be encoded as base64: <code class="language-plaintext highlighter-rouge">base64 -i dist.p12</code>. The output in the console is copied into another secret. Here I have something along the lines of DEV_P12 and DIST_P12.</p>
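<p>The reason the pasted secrets have to be byte-perfect is that base64 is an exact, reversible encoding of the certificate bytes. A quick illustration of the round trip (in Python rather than the shell commands above, purely for demonstration; the bytes are a stand-in for a real .p12):</p>

```python
import base64
import binascii

# Stand-in for the real .p12 bytes exported from Keychain Access
p12_bytes = bytes(range(256))

# Equivalent of `base64 -i dist.p12`; this string goes into the secret
secret = base64.b64encode(p12_bytes).decode("ascii")

# What CI reconstructs when decoding the secret back into a .p12 file
assert base64.b64decode(secret) == p12_bytes

# A single stray character in the pasted secret is enough to break it
try:
    base64.b64decode(secret + "!", validate=True)
except binascii.Error:
    print("corrupted secret rejected")
```

<p>Which is why it pays to paste the encoded output straight from the terminal with no edits.</p>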

<p>Next, a provisioning profile is required for both development and distribution; these can also be generated from the Certificates, Identifiers &amp; Profiles section. I created an iOS development profile and selected the development certificate, and did the same for the iOS distribution profile.</p>

<p>The profiles are added to the repository (they should probably also be secret, but this was how the build was already set up) and copied into the <code class="language-plaintext highlighter-rouge">~/Library/MobileDevice/Provisioning Profiles</code> folder on the build agent.</p>

<p>Finally, everything should build, because the actions <a href="https://github.com/polymonster/diig/blob/main/.github/workflows/release_testflight.yml">yml</a> file does the file copying, the base64 decoding and all of that jazz. But I was wrong; the build was still failing. The error was that Xcode did not have a valid provisioning profile. OK, then maybe something was up with the certs or the profiles. I revoked them and generated them again, being extra careful to make sure the right cert was named the right thing and the pasted secrets didn’t have any extraneous characters or mistakes. Try building again: same error. Maybe just redo the certs and profiles again, just to be sure? Still the same problem!</p>

<p>At this point, tagging builds (I burned 5 tags), pushing and waiting for the dreaded CI failure was getting annoying, so I decided to see what I could do locally on my machine to reproduce the issue more rapidly. The catch is that my keychain has working provisioning profiles managed by Xcode, so I am able to build locally; that was why I didn’t try this sooner. I needed to reproduce the conditions of an external machine with no user account connected to Xcode.</p>

<p>I realised I was able to look in the <code class="language-plaintext highlighter-rouge">~/Library/MobileDevice/Provisioning Profiles</code> folder and see the older stale profiles (from the last time I set this up). Ahh, I can delete those ones and see if I can reproduce the issue using the archive command line:</p>

<p><code class="language-plaintext highlighter-rouge">xcodebuild archive -workspace build/ios/diig_ios.xcworkspace -configuration Release -scheme diig -archivePath build/ios/diig_ios OTHER_CODE_SIGN_FLAGS="--keychain $KEYCHAIN_PATH" PROVISIONING_PROFILE="digiosdev" CODE_SIGN_STYLE="Manual" -verbose</code></p>

<p>Error: Xcode requires a valid provisioning profile.</p>

<p>But the profile digiosdev is clearly there in the folder, so why does Xcode complain there is no provisioning profile? Copilot was able to help me here: it suggested using <code class="language-plaintext highlighter-rouge">PROVISIONING_PROFILE_SPECIFIER</code> instead of <code class="language-plaintext highlighter-rouge">PROVISIONING_PROFILE</code>.</p>

<p>Problem solved. This took me a few hours on a Saturday morning before leaving to meet friends, and then a further few hours the next day to fiddle around and get the build working again. I did all the certificate and provisioning profile stuff correctly the first time, and it’s annoying that, for some reason, since updating the profile <code class="language-plaintext highlighter-rouge">PROVISIONING_PROFILE_SPECIFIER</code> was necessary for it to be picked up. Maybe it was due to an Xcode update? Apple has a tendency to change things a lot, deprecate APIs, and make changes to signing and distribution; it’s painful to keep up at times.</p>

<p>But herein lies the crux of it all: even if you don’t change a thing yourself, the world around you can change, and that can cause build systems to suddenly fail.</p>

<p>This has happened to me countless times. Python environment setup has changed multiple times on different platforms and for different projects over the years. Something shifts, my <code class="language-plaintext highlighter-rouge">pip</code> setup fails, and I hack around to find the working invocation. Could it be <code class="language-plaintext highlighter-rouge">pip3</code> or <code class="language-plaintext highlighter-rouge">python3 -m pip</code> or <code class="language-plaintext highlighter-rouge">py -3 -m pip</code>, or maybe using brew to install Python instead? I don’t know; just hack until it works again.</p>
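<p>My “hack until it works” loop is essentially the following, sketched here as a script. The candidate list and the helper name are my own invention, not a standard recipe:</p>

```python
import shutil
import subprocess
import sys

# Candidate pip invocations, roughly in the order I end up trying them
CANDIDATES = [
    ["pip3"],
    [sys.executable, "-m", "pip"],
    ["py", "-3", "-m", "pip"],
    ["pip"],
]

def find_pip():
    """Return the first invocation that answers --version, else None."""
    for cmd in CANDIDATES:
        if shutil.which(cmd[0]) is None:
            continue  # executable not on PATH at all
        result = subprocess.run(
            cmd + ["--version"], capture_output=True, text=True
        )
        if result.returncode == 0:
            return cmd
    return None

print("working pip invocation:", find_pip())
```

<p>Automating the probing at least turns the guessing game into one script, though it doesn’t stop the underlying platform from shifting again next year.</p>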

<p>Android builds on Linux have been another immense pain in my <a href="https://github.com/polymonster/premake-android-studio">android-studio</a> project. Android also has a tendency to change a lot, and there is a lot that goes into it: SDK, NDK, Gradle, Kotlin, Java, CMake, Ninja and even more build systems in there. All of these changing over time cause headaches, especially if you haven’t touched the thing for a year and somebody comes along with a small PR and all of a sudden the CI is broken. At one time I had to forcibly downgrade the Java version on the actions runner because it caused a known crash in the Android Studio licensing agreement; this fixed it for a while, but then the Java version I needed became unavailable to GitHub actions and I had to upgrade and find other fixes… thanks also to PR contributors for helping to maintain the CI on that project.</p>

<p>Another frustrating session of CI fixing came in my Rust graphics engine <a href="https://github.com/polymonster/hotline">hotline</a>. A Rust compiler update in conjunction with a particular <a href="https://docs.rs/bevy_ecs/latest/bevy_ecs/">bevy_ecs</a> version started to cause a hard to diagnose crash in my tests. It only happened in the tests: I couldn’t reproduce it in a standalone build, and I also couldn’t reproduce it in a single test. It was only when all tests ran (single threaded) that eventually one would crash. I spent weeks on this, only half an hour or so after work, but chipping away at it, trying to debug it and make sense of what was going on. I had particular difficulty because I had no symbols or callstack; I rolled back my code to a known working version where the tests had passed and been published to crates.io, and it was still crashing. In the end the fix was to update bevy_ecs, which sounds straightforward, but it took me a while to attribute the crash to bevy_ecs, and updating required me to fix API-breaking changes in my code; it was not simply a case of a version bump. It was frustrating to spend a few weeks fixing these tests for a reason unrelated to what I wanted them for: helping me implement new features without breaking existing functionality.</p>

<p>Another perplexing issue with a Rust project was when it began to fail compilation even though no changes to the code were made. The reason was that an external dependency had released an updated version. This particular crate had been <a href="https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html">patched</a>, and patching only applies to a specific version of a crate; since the version changed, the patch was not applied and the unpatched version did not compile. This is where I learned about <a href="https://doc.rust-lang.org/cargo/reference/resolver.html">explicit versioning</a> in cargo, and how even with a full version specifier cargo may try to change or update a dependency version to make the best fit within the cargo tree. In this case the solution was to commit the Cargo.lock file and use <code class="language-plaintext highlighter-rouge">cargo build --frozen</code> to make the CI more stable. An easy fix, but unexpected symptoms always cause alarm at first.</p>

<p>Some conclusions I can draw from these scenarios: I could run the CI periodically so issues get caught sooner and not just when I am making changes, but that would cause the same kind of frustration and might even be worse for a hobby project, where being alerted that CI is broken means knowing I have to fix it at some point, detracting from other projects. Using custom docker images would help to lock down the versions of software running the builds; I don’t know much about that though, so it’s something to look into. Cargo.lock proved a good solution to enforce stable versioning in Rust projects. So at least there are some measures that improve reliability, but they don’t help with issues like <code class="language-plaintext highlighter-rouge">PROVISIONING_PROFILE_SPECIFIER</code>, which threw me off totally; I was so close the first time to getting it right and this completely screwed me. macOS is constantly updating, forcing you to update Xcode and to face and fix these problems head on.</p>

<p>Maintaining CI is a pain in the proverbial. In a production environment, for a job and with a team, it’s a necessity and you generally have better coverage; for a solo project it’s great to have but a burden to maintain. Even if nothing changes on your side, the world changes around you; sometimes you just gotta suck it up and fix it.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The ongoing frustrations of maintaining CI pipelines for open source hobby projects: expired iOS certificates, dependency drift, and the "fix CI" commit spiral.]]></summary></entry><entry><title type="html">A Haiku About Debugging and the Perception of Productivity</title><link href="https://polymonster.co.uk/blog/debugging-haiku" rel="alternate" type="text/html" title="A Haiku About Debugging and the Perception of Productivity" /><published>2025-02-07T00:00:00+00:00</published><updated>2025-02-07T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/debugging-haiku</id><content type="html" xml:base="https://polymonster.co.uk/blog/debugging-haiku"><![CDATA[<p>Often a small fix<br />
Requires thorough debugging<br />
Its trace left unseen<br /></p>

<p>I wrote an almost-haiku in a Slack message to a coworker as we approached the end of the sprint. It came after spending the better part of two days debugging a difficult problem, which ultimately led to the addition of just two characters to a C++ source file to fix the issue. Along the way, I actually wrote a significant amount of ephemeral code across a data pipeline executable, a graphics runtime executable, and shader code.</p>

<p>Earlier in my career, I struggled with deleting this kind of transient code. I often felt it might be worth keeping, just in case a similar issue arose in the future. I didn’t want my efforts to go to waste. But over time, I’ve learned to let it go. Debugging isn’t just about the code that remains; it’s also about the effort invested in understanding the problem. It includes all the temporary print statements, UI widgets, debug primitives, and countless other tools hastily written and discarded.</p>

<p>Then there’s the time spent in the debugger: stepping through execution, analysing hex values, copying and pasting into notepads, diffing outputs, crafting heroic watch expressions, or tracing obscure memory aliasing issues. All of this work requires experience and patience.</p>

<p>I was fortunate to learn from engineers who were magicians at hardcore debugging, and they passed their wisdom on to me. In turn, I passed it down to the next generation. But still, a part of me feels the need to justify the work I’ve done, because some things are difficult to measure.</p>

<p>When you finally fix a one-liner, it seems obvious in hindsight. It’s frustrating to think you didn’t find the solution sooner. Sometimes you were close, only for another clue to send you down the wrong tangent. Other times, you feel like a badass, pulling off some low-level trickery, only to realise it had nothing to do with the actual problem, and now it just feels like time wasted, showing off to yourself.</p>

<p>Among respected colleagues and peers, we can talk about these endeavours. We learn, laugh, and congratulate one another. That’s never been an issue because we understand what it’s like. But often, there are people outside our world who don’t.</p>

<p>They want time logged. They want burn-down charts. They want to know why something took so long. They ask, “Why did you say this was a small t-shirt size?” (or whatever nonsense they think helps quantify complexity).</p>

<p>This isn’t meant as a dig at those people — they’re often just asking a question, not accusing anyone of wasting time. But for me, it triggers something. It makes me overthink, constantly trying to justify the effort. I suspect this comes from a deeply ingrained attitude toward work, conditioned by society:</p>

<p>That work means sitting at a desk from 9 to 5. That you must be physically present to be productive. That you must return to the office, not because it’s better, but because we don’t like the idea of you doing your washing at home during the workday. That, deep down, we are still mill workers.</p>

<p>Ironically, I don’t even have a return-to-office mandate. I’m not required to go in at all. But I still feel it. It’s been subliminally drilled into me since childhood: showing up is the job. It doesn’t account for the times I’ve solved a bug in my sleep. Or that time on the tube, when I mentally fixed an atomic race condition in a shader, then got to my desk and wrote a few lines of code to solve it, before slowly shifting into the headspace of the next challenge.</p>

<p>This haiku is my reminder - maybe even the start of reconditioning myself. To anyone else reading this: the unseen work, while ephemeral, is often the most important.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Reflections on the invisible work of debugging: how a two-day investigation producing just two changed characters is still deeply valuable, and why ephemeral code matters.]]></summary></entry><entry><title type="html">printf debugging is OK</title><link href="https://polymonster.co.uk/blog/printf-debugging-is-ok" rel="alternate" type="text/html" title="printf debugging is OK" /><published>2024-05-06T00:00:00+00:00</published><updated>2024-05-06T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/printf-debugging-is-ok</id><content type="html" xml:base="https://polymonster.co.uk/blog/printf-debugging-is-ok"><![CDATA[<p>I stopped going on Twitter a while ago because it has the tendency to evoke rage, as it is designed to do. But every now and then I check back in - it can be useful sometimes for keeping up with graphics research, gamedev news and some people do post nice things, like sharing projects they are working on, so there is something to pull me back from time to time.</p>

<p>After checking the other day I saw a debate going around about not using an IDE or debugger, just using ‘notepad’ to write code. I looked in the comments: people arguing about who was right, all the usual toxic vibes, and it reminded me of earlier occasions when people discussed the same topic.</p>

<p>It feels like the same old debate has been going on for a long time now; it’s packaged differently each time, but I don’t really know why people get so wound up about it. The main arguments are “if you need to use a debugger you’re an idiot and you don’t understand the code you are writing” (not an actual quote, but there were similar takes along those lines). Then there is “if you can’t use a debugger you’re an idiot”. The hating on the ‘printf’ crew is omnipresent.</p>

<p>At the risk of poking a hornet’s nest, I just wanted to share some thoughts and ideas on this subject in a balanced way, because I don’t think there needs to be an ultimate solution here. We need to debug code and there are tools out there to help us, some are more useful than others in certain situations, but at the end of the day do whatever you need to do to fix those bugs.</p>

<h2 id="debuggers">Debuggers</h2>

<p>I use a debugger regularly; I will launch most work in C++ from Visual Studio or Xcode and preferably run in a debug build. I know for some people this is a terrible UX because of the performance of debug builds, so a prerequisite here is fast debug builds. This is hard to retrofit, but having a usable debug build is worth the effort. Once running I can use the debugger to break and step if I need to, and if I encounter a crash then there is a nice call stack I can look through in more detail.</p>

<p>I have noticed that it is extremely common for graduate and junior software engineers to have little to no debugging knowledge or experience. It doesn’t seem to be taught at university, and I have also been told stories of teachers imposing their usage of VIM and esoteric debugging strategies upon students. For the record I am not a VIM user (another topic that ends up in polarising debates); I find using a mouse and 2-finger typing works for me.</p>

<p>The moment when you show someone how to use a hardware breakpoint or a watchpoint and find a bug immediately is like seeing the lightbulb appear on top of their head, a whole world of possibility opening in front of them - or the dismay at the hours wasted trying to catch some dodgy logic through layers and layers of object-oriented spaghetti.</p>

<p>Some argue for using only ‘notepad’ and no debugger because they can dry run their code on paper and they “don’t write bugs”, but I find it difficult to understand how they work within a larger team project or codebase. Many of the bugs and issues I have had to fix were not in code I wrote myself; they were in legacy systems, colleagues’ code, or in open source code that had been lifted into a project (and some hard as nails bugs to track, too!). If you believe in the impending AI coding apocalypse then human engineers may merely be around to debug and fix issues with AI generated code. So yeah, being able to write perfect code yourself is one thing, but using a debugger to debug existing code in a large complex project shouldn’t be a thing of shame, and we might need all the tools we can get to help.</p>

<p>Along with debuggers we get all sorts of other tools, which should also be used as and when we need them. Address sanitizer can catch memory issues easily: where in a bygone era we would have a 1-in-1000 crash somewhere reading outside of an array’s bounds, we can enable ASan and catch it every time, without the undefined behaviour lottery. The same goes for undefined behaviour sanitizer; now we can catch UB when it’s benign and not only when a noticeable side effect occurs.</p>

<p>I don’t know if these notepad-only coders are taking all of those tools off the table as well, but when you have something like ASan that can catch an issue for you I just don’t really know why you wouldn’t use it. I have seen a lot of comments that seem to suggest the debugger slows them down, but in this case I certainly think the debugger speeds you up.</p>

<p>So if you’re reading this and you don’t know about these tools, I would say take a look; they can be useful and might save you a lot of time. There are tons of things you can do and it’s hard to cover it all here. I learned a lot from working with other people, side by side, debugging difficult problems. I think there should be more resources to teach these skills instead of them being handed-down information.</p>

<h2 id="printf-debugging-is-ok">Printf debugging is OK</h2>

<p>For the ‘printf’ haters I would also say that, whilst I use a debugger most of the time, sometimes I revert to ‘printf’ debugging. In some situations there is no other choice: in the past I have had to debug release builds where we were unable to reproduce the bug in debug. Even pulling in debug modules for the engine (for on screen debug info) changed the executable such that we couldn’t reproduce the issue. The last resort was to put a few print statements in using the raw ‘printf’, removing them and adding more as we narrowed down the issue, until eventually we extracted enough information to fix the problem.</p>

<p>I have also needed ‘printf’ when debugging certain kinds of behaviours in an application. In the case of something like touch event tracking for mobile devices, if you try to debug an issue with breakpoints you interrupt the hardware, making it difficult to reproduce issues in the way they appear naturally. Here, printing the state of touch down events and touch up events and seeing the logical flow can identify a problem. There are many more scenarios that benefit from this type of debugging. Just throw the prints in and make sure to remove them after, so no one knows you were ever there, like a ninja.</p>

<h2 id="custom-tools">Custom tools</h2>

<p>Custom UI based debugging tools can go one step further than printf debugging, providing some similar traits but allowing more flexibility and controllability. I assume the notepad wielders who don’t use a regular debugger must have some such custom tools to help them track down issues. I am a big fan of embedded debugging and profiling tools within an application - stuff like performance counters that I can just pop up in a UI, or tweakable values to help refine behaviours or visual appearance. I find that since the explosion of ImGui, the level of integrated ad-hoc debugging tools and info has increased exponentially.</p>

<p>But with these kinds of custom tools, I personally wouldn’t try to reinvent the wheel; I aim to make stuff that complements the existing tools I can pull off the shelf. For example, I like to have a quick, at-a-glance profiler for all my key performance hotspots that I can check whenever I notice something, but for more in-depth profiling I would use a dedicated CPU or GPU profiler to dig deeper.</p>
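<p>The core of such an at-a-glance counter is tiny. Here is a sketch of the idea in Python, purely for brevity (in an engine it would be C++ feeding an ImGui window each frame; all names here are illustrative):</p>

```python
import time
from contextlib import contextmanager

# Rolling table of named timings, the kind of thing you would draw
# in a debug UI window every frame
timings = {}

@contextmanager
def scope(name):
    """Time a block of work and record the result in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

with scope("update"):
    total = sum(i * i for i in range(100_000))  # stand-in workload

with scope("render"):
    time.sleep(0.005)  # stand-in for real rendering work

for name, ms in timings.items():
    print(f"{name:10s} {ms:9.3f} ms")
```

<p>The value is in having the numbers always visible at a glance; when one of them looks wrong, that’s the cue to reach for a proper profiler.</p>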

<h2 id="just-doing-what-needs-to-be-done">Just doing what needs to be done</h2>

<p>At the end of the day, finding bugs is just something that we need to get done, whatever helps you find and fix the issue doesn’t bother me as long as we get the job done. On a closing note, I noticed some code in a pull request left in by accident by another person:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span><span class="p">(</span><span class="n">some_condition</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I found this interesting. I do the same thing, except I usually name my variable ‘a’. The idea is to insert some code where a breakpoint can be put on the ‘int x’ line, so that it acts like a conditional breakpoint when some_condition is true. You could use a conditional breakpoint within the debugger, but they can be slow and, for me, historically unreliable; this little snippet gives you your own conditional breakpoint that works without fail.</p>

<p>Just make sure to remove the code before the PR next time!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A balanced take on the printf vs debugger debate: why there is no single right answer, and how to choose the right debugging approach for the situation at hand.]]></summary></entry><entry><title type="html">Building a new graphics engine in Rust - Part 4</title><link href="https://polymonster.co.uk/blog/building-new-engine-4" rel="alternate" type="text/html" title="Building a new graphics engine in Rust - Part 4" /><published>2023-04-29T00:00:00+00:00</published><updated>2023-04-29T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/building-new_engine-4</id><content type="html" xml:base="https://polymonster.co.uk/blog/building-new-engine-4"><![CDATA[<p>Work has been continuing smoothly on my Rust graphics engine project <a href="https://github.com/polymonster/hotline">hotline</a> over the last month or so. I was slowly winding down from my current day job and have a little time off before starting a new role, so that has given me more time to dedicate to this project. I have been focusing on implementing different graphics demos and rendering techniques, which has thrown up a few missing pieces in the <code class="language-plaintext highlighter-rouge">gfx</code> backend and I am keen to get the API as complete as possible, because I am unsure of how much time I will have to work on it when I start my new role or even the validity of working on code in the public domain.</p>

<h2 id="tests">Tests</h2>

<p>I started out the project building unit tests of graphics functionality and had those hooked up to run locally or on a self-hosted GitHub Actions runner. As the project progressed I encountered some issues with the tests crashing or being unable to run when launched within the plugin environment. I have a nice system for switching between <code class="language-plaintext highlighter-rouge">demos</code> (in their current form they serve more as unit tests or examples), so I had been able to quickly, yet manually, run through the different examples to check things were in good shape after making changes or refactoring. But still, automation is better and I was missing the support and comfort it can bring. I needed to resolve a crash inside my <code class="language-plaintext highlighter-rouge">imgui</code> backend where font glyph ranges passed to <code class="language-plaintext highlighter-rouge">cimgui</code> were actually pointing to dropped memory - this issue never seemed to crop up in debug or release builds, only in the tests, so it went undiagnosed for a while. The fix was fairly straightforward: just ensuring the memory remained in scope for when it was used.</p>

<p>Another issue with running the tests is that only one application can lock / use a dynamic library at a time, otherwise <code class="language-plaintext highlighter-rouge">libloading</code> will panic. Rust tests are launched and run concurrently across multiple threads, so for the time being I have to run with <code class="language-plaintext highlighter-rouge">-- --test-threads=1</code>. This also helps with graphics related code, because spawning 36+ Direct3D12 devices simultaneously is not possible and causes some of the tests to panic early on - failure to create a device is just a hard panic and there is no error handling. I suppose in this case I could wait and retry, with some system there to at least allow some multithreading, but for the time being I am happy with the setup.</p>
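<p>As a middle ground between fully serial runs and free-for-all threading, tests that need exclusive device access could share a process-wide lock; this is just a sketch of that idea, not something hotline currently does:</p>

```rust
use std::sync::{Mutex, MutexGuard};

// process-wide lock shared by any test that needs exclusive device access
static DEVICE_LOCK: Mutex<()> = Mutex::new(());

// take the lock, recovering from poisoning so one panicking test
// doesn't wedge every test that runs after it
fn serial_guard() -> MutexGuard<'static, ()> {
    DEVICE_LOCK.lock().unwrap_or_else(|e| e.into_inner())
}

fn main() {
    // each test would hold the guard for its duration,
    // serialising device creation while other tests still run in parallel
    let _guard = serial_guard();
    // ... create device and run the test body here ...
}
```

<p>This keeps non-graphics tests free to run in parallel while device-owning tests queue up behind the lock.</p>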

<p>I also added support for each of the tests to take a grab of the backbuffer and write it to disk. I am not doing any kind of image comparison to automate pass or failure (currently, a test succeeds as long as it doesn’t panic or crash), but having the images is a nice way to glance over and manually verify that everything looks correct.</p>

<p><img src="https://raw.githubusercontent.com/polymonster/polymonster.github.io/master/images/hotline/example_thumbnails.png" width="100%" /></p>
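<p>If I ever want to automate pass/fail from these grabs, a naive per-byte comparison against a reference image would be a starting point; this is a hypothetical sketch, not anything wired into the test runner:</p>

```rust
/// Fraction of bytes that differ beyond a tolerance; 0.0 means identical
/// within tolerance. Both images are raw byte buffers of equal size.
fn image_diff_ratio(a: &[u8], b: &[u8], tolerance: u8) -> f64 {
    assert_eq!(a.len(), b.len(), "images must match in size");
    let differing = a.iter()
        .zip(b.iter())
        .filter(|(x, y)| x.abs_diff(**y) > tolerance)
        .count();
    differing as f64 / a.len() as f64
}

fn main() {
    let reference = vec![128u8; 1024];
    let mut capture = reference.clone();
    capture[0] = 255; // a single channel regressed
    // pass the test only if fewer than 1% of bytes differ
    assert!(image_diff_ratio(&reference, &capture, 4) < 0.01);
}
```

<p>A small tolerance absorbs harmless rasterisation differences between driver versions, which is the usual pain point with screenshot-based testing.</p>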

<p>Having these tests in place makes changing or refactoring the lower-level APIs easier and allows me to move quickly and confidently, which is exactly what I needed. They also run on the CI every time a commit is pushed, which helps to catch regressions where the shader data and the code become out of sync.</p>

<h2 id="hotline-data">Hotline Data</h2>

<p>With the tests in place I have made a few refactors and additions to the <code class="language-plaintext highlighter-rouge">gfx</code> API backend, using the tests to aid the process. In my previous post I mentioned that I created a separate data repository to keep the main repository size down, because crates.io has a 10 MB limit. I originally created the <a href="https://github.com/polymonster/hotline-data">hotline-data</a> repository and used <code class="language-plaintext highlighter-rouge">cargo</code> to clone and update it on a build, so that the examples would work whether coming from crates.io or GitHub. I have since decided against this and opted for the examples working when using the repository directly from GitHub; if you decide to use the library from crates.io then you configure data yourself for your own project. This subtle change enabled me to use <code class="language-plaintext highlighter-rouge">hotline-data</code> as a submodule, which in turn makes it easier to keep the data and the main repository in sync.</p>

<p>In the process of adding new graphics features I have had to make additions and changes to <a href="https://github.com/polymonster/pmfx-shader">pmfx-shader</a>. It comes bundled as a binary with the <code class="language-plaintext highlighter-rouge">hotline-data</code> repository, but while developing I switch to a development version, which is actually written in Python. Because things are moving quickly I have been frequently encountering new issues and switching to development mode. Now the submodules and the tests help to catch cases where I push to the repository with <code class="language-plaintext highlighter-rouge">pmfx-shader</code> in dev mode, so I can quickly fix it and keep the repository in a state where it is buildable for new users at all times.</p>

<h2 id="dropping-gpu-resources">Dropping GPU Resources</h2>

<p>I have previously mentioned challenges involving memory lifetime management between Rust and in-flight GPU resources, but I recently decided to bite the bullet and start handling these issues in the <code class="language-plaintext highlighter-rouge">Drop</code> trait for <code class="language-plaintext highlighter-rouge">gfx::Texture</code> and <code class="language-plaintext highlighter-rouge">gfx::Buffer</code>. I originally wanted to steer clear of this because it creates a dependency from a resource type to a <code class="language-plaintext highlighter-rouge">gfx::Heap</code> or a <code class="language-plaintext highlighter-rouge">gfx::Device</code>, which in turn throws in multithreading considerations. I wanted to keep the low-level backend as simple and as dumb as possible; however, from a user-facing point of view it’s just too easy to run into serious problems such as a GPU hang or device removal (due to dropping an in-flight resource), or leaking views in heaps. Dropping a resource is very easy in Rust: you can simply allow a <code class="language-plaintext highlighter-rouge">gfx::Texture</code> or <code class="language-plaintext highlighter-rouge">gfx::Buffer</code> to go out of scope, or assign a new one to a mutable variable.</p>

<p>The problem of dropping GPU resources started to rear its ugly head and force my hand as I set up some more complicated examples that load many textures. When switching between demos, textures were dropped as the <code class="language-plaintext highlighter-rouge">bevy_ecs</code> world was reset to default, but the associated shader resource views were not de-allocated from the shader heap. I also had issues with stretchy buffer types that resize like a vector, used for pushing debug draw lines, or for light data and draw data in the bindless setup. When resizing, the previous smaller buffer would still be in-flight on the GPU, and just dropping it in place would lead to undefined behaviour. This is where I really re-evaluated my thinking: from a user perspective it’s a lot to keep track of and easy to fall into the trap.</p>
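<p>The resize problem can be handled by retiring the old allocation rather than dropping it in place. This sketch uses a plain <code class="language-plaintext highlighter-rouge">Vec&lt;u8&gt;</code> as a stand-in for the GPU allocation (hypothetical types, not the hotline API), but the shape is the same: the smaller buffer is kept alive until the GPU can no longer reference it:</p>

```rust
// stand-in "stretchy" buffer: grows like a vector, but keeps old
// allocations alive instead of dropping them while possibly in-flight
struct GrowBuffer {
    data: Vec<u8>,         // stand-in for the current GPU allocation
    capacity: usize,       // current allocation size in bytes
    retired: Vec<Vec<u8>>, // old buffers, freed later once the GPU is done
}

impl GrowBuffer {
    fn push(&mut self, bytes: &[u8]) {
        if self.data.len() + bytes.len() > self.capacity {
            // allocate a larger buffer and copy the existing contents over
            let new_cap = (self.capacity * 2).max(self.data.len() + bytes.len());
            let mut bigger = Vec::with_capacity(new_cap);
            bigger.extend_from_slice(&self.data);
            // retire the old buffer instead of dropping it in place
            let old = std::mem::replace(&mut self.data, bigger);
            self.retired.push(old);
            self.capacity = new_cap;
        }
        self.data.extend_from_slice(bytes);
    }
}

fn main() {
    let mut buf = GrowBuffer { data: Vec::new(), capacity: 4, retired: Vec::new() };
    buf.push(&[1, 2, 3]);
    buf.push(&[4, 5, 6]); // forces a resize; the old buffer is retired
    assert_eq!(buf.data, vec![1, 2, 3, 4, 5, 6]);
    assert_eq!(buf.retired.len(), 1);
}
```

<p>In the real engine the <code class="language-plaintext highlighter-rouge">retired</code> list would be swept once the frames that could reference the old buffer have completed, which is exactly what the drop list described next generalises.</p>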

<p>In order to handle this I decided to just go for the full blown <code class="language-plaintext highlighter-rouge">Arc&lt;Mutex&gt;</code> wrapped around a <code class="language-plaintext highlighter-rouge">DropList</code> inside a <code class="language-plaintext highlighter-rouge">gfx::Heap</code>. All resources upon creation are assigned an <code class="language-plaintext highlighter-rouge">Arc&lt;Mutex&lt;DropList&gt;&gt;</code> for their respective heap, and inside the <code class="language-plaintext highlighter-rouge">Drop</code> trait their resource views are added to the <code class="language-plaintext highlighter-rouge">DropList</code>. In future I would like to consider a lockless approach, but as I have done for the rest of the project I am focusing on stability first and the <code class="language-plaintext highlighter-rouge">Mutex</code> approach has worked well so far. In order to take ownership and add to the <code class="language-plaintext highlighter-rouge">DropList</code>, the members inside a resource have now become <code class="language-plaintext highlighter-rouge">Option</code>s so it’s possible to trivially <code class="language-plaintext highlighter-rouge">std::mem::swap</code> them, which is not great for the code elsewhere, but it was a necessary change. I did try to just hack it and make a <code class="language-plaintext highlighter-rouge">null</code> version of an <code class="language-plaintext highlighter-rouge">ID3D12Resource</code>, but this ends up causing a crash inside <code class="language-plaintext highlighter-rouge">windows-rs</code> where a <code class="language-plaintext highlighter-rouge">v-table</code> is expected, so the optional approach felt necessary. It adds some clutter in the backend, which was admittedly rushed, but that’s the price you pay for stability I suppose.
When a resource is dropped the resource itself, any subresources (used for MSAA resolves), and any resource views are passed into the <code class="language-plaintext highlighter-rouge">DropList</code> owned by a heap:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// Structure to track resources and resource view allocations in `Drop` traits</span>
<span class="k">struct</span> <span class="n">DropResource</span> <span class="p">{</span>
    <span class="n">resources</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">ID3D12Resource</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">frame</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
    <span class="n">heap_allocs</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">DropList</span> <span class="p">{</span>
    <span class="n">list</span><span class="p">:</span> <span class="n">Mutex</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">DropResource</span><span class="o">&gt;&gt;</span>
<span class="p">}</span>

<span class="cd">/// Thread safe ref counted drop-list that can be safely used in drop traits,</span>
<span class="cd">/// tracks the frame a resource was dropped on so it can be waited on</span>
<span class="k">type</span> <span class="n">DropListRef</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="o">&lt;</span><span class="n">DropList</span><span class="o">&gt;</span><span class="p">;</span>

<span class="k">impl</span> <span class="n">DropList</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">new</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="o">&lt;</span><span class="n">DropList</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">DropList</span> <span class="p">{</span>
            <span class="n">list</span><span class="p">:</span> <span class="nn">Mutex</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">())</span>
        <span class="p">})</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="cd">/// Drop trait for a texture resource</span>
<span class="k">impl</span> <span class="nb">Drop</span> <span class="k">for</span> <span class="n">Texture</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">drop</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// compile-time const allows this feature to be omitted</span>
        <span class="k">if</span> <span class="n">MANAGE_DROPS</span> <span class="p">{</span>
            <span class="c1">// only grab resources if we have a drop list, this allows the swap chain rtv</span>
            <span class="c1">// to manage itself</span>
            <span class="k">let</span> <span class="k">mut</span> <span class="n">res_vec</span> <span class="o">=</span> <span class="k">if</span> <span class="k">self</span><span class="py">.drop_list</span><span class="nf">.is_some</span><span class="p">()</span> <span class="p">{</span>
                <span class="c1">// swap out the resources for None</span>
                <span class="k">let</span> <span class="k">mut</span> <span class="n">res</span> <span class="o">=</span> <span class="nb">None</span><span class="p">;</span>
                <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">swap</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">res</span><span class="p">,</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.resource</span><span class="p">);</span>
                <span class="k">let</span> <span class="k">mut</span> <span class="n">res_vec</span> <span class="o">=</span> <span class="nd">vec!</span><span class="p">[</span>
                    <span class="n">res</span><span class="nf">.unwrap</span><span class="p">()</span>
                <span class="p">];</span>
                <span class="k">if</span> <span class="k">self</span><span class="py">.resolved_resource</span><span class="nf">.is_some</span><span class="p">()</span> <span class="p">{</span>
                    <span class="k">let</span> <span class="k">mut</span> <span class="n">res</span> <span class="o">=</span> <span class="nb">None</span><span class="p">;</span>
                    <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">swap</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">res</span><span class="p">,</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.resolved_resource</span><span class="p">);</span>
                    <span class="n">res_vec</span><span class="nf">.push</span><span class="p">(</span><span class="n">res</span><span class="nf">.unwrap</span><span class="p">());</span>
                <span class="p">}</span>
                <span class="n">res_vec</span>
            <span class="p">}</span>
            <span class="k">else</span> <span class="p">{</span>
                <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span>
            <span class="p">};</span>
            <span class="c1">// texture resource views</span>
            <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">drop_list</span><span class="p">)</span> <span class="o">=</span> <span class="o">&amp;</span><span class="k">self</span><span class="py">.drop_list</span> <span class="p">{</span>
                <span class="c1">// add resources to the drop list</span>
                <span class="k">let</span> <span class="k">mut</span> <span class="n">drop_list</span> <span class="o">=</span> <span class="n">drop_list</span><span class="py">.list</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
                <span class="k">let</span> <span class="k">mut</span> <span class="n">drop_res</span> <span class="o">=</span> <span class="n">DropResource</span> <span class="p">{</span>
                    <span class="n">resources</span><span class="p">:</span> <span class="n">res_vec</span><span class="nf">.to_vec</span><span class="p">(),</span>
                    <span class="n">frame</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
                    <span class="n">heap_allocs</span><span class="p">:</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span>
                <span class="p">};</span>
                <span class="n">res_vec</span><span class="nf">.clear</span><span class="p">();</span>

                <span class="c1">// add resource views to the drop list</span>
                <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">srv_index</span><span class="p">)</span> <span class="o">=</span> <span class="k">self</span><span class="py">.srv_index</span> <span class="p">{</span>
                    <span class="n">drop_res</span><span class="py">.heap_allocs</span><span class="nf">.push</span><span class="p">(</span><span class="n">srv_index</span><span class="p">);</span>
                <span class="p">}</span>
                <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">uav_index</span><span class="p">)</span> <span class="o">=</span> <span class="k">self</span><span class="py">.uav_index</span> <span class="p">{</span>
                    <span class="n">drop_res</span><span class="py">.heap_allocs</span><span class="nf">.push</span><span class="p">(</span><span class="n">uav_index</span><span class="p">);</span>
                <span class="p">}</span>
                <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">resolved_srv</span><span class="p">)</span> <span class="o">=</span> <span class="k">self</span><span class="py">.resolved_srv_index</span> <span class="p">{</span>
                    <span class="n">drop_res</span><span class="py">.heap_allocs</span><span class="nf">.push</span><span class="p">(</span><span class="n">resolved_srv</span><span class="p">);</span>
                <span class="p">}</span>
                <span class="n">drop_list</span><span class="nf">.push</span><span class="p">(</span><span class="n">drop_res</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// Resources are now `Option`s, which adds extra baggage in the code elsewhere.</span>

<span class="c1">// for example, obtaining a resource description:</span>
<span class="k">let</span> <span class="n">desc</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="n">target</span><span class="py">.resource</span><span class="nf">.as_ref</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.GetDesc</span><span class="p">()</span> <span class="p">};</span> <span class="c1">// as ref, unwrap</span>

<span class="c1">// or building a transition barrier:</span>
<span class="k">let</span> <span class="n">barrier</span> <span class="o">=</span> <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">tex</span><span class="p">)</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">barrier</span><span class="py">.texture</span> <span class="p">{</span>
    <span class="nf">transition_barrier</span><span class="p">(</span>
        <span class="n">tex</span><span class="py">.resource</span><span class="nf">.as_ref</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">(),</span>
        <span class="c1">// ..</span>
    <span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We need to manually sweep and clean things up. This step is out of line with Rust’s memory model, but we need to synchronise the delete with the swap chain to ensure that any remaining references are complete on the GPU. So at the end of each frame there is a little housekeeping to do, where we check the current frame number against the frame in which the resource was dropped. To avoid a dependency on the <code class="language-plaintext highlighter-rouge">SwapChain</code> during the <code class="language-plaintext highlighter-rouge">Drop</code> itself, the frame index is initialised to zero and set the first time we call <code class="language-plaintext highlighter-rouge">cleanup</code>. The cleanup code finally drops internal Direct3D12 resources when it is safe to do so, and then adds the associated resource views in heaps onto a <code class="language-plaintext highlighter-rouge">FreeList</code> so the handles can be recycled when a new allocation is made.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">cleanup_dropped_resources</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">swap_chain</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">SwapChain</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// lock the drop and free lists so this is thread safe</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">drop_list</span> <span class="o">=</span> <span class="k">self</span><span class="py">.drop_list.list</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">free_list</span> <span class="o">=</span> <span class="k">self</span><span class="py">.free_list.list</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">complete_indices</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">res_index</span><span class="p">,</span> <span class="n">drop_res</span><span class="p">)</span> <span class="k">in</span> <span class="n">drop_list</span><span class="nf">.iter_mut</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span>
        <span class="c1">// initialise the frame, and then wait</span>
        <span class="k">if</span> <span class="n">drop_res</span><span class="py">.frame</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
            <span class="n">drop_res</span><span class="py">.frame</span> <span class="o">=</span> <span class="n">swap_chain</span><span class="py">.frame_index</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">else</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">diff</span> <span class="o">=</span> <span class="n">swap_chain</span><span class="py">.frame_index</span> <span class="o">-</span> <span class="n">drop_res</span><span class="py">.frame</span><span class="p">;</span>
            <span class="k">if</span> <span class="n">diff</span> <span class="o">&gt;</span> <span class="n">swap_chain</span><span class="py">.num_bb</span> <span class="k">as</span> <span class="nb">usize</span> <span class="p">{</span>
                <span class="c1">// waited long enough we can add the resource views to the free list</span>
                <span class="k">for</span> <span class="n">alloc</span> <span class="k">in</span> <span class="o">&amp;</span><span class="n">drop_res</span><span class="py">.heap_allocs</span> <span class="p">{</span>
                    <span class="n">free_list</span><span class="nf">.push</span><span class="p">(</span><span class="o">*</span><span class="n">alloc</span><span class="p">);</span>
                <span class="p">}</span>
                <span class="n">drop_res</span><span class="py">.resources</span><span class="nf">.clear</span><span class="p">();</span>
                <span class="n">drop_res</span><span class="py">.heap_allocs</span><span class="nf">.clear</span><span class="p">();</span>
                <span class="n">complete_indices</span><span class="nf">.push</span><span class="p">(</span><span class="n">res_index</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1">// remove complete items in reverse</span>
    <span class="n">complete_indices</span><span class="nf">.reverse</span><span class="p">();</span>
    <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="n">complete_indices</span> <span class="p">{</span>
        <span class="n">drop_list</span><span class="nf">.remove</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// call cleanup at the end of each frame; this could be deferred or run at different times</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">run</span><span class="p">(</span><span class="k">mut</span> <span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="k">super</span><span class="p">::</span><span class="n">Error</span><span class="o">&gt;</span> <span class="p">{</span>

    <span class="c1">// ..</span>

    <span class="c1">// cleanup heaps</span>
    <span class="k">self</span><span class="py">.pmfx.shader_heap</span><span class="nf">.cleanup_dropped_resources</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="py">.swap_chain</span><span class="p">);</span>
    <span class="k">self</span><span class="py">.device</span><span class="nf">.cleanup_dropped_resources</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="py">.swap_chain</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I would note that at this point the <code class="language-plaintext highlighter-rouge">gfx::Texture</code> struct has become a chunky 148 bytes (not just from the extra requirements for <code class="language-plaintext highlighter-rouge">Drop</code>, but also subresource management, render targets, depth stencils and so on). It’s not something I am super keen on, but since we pass around only <code class="language-plaintext highlighter-rouge">usize</code> shader resource handles to reference textures in a shader, the texture struct itself can be a heavyweight resource and won’t likely be required during ECS iteration and the like; it’s more that you create it once and keep hold of it so the memory remains in scope while it is used on the GPU.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Clone)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">Texture</span> <span class="p">{</span>
    <span class="n">resource</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">ID3D12Resource</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">resolved_resource</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">ID3D12Resource</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">resolved_format</span><span class="p">:</span> <span class="n">DXGI_FORMAT</span><span class="p">,</span>
    <span class="n">rtv</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">TextureTarget</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">dsv</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">TextureTarget</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">srv_index</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">resolved_srv_index</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">uav_index</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">subresource_uav_index</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">shared_handle</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">HANDLE</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="c1">// drop list for srv, uav and resolved srv</span>
    <span class="n">drop_list</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">DropListRef</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="c1">// the id of the shader heap for (uav, srv etc)</span>
    <span class="n">shader_heap_id</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">u16</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I did consider a mechanism to pass a reference to a command buffer, so that dropping a reference to a texture wouldn’t actually drop the internal resource, but in a bindless rendering setup this becomes a lot harder to track. You are just indexing into descriptor arrays on the GPU and don’t have to physically bind anything as you would in a bindful rendering architecture, so I am happy with the <code class="language-plaintext highlighter-rouge">Drop</code> trait handling for the time being. I can still foresee a lot of potential pitfalls, but it’s a step in the right direction.</p>
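<p>To illustrate why the heavyweight struct is tolerable, here is a sketch (with hypothetical stand-in types, not the actual hotline API) of the bindless idea: only small descriptor indices travel per draw, while the texture structs stay owned elsewhere to keep the GPU memory alive:</p>

```rust
// hypothetical stand-in for the heavyweight texture struct; the srv_index
// is the only part that needs to reach the GPU per draw
struct MockTexture {
    srv_index: usize, // index into the shader-visible descriptor heap
}

// per-draw data the shader indexes with; just a couple of heap indices
#[repr(C)]
struct DrawData {
    albedo_srv: u32,
    normal_srv: u32,
}

fn draw_data(albedo: &MockTexture, normal: &MockTexture) -> DrawData {
    DrawData {
        albedo_srv: albedo.srv_index as u32,
        normal_srv: normal.srv_index as u32,
    }
}

fn main() {
    // the textures live somewhere long-lived; the draw data is tiny
    let albedo = MockTexture { srv_index: 7 };
    let normal = MockTexture { srv_index: 8 };
    let dd = draw_data(&albedo, &normal);
    assert_eq!(dd.albedo_srv, 7);
    // only 8 bytes of indices per draw, not the ~148-byte texture struct
    assert_eq!(std::mem::size_of::<DrawData>(), 8);
}
```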

<h2 id="resource-heaps">Resource Heaps</h2>

<p>Initially I tried to abstract away the <code class="language-plaintext highlighter-rouge">Heap</code> concept for speed of development - keeping a <code class="language-plaintext highlighter-rouge">Heap</code> as part of a <code class="language-plaintext highlighter-rouge">Device</code> makes it possible to call <code class="language-plaintext highlighter-rouge">device.create_texture()</code> in a Direct3D11 kind of way. I still like this approach for quick demos and noodling around, but as more complexity emerged in the higher level <code class="language-plaintext highlighter-rouge">pmfx</code> library and entity component system it became clear there would be a benefit in being able to create and manage your own heaps. This allows heaps to be dynamically resized and resources to be re-allocated or moved; an entire heap could be thrown away between levels in a game or when switching between projects. I decided to allow both methods by having the following set of functions.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Create buffer and create texture will add resource views into an internally managed heap owned by the device</span>
<span class="k">fn</span> <span class="n">create_buffer</span><span class="o">&lt;</span><span class="n">T</span><span class="p">:</span> <span class="nb">Sized</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
    <span class="n">info</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">BufferInfo</span><span class="p">,</span>
    <span class="n">data</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="n">T</span><span class="p">]</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">Buffer</span><span class="p">,</span> <span class="n">Error</span><span class="o">&gt;</span><span class="p">;</span>

<span class="k">fn</span> <span class="n">create_texture</span><span class="o">&lt;</span><span class="n">T</span><span class="p">:</span> <span class="nb">Sized</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
    <span class="n">info</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">TextureInfo</span><span class="p">,</span>
    <span class="n">data</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="n">T</span><span class="p">]</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">Texture</span><span class="p">,</span> <span class="n">Error</span><span class="o">&gt;</span><span class="p">;</span>

<span class="c1">// pass a heap to create buffer resource views on a user managed heap</span>
<span class="k">fn</span> <span class="n">create_buffer_with_heap</span><span class="o">&lt;</span><span class="n">T</span><span class="p">:</span> <span class="nb">Sized</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
    <span class="n">info</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">BufferInfo</span><span class="p">,</span>
    <span class="n">data</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="n">T</span><span class="p">]</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">heap</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">Self</span><span class="p">::</span><span class="n">Heap</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">Buffer</span><span class="p">,</span> <span class="n">Error</span><span class="o">&gt;</span><span class="p">;</span>

<span class="c1">// textures might require multiple heaps; you can provide your own or use the device managed heaps with None</span>

<span class="k">pub</span> <span class="k">struct</span> <span class="n">TextureHeapInfo</span><span class="o">&lt;</span><span class="nv">'stack</span><span class="p">,</span> <span class="n">D</span><span class="p">:</span> <span class="n">Device</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="cd">/// Heap to allocate shader resource views and un-ordered access views</span>
    <span class="k">pub</span> <span class="n">shader</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="nv">'stack</span> <span class="k">mut</span> <span class="nn">D</span><span class="p">::</span><span class="n">Heap</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Heap to allocate render target views</span>
    <span class="k">pub</span> <span class="n">render_target</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="nv">'stack</span> <span class="k">mut</span> <span class="nn">D</span><span class="p">::</span><span class="n">Heap</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Heap to allocate depth stencil views</span>
    <span class="k">pub</span> <span class="n">depth_stencil</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="nv">'stack</span> <span class="k">mut</span> <span class="nn">D</span><span class="p">::</span><span class="n">Heap</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>

<span class="c1">// create texture with user managed heaps</span>
<span class="k">fn</span> <span class="n">create_texture_with_heaps</span><span class="o">&lt;</span><span class="n">T</span><span class="p">:</span> <span class="nb">Sized</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
    <span class="n">info</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">TextureInfo</span><span class="p">,</span>
    <span class="n">heaps</span><span class="p">:</span> <span class="n">TextureHeapInfo</span><span class="o">&lt;</span><span class="k">Self</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">data</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="n">T</span><span class="p">]</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">Texture</span><span class="p">,</span> <span class="n">Error</span><span class="o">&gt;</span><span class="p">;</span>
</code></pre></div></div>

<p>This allows for maximum flexibility, providing low level control where you need it or simpler ergonomics when you don’t.</p>

<h3 id="imgui-image-rendering-with-multiple-heaps">ImGui Image Rendering With Multiple Heaps</h3>

<p>Allowing user specified heaps threw up problems with imgui image rendering. Because a <code class="language-plaintext highlighter-rouge">Texture</code> may now reside in any one of several heaps, a simple call to <code class="language-plaintext highlighter-rouge">imgui.image(texture)</code> did not provide enough context. Previously I was relying on a single program-wide shader resource heap that was internally managed by the <code class="language-plaintext highlighter-rouge">gfx::Device</code>. Some data still resides in that heap (the imgui font texture, for example), so I needed a way to pass this information around. Luckily, thanks to the earlier changes for dropping GPU resources, a <code class="language-plaintext highlighter-rouge">gfx::Texture</code> already carried extra information I could use. Imgui images work by passing around a <code class="language-plaintext highlighter-rouge">void*</code> as an <code class="language-plaintext highlighter-rouge">ImTextureID</code>. With Rust lifetimes in mind I did not want to make this a full-blown reference, because all we really need is the shader resource view handle, which is stored as a <code class="language-plaintext highlighter-rouge">usize</code>. The handles are allocated linearly inside a <code class="language-plaintext highlighter-rouge">gfx::Heap</code> and managed with a free list, so a full range of 64 bits for shader resource handles is more than enough; even 32 bits is probably excessive. Each <code class="language-plaintext highlighter-rouge">gfx::Heap</code> is assigned a sequential ID upon creation, so I took the approach of packing the shader resource view handle and heap ID together into 64 bits: the upper 16 bits hold the heap ID and the lower 48 hold the shader resource view handle.
When passing a texture to <code class="language-plaintext highlighter-rouge">imgui.image</code> this process is handled for you, and then when we come to render imgui we just need to provide a vector of any additional user heaps.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// pass an array of heap references to imgui render. empty vector will use the device heap only</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">render</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
    <span class="n">app</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">A</span><span class="p">,</span>
    <span class="n">main_window</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nn">A</span><span class="p">::</span><span class="n">Window</span><span class="p">,</span>
    <span class="n">device</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">D</span><span class="p">,</span>
    <span class="n">cmd</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nn">D</span><span class="p">::</span><span class="n">CmdBuf</span><span class="p">,</span>
    <span class="n">image_heaps</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="nn">D</span><span class="p">::</span><span class="n">Heap</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1">// code to unpack the 16bit heap id and 48bit srv id</span>
<span class="k">fn</span> <span class="nf">to_srv_heap_id</span><span class="p">(</span><span class="n">tex_id</span><span class="p">:</span> <span class="o">*</span><span class="k">mut</span> <span class="nn">cty</span><span class="p">::</span><span class="nb">c_void</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">usize</span><span class="p">,</span> <span class="nb">u16</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">mask</span> <span class="o">=</span> <span class="mi">0x0000ffffffffffff</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">srv_id</span> <span class="o">=</span> <span class="p">(</span><span class="n">tex_id</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">heap_id</span> <span class="o">=</span> <span class="p">((</span><span class="n">tex_id</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">!</span><span class="n">mask</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">48</span><span class="p">;</span>
    <span class="p">(</span><span class="n">srv_id</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">heap_id</span> <span class="k">as</span> <span class="nb">u16</span><span class="p">)</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">render_draw_data</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">// ..</span>

    <span class="c1">// extract srv and heap id from the packed texture id</span>
    <span class="k">let</span> <span class="p">(</span><span class="n">srv</span><span class="p">,</span> <span class="n">heap_id</span><span class="p">)</span> <span class="o">=</span> <span class="nf">to_srv_heap_id</span><span class="p">(</span><span class="n">imgui_cmd</span><span class="py">.TextureId</span><span class="p">);</span>
    <span class="k">if</span> <span class="n">heap_id</span> <span class="o">==</span> <span class="n">device</span><span class="nf">.get_shader_heap</span><span class="p">()</span><span class="nf">.get_heap_id</span><span class="p">()</span> <span class="p">{</span>
        <span class="c1">// bind the device heap</span>
        <span class="n">cmd</span><span class="nf">.set_binding</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">device</span><span class="nf">.get_shader_heap</span><span class="p">(),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">srv</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">else</span> <span class="p">{</span>
        <span class="c1">// bind srv in another heap</span>
        <span class="k">for</span> <span class="n">heap</span> <span class="k">in</span> <span class="n">image_heaps</span> <span class="p">{</span>
            <span class="k">if</span> <span class="n">heap</span><span class="nf">.get_heap_id</span><span class="p">()</span> <span class="o">==</span> <span class="n">heap_id</span> <span class="p">{</span>
                <span class="n">cmd</span><span class="nf">.set_binding</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">heap</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">srv</span><span class="p">);</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
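<p>For illustration, the packing side of this scheme can be sketched as a pure function alongside the unpack shown above (a minimal sketch; <code class="language-plaintext highlighter-rouge">to_packed_texture_id</code> is a hypothetical helper name, not part of hotline):</p>

```rust
// upper 16 bits = heap id, lower 48 bits = shader resource view handle
const SRV_MASK: u64 = 0x0000_ffff_ffff_ffff;

// hypothetical inverse of to_srv_heap_id
fn to_packed_texture_id(srv_id: usize, heap_id: u16) -> u64 {
    // a real implementation may want to assert the srv handle fits in 48 bits
    debug_assert!((srv_id as u64) <= SRV_MASK);
    ((heap_id as u64) << 48) | (srv_id as u64 & SRV_MASK)
}

fn to_srv_heap_id(tex_id: u64) -> (usize, u16) {
    ((tex_id & SRV_MASK) as usize, (tex_id >> 48) as u16)
}

fn main() {
    // round-trips cleanly: pack then unpack returns the original pair
    let packed = to_packed_texture_id(1234, 7);
    assert_eq!(to_srv_heap_id(packed), (1234, 7));
}
```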

<p>There is more than enough headroom in the 48 and 16 bits for any sane use case, so I began to think I could pack some extra info in there as well; the texture type and render flags could easily fit into 8 bits to allow changing shaders and rendering textures differently through imgui. This would open up the opportunity to provide a more fully featured texture viewer, with alpha masking or different kinds of controls. I had implemented something similar before using custom callbacks, but I didn’t really like the overall architecture, and just packing data into the <code class="language-plaintext highlighter-rouge">ImTextureID</code> is much nicer. That’s something on the backburner for another day.</p>

<h2 id="gpu-hangs">GPU-Hangs</h2>

<p>I started to encounter intermittent, random GPU hangs and device removals as I began to add more complicated examples. When these occurred I would get misleading call stacks from crashes during the stack unwind, and no D3D12 validation errors or messages. I spent a while trying to pin down the problem in old school fashion, commenting out code and simplifying, but I could never quite put my finger on what exactly was going on. One particular example with bindless draw, material, and light data lookups, and a fair amount of indirection, would sometimes crash on startup, but not all the time. Once the sample had booted it was stable. Other hangs would occur in some basic examples, but only when switching between them. It was as if the first frame was intermittently unstable, and if you got past that then everything would be OK. I verified the indices coming into the shaders and I did resolve a few things that looked like they might be the cause, namely some out-of-order updates where render calls were being made before updates, but these ended up being red herrings.</p>

<p>This sort of problem is <em>really</em> annoying to search for on the internet. Just searching for <code class="language-plaintext highlighter-rouge">DXGI_DEVICE_HUNG</code> throws up results from various commercial games, with users on reddit and steam complaining about their games not working. So much clutter with useless info (update your drivers, get a new GPU), when I wanted some developer focused information! I managed to find some forum posts on gamedev that mentioned GPU-based validation, which can be enabled via the <code class="language-plaintext highlighter-rouge">ID3D12Debug1</code> interface:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// enable debug layer</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">dxgi_factory_flags</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span> <span class="nd">cfg!</span><span class="p">(</span><span class="n">debug_assertions</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">debug</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">D3D12DebugVersion</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nb">None</span><span class="p">;</span>
    <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">debug</span><span class="p">)</span> <span class="o">=</span> <span class="nf">D3D12GetDebugInterface</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">debug</span><span class="p">)</span><span class="nf">.ok</span><span class="p">()</span><span class="nf">.and</span><span class="p">(</span><span class="n">debug</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">debug</span><span class="nf">.EnableDebugLayer</span><span class="p">();</span>

        <span class="c1">// slower but more detailed GPU validation</span>
        <span class="k">if</span> <span class="n">GPU_VALIDATION</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">debug1</span> <span class="p">:</span> <span class="n">ID3D12Debug1</span> <span class="o">=</span> <span class="n">debug</span><span class="nf">.cast</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
            <span class="n">debug1</span><span class="nf">.SetEnableGPUBasedValidation</span><span class="p">(</span><span class="k">true</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="nd">println!</span><span class="p">(</span><span class="s">"hotline_rs::gfx::d3d12: enabling debug layer"</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">dxgi_factory_flags</span> <span class="o">=</span> <span class="n">DXGI_CREATE_FACTORY_DEBUG</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was able to reproduce the hangs after a few attempts and got some debug output, progress!</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">D3D12</span> <span class="n">ERROR</span><span class="p">:</span> <span class="n">GPU</span><span class="o">-</span><span class="n">BASED</span> <span class="n">VALIDATION</span><span class="p">:</span> <span class="n">Draw</span><span class="p">,</span> <span class="n">Uninitialized</span> <span class="n">root</span> <span class="n">argument</span> <span class="n">accessed</span><span class="py">. Shader</span> <span class="n">Stage</span><span class="p">:</span> <span class="n">VERTEX</span><span class="p">,</span> 
<span class="n">Root</span> <span class="n">Parameter</span> <span class="nb">Index</span><span class="p">:</span> <span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">Draw</span> <span class="nb">Index</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">Shader</span> <span class="n">Code</span><span class="p">:</span> <span class="o">&lt;</span><span class="n">couldn</span><span class="nv">'t</span> <span class="n">find</span> <span class="n">file</span> <span class="n">location</span> <span class="k">in</span> <span class="n">debug</span> <span class="n">info</span><span class="o">&gt;</span><span class="p">,</span> 
<span class="n">Asm</span> <span class="n">Instruction</span> <span class="n">Range</span><span class="p">:</span> <span class="p">[</span><span class="mi">0xd</span><span class="o">-</span><span class="mi">0xffffffff</span><span class="p">],</span> <span class="n">Asm</span> <span class="n">Operand</span> <span class="nb">Index</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> 
<span class="n">Command</span> <span class="n">List</span><span class="p">:</span> <span class="mi">0x00000232EE9CC450</span><span class="p">:</span><span class="nv">'Unnamed</span> <span class="n">ID3D12GraphicsCommandList</span> <span class="n">Object</span><span class="err">'</span><span class="p">,</span> 
<span class="n">SRV</span><span class="o">/</span><span class="n">UAV</span><span class="o">/</span><span class="n">CBV</span> <span class="n">Descriptor</span> <span class="n">Heap</span><span class="p">:</span> <span class="mi">0x00000232EDB1C3C0</span><span class="p">:</span><span class="nv">'Unnamed</span> <span class="n">ID3D12DescriptorHeap</span> <span class="n">Object</span><span class="err">'</span><span class="p">,</span> 
<span class="n">Sampler</span> <span class="n">Descriptor</span> <span class="n">Heap</span><span class="p">:</span> <span class="o">&lt;</span><span class="n">not</span> <span class="n">set</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">Pipeline</span> <span class="n">State</span><span class="p">:</span> <span class="mi">0x0000023299DFCB30</span><span class="p">:</span><span class="nv">'Unnamed</span> <span class="n">ID3D12PipelineState</span> <span class="n">Object</span><span class="err">'</span><span class="p">,</span>  
<span class="p">[</span> <span class="n">EXECUTION</span> <span class="n">ERROR</span> <span class="err">#</span><span class="mi">935</span><span class="p">:</span> <span class="n">GPU_BASED_VALIDATION_ROOT_ARGUMENT_UNINITIALIZED</span><span class="p">]</span>
</code></pre></div></div>

<p>At this point I was happy; I don’t mind something being wrong, especially if there is some validation telling me it’s wrong. It gives me the opportunity to work with it, fix the validation, and then fix the symptoms. I do feel stressed if I am encountering a problem with no errors, warnings or validation messages, because it makes me feel like I’m straying into the territory of broken hardware or drivers, which can be harder to work around. Having said that, in my experience such issues account for an extremely small percentage of the problems I have encountered in my life as a programmer; almost all issues I have ever faced have been self-inflicted.</p>

<p>This validation error brought me back to shader registers, spaces, and root parameter indices. My original code grouped all descriptor ranges by visibility and then created a root parameter per shader visibility, which I go into in more detail in the next section. But while I am here on the topic of GPU hangs and device removals, I encountered another problem that did not produce any validation output.</p>

<p>The cause of the issue came from populating <code class="language-plaintext highlighter-rouge">IndirectArgument</code> unordered access buffers on the GPU as part of a GPU driven rendering setup. It took a while to track down because of the lack of information, but thanks to the useful size and alignment hints supplied by vscode I was able to notice that the size of the indirect argument structure was larger than expected: some padding was being added at the end. For all structures in use on the GPU I was using <code class="language-plaintext highlighter-rouge">#[repr(C)]</code> and I thought this would be enough to prevent this kind of problem, but in the end I needed to change to <code class="language-plaintext highlighter-rouge">#[repr(packed)]</code> to prevent any padding being added.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// size = 72, align = 8</span>
<span class="nd">#[repr(C)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">DrawIndirectArgs</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">vertex_buffer</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">VertexBufferView</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">index_buffer</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">IndexBufferView</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">ids</span><span class="p">:</span> <span class="n">Vec4u</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">args</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">DrawIndexedArguments</span><span class="p">,</span>
<span class="p">}</span>

<span class="c1">// size = 68, align = 1</span>
<span class="nd">#[repr(packed)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">DrawIndirectArgs</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">vertex_buffer</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">VertexBufferView</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">index_buffer</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">IndexBufferView</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">ids</span><span class="p">:</span> <span class="n">Vec4u</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">args</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">DrawIndexedArguments</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
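<p>The difference is easy to reproduce in isolation. A minimal sketch (the field types below are stand-ins, not the real hotline view types; the key detail is a 64-bit GPU virtual address in the view, which is what forces 8-byte alignment and the trailing padding):</p>

```rust
use std::mem::size_of;

// stand-in for a buffer view containing a 64-bit GPU virtual address
#[repr(C)]
struct ReprC {
    location: u64,   // forces 8-byte alignment for the whole struct
    size_bytes: u32,
    stride: u32,     // 16 bytes of fields so far
    args: [u32; 5],  // 20 bytes of draw arguments, 36 bytes total
}

#[repr(packed)]
struct ReprPacked {
    location: u64,
    size_bytes: u32,
    stride: u32,
    args: [u32; 5],
}

fn main() {
    // repr(C) rounds the struct size up to a multiple of its alignment (8),
    // so 36 bytes of fields become 40, and the GPU would read garbage padding
    assert_eq!(size_of::<ReprC>(), 40);
    // repr(packed) keeps the exact sum of the field sizes
    assert_eq!(size_of::<ReprPacked>(), 36);
}
```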

<h2 id="bindless-rendering">Bindless Rendering</h2>

<p>I had done some initial exploratory work into bindless rendering in the very early stages of this project, but recently I started needing more data accessible on the GPU, which highlighted changes that needed to be made from both a functionality and a usability perspective. With the aforementioned GPU hangs being caused by the bindless setup, I started to look into it in more detail. The naming around this aspect of modern graphics APIs is quite confusing; it’s different in Vulkan and Metal, and I don’t find <code class="language-plaintext highlighter-rouge">Descriptors</code>, <code class="language-plaintext highlighter-rouge">RootSignatures</code> or <code class="language-plaintext highlighter-rouge">RootConstants</code> very intuitive names to begin with. Since I had worked with Vulkan first I stuck with <code class="language-plaintext highlighter-rouge">PipelineLayout</code>, <code class="language-plaintext highlighter-rouge">Descriptors</code> and <code class="language-plaintext highlighter-rouge">PushConstants</code>. I did have to do a little backtracking here just to make sure everything was consistently named, and in the context of this post I hope the concepts are clear enough to follow. A <code class="language-plaintext highlighter-rouge">PipelineLayout</code> describes the <code class="language-plaintext highlighter-rouge">Descriptors</code>, <code class="language-plaintext highlighter-rouge">PushConstants</code> and <code class="language-plaintext highlighter-rouge">Samplers</code> used by a pipeline, where <code class="language-plaintext highlighter-rouge">Descriptors</code> are arrays of resources such as textures, structured buffers, or constant buffers.
<code class="language-plaintext highlighter-rouge">PushConstants</code> are a small amount of data that we can push into the command buffer from the CPU and access in a shader, and <code class="language-plaintext highlighter-rouge">Samplers</code> are used to sample textures; they are the only part of all this that feels familiar from older graphics APIs.</p>

<p><code class="language-plaintext highlighter-rouge">PipelineLayouts</code> are automatically generated by my shader system <a href="https://github.com/polymonster/pmfx-shader">pmfx-shader</a>. Based on resource usage in shaders and a small amount of metadata, the pmfx-shader system is able to parse the code and automatically generate the layout. Binding heaps and push constants can be a little confusing because in the shader we specify which <code class="language-plaintext highlighter-rouge">register</code> and <code class="language-plaintext highlighter-rouge">space</code> to bind to, and in the old days of bindful rendering we would bind a texture onto the designated <code class="language-plaintext highlighter-rouge">register</code>, or <code class="language-plaintext highlighter-rouge">slot</code> as I often call them. I discovered that even with different <code class="language-plaintext highlighter-rouge">registers</code> or <code class="language-plaintext highlighter-rouge">spaces</code>, the binding <code class="language-plaintext highlighter-rouge">slot</code> or ‘root parameter index’ (as Direct3D12 calls it) might not be what you expect, due to the auto-generated layout from <code class="language-plaintext highlighter-rouge">pmfx-shader</code>. Direct3D12 allows multiple descriptor ranges to be bound to the same slot; I am not sure of the benefit of grouping more descriptors onto the same slot versus keeping them separate, but they do at the very least need to be grouped by shader visibility, which is one of <code class="language-plaintext highlighter-rouge">Vertex</code>, <code class="language-plaintext highlighter-rouge">Fragment</code>, <code class="language-plaintext highlighter-rouge">Compute</code> etc, or <code class="language-plaintext highlighter-rouge">All</code> if they are to be bound and accessible on multiple stages.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// `PipelineLayout` is required to create a pipeline it describes the layout of resources for access on the GPU.</span>
<span class="nd">#[derive(Default,</span> <span class="nd">Clone,</span> <span class="nd">Serialize,</span> <span class="nd">Deserialize)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">PipelineLayout</span> <span class="p">{</span>
    <span class="cd">/// Vector of `DescriptorBinding` which are arrays of textures, samplers or structured buffers, etc</span>
    <span class="k">pub</span> <span class="n">bindings</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">DescriptorBinding</span><span class="o">&gt;&gt;</span><span class="p">,</span>
    <span class="cd">/// Small amounts of data that can be pushed into a command buffer and available as data in shaders</span>
    <span class="k">pub</span> <span class="n">push_constants</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">PushConstantInfo</span><span class="o">&gt;&gt;</span><span class="p">,</span>
    <span class="cd">/// Static samplers that come along with the pipeline</span>
    <span class="k">pub</span> <span class="n">static_samplers</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SamplerBinding</span><span class="o">&gt;&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I toyed with a few implementations first, all of which felt cumbersome, and I read through the <a href="https://learn.microsoft.com/en-us/windows/win32/direct3d12/resource-binding-in-hlsl">msdn docs</a> about bindings multiple times; it took more than a while to sink in. In an attempt to make this as simple as possible for the user, you can supply a vector of descriptors when creating a <code class="language-plaintext highlighter-rouge">PipelineLayout</code>. Under the hood the descriptors are grouped by type (srv, uav, cbv), register and shader visibility. This gives unique slots for different types of resources and opens the door to a bindful rendering model where that may be useful.</p>
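<p>The grouping described above can be sketched roughly as follows (the type names are illustrative stand-ins, not the actual hotline implementation):</p>

```rust
use std::collections::HashMap;

// illustrative stand-ins for the descriptor metadata pmfx-shader emits
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum DescriptorType { Srv, Uav, Cbv }

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Visibility { Vertex, Fragment, Compute, All }

struct DescriptorBinding {
    ty: DescriptorType,
    register: u32,
    visibility: Visibility,
}

// bucket descriptors by (type, register, visibility); each bucket then
// becomes one root parameter, so every bucket gets a unique slot
fn group_into_slots(
    bindings: &[DescriptorBinding],
) -> HashMap<(DescriptorType, u32, Visibility), Vec<usize>> {
    let mut slots: HashMap<_, Vec<usize>> = HashMap::new();
    for (i, b) in bindings.iter().enumerate() {
        slots.entry((b.ty, b.register, b.visibility)).or_default().push(i);
    }
    slots
}

fn main() {
    let bindings = [
        DescriptorBinding { ty: DescriptorType::Srv, register: 0, visibility: Visibility::Fragment },
        DescriptorBinding { ty: DescriptorType::Srv, register: 0, visibility: Visibility::Fragment },
        DescriptorBinding { ty: DescriptorType::Cbv, register: 0, visibility: Visibility::All },
    ];
    // the two srv descriptors share one slot, the cbv gets its own
    let slots = group_into_slots(&bindings);
    assert_eq!(slots.len(), 2);
}
```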

<p>The key takeaway with regard to bindless rendering was that we bind a heap separately and can then apply offsets within that heap to the different slots in the pipeline; critically, though, only a single descriptor heap can be bound at any one time. So in a bindless rendering architecture we use a single heap that contains all of our resources. Equipped with this knowledge, it was easy to add a utility function that binds the heap to all of the slots which need access to <code class="language-plaintext highlighter-rouge">Descriptors</code>… I still feel uncomfortable calling them descriptors and tend to refer to them as shader resources myself, but anyway. A quick example of how the code evolved over time:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">render</span><span class="p">(</span><span class="n">cmd_buf</span><span class="p">:</span> <span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">CmdBuf</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">:</span> <span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">Heap</span><span class="p">)</span> <span class="p">{</span>

    <span class="c1">// 1. initially you would bind the heap onto the pipeline slot manually...</span>
    <span class="n">cmd_buf</span><span class="nf">.set_render_heap</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>

    <span class="c1">// you would also have a separate function for compute</span>
    <span class="n">cmd_buf</span><span class="nf">.set_compute_heap</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>

    <span class="c1">// and would need to bind to multiple slots</span>
    <span class="n">cmd_buf</span><span class="nf">.set_render_heap</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>
    <span class="n">cmd_buf</span><span class="nf">.set_render_heap</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>
    <span class="n">cmd_buf</span><span class="nf">.set_render_heap</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>

    <span class="c1">// 2. I then added a utility so we know which slot is associated with a particular register</span>
    <span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">ShaderResource</span><span class="p">);</span>
    <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
        <span class="n">cmd_buf</span><span class="nf">.set_render_heap</span><span class="p">(</span><span class="n">slot</span><span class="py">.index</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// this still required multiple binds</span>
    <span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">ShaderResource</span><span class="p">);</span>
    <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
        <span class="n">cmd_buf</span><span class="nf">.set_render_heap</span><span class="p">(</span><span class="n">slot</span><span class="py">.index</span><span class="p">,</span> <span class="n">shader_heap</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// 3. moved to a single set_heap call with generics for compute and render pipelines</span>
    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.set_heap</span><span class="p">(</span><span class="n">render_pipeline</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">);</span>

    <span class="c1">// and the same call works for a compute pipeline</span>
    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.set_heap</span><span class="p">(</span><span class="n">compute_pipeline</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="bindful-rendering">Bindful Rendering</h3>

<p>It is still useful sometimes to revert to a ‘bindful’ render model; the imgui backend does this, since the shader only uses a single texture and the <code class="language-plaintext highlighter-rouge">ImTextureID</code> is passed through code as previously discussed. I also used a similar approach in a <code class="language-plaintext highlighter-rouge">blit</code> function that just needed access to a single texture. Here I added a utility function that obtains a <code class="language-plaintext highlighter-rouge">PipelineSlot</code> based on the register, space and type of resource. An offset into the heap can then be applied to the <code class="language-plaintext highlighter-rouge">PipelineSlot</code>. The offset is supplied by a texture’s or buffer’s <code class="language-plaintext highlighter-rouge">srv_index</code> or <code class="language-plaintext highlighter-rouge">uav_index</code>; binding the heap at this offset essentially gives you the same behaviour as an old-school Direct3D11 style renderer.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// set the heap</span>
<span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.set_heap</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">);</span>

<span class="c1">// find the slot and bind the offset</span>
<span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">ShaderResource</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_binding</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">,</span> <span class="n">slot</span><span class="py">.index</span><span class="p">,</span> <span class="n">srv</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">PipelineSlot</code> API can also be used to get the correct location of push constants, again looking them up by register and space from the shader. The return value is optional, which makes it possible to re-use shared code and only bind certain constants if a shader requires them. For example, some shaders may need push constants for the view matrix as well as an object world matrix, while others may only need one or the other.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// bind view push constants</span>
<span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span><span class="n">slot</span><span class="py">.index</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">camera</span><span class="py">.view_projection_matrix</span><span class="p">));</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span><span class="n">slot</span><span class="py">.index</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">camera</span><span class="py">.view_position</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// bind the world buffer info</span>
<span class="k">let</span> <span class="n">world_buffer_info</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_world_buffer_info</span><span class="p">();</span>
<span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span>
        <span class="n">slot</span><span class="py">.index</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">num_32bit_constants</span><span class="p">(</span><span class="o">&amp;</span><span class="n">world_buffer_info</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">world_buffer_info</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="gpu-driven-rendering">GPU-Driven Rendering</h2>

<p>In addition to the bindless architecture, I intend the final core <code class="language-plaintext highlighter-rouge">ecs</code> architecture to be GPU driven. GPU-driven rendering allows command buffers to be populated on the GPU, which offloads CPU-intensive work. This is where graphics APIs diverge quite significantly, so for this stage I am focusing on what is possible in Direct3D12, though I do have one eye on compatibility with other platforms. There is a nice, detailed <a href="https://github.com/gpuweb/gpuweb/issues/31">post</a> on the webgpu issues page that outlines the differences between graphics APIs. I also had some prior experience with Metal, so the concepts are relatively familiar. In short: Metal allows you to build entire command buffers on the GPU; Direct3D12 allows you to change bindings (vertex, index or descriptor), push constants and draw arguments, but not pipelines; Vulkan allows you only to change draw arguments… at least without extensions.</p>

<p>I will cross the bridge of cross-platform support when I come to it, but in its current form hotline supports <code class="language-plaintext highlighter-rouge">execute_indirect</code> just as Direct3D12 does. I made a few samples to trial this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// populate a buffer of draw arguments</span>
<span class="k">let</span> <span class="n">args</span> <span class="o">=</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">DrawArguments</span> <span class="p">{</span>
    <span class="n">vertex_count_per_instance</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
    <span class="n">instance_count</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
    <span class="n">start_vertex_location</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
    <span class="n">start_instance_location</span><span class="p">:</span> <span class="mi">0</span>
<span class="p">};</span>

<span class="c1">// create an INDIRECT_ARGUMENT_BUFFER</span>
<span class="k">let</span> <span class="n">draw_args</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_buffer</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">BufferInfo</span><span class="p">{</span>
    <span class="n">usage</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">BufferUsage</span><span class="p">::</span><span class="n">INDIRECT_ARGUMENT_BUFFER</span><span class="p">,</span>
    <span class="n">cpu_access</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">CpuAccessFlags</span><span class="p">::</span><span class="n">NONE</span><span class="p">,</span>
    <span class="n">format</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">Format</span><span class="p">::</span><span class="n">Unknown</span><span class="p">,</span>
    <span class="n">stride</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">DrawArguments</span><span class="o">&gt;</span><span class="p">(),</span>
    <span class="n">initial_state</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">,</span>
    <span class="n">num_elements</span><span class="p">:</span> <span class="mi">1</span>
<span class="p">},</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="nd">data!</span><span class="p">(</span><span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">args</span><span class="p">)))</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="c1">// create a command signature</span>
<span class="k">let</span> <span class="n">command_signature</span> <span class="o">=</span> <span class="n">device</span><span class="py">.create_indirect_render_command</span><span class="p">::</span><span class="o">&lt;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">DrawArguments</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="nd">vec!</span><span class="p">[</span><span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">{</span>
        <span class="n">argument_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">IndirectArgumentType</span><span class="p">::</span><span class="n">Draw</span><span class="p">,</span>
        <span class="n">arguments</span><span class="p">:</span> <span class="nb">None</span>
    <span class="p">}],</span> 
    <span class="nb">None</span>
<span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="c1">// bind buffers and make the execute indirect call </span>
<span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">world_matrix</span><span class="na">.0</span><span class="p">);</span>
<span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_index_buffer</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mesh</span><span class="na">.0</span><span class="py">.ib</span><span class="p">);</span>
<span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_vertex_buffer</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mesh</span><span class="na">.0</span><span class="py">.vb</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

<span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.execute_indirect</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="n">command</span><span class="na">.0</span><span class="p">,</span> 
    <span class="mi">1</span><span class="p">,</span> 
    <span class="o">&amp;</span><span class="n">args</span><span class="na">.0</span><span class="p">,</span> 
    <span class="mi">0</span><span class="p">,</span> 
    <span class="nb">None</span><span class="p">,</span> 
    <span class="mi">0</span>
<span class="p">);</span>
</code></pre></div></div>

<h3 id="gpu-entity-frustum-culling">GPU Entity Frustum Culling</h3>

<p>I set up a basic example with a large number of draw calls being submitted from the CPU to see how switching to <code class="language-plaintext highlighter-rouge">execute_indirect</code> would fare. The initial implementation loaded 32k entities with 30 unique meshes randomly selected, so the vertex and index buffers needed changing for each draw call, with a single pipeline used for all meshes. This clocked in heavily CPU-bound at about 80ms per frame, with no culling being performed whatsoever.</p>

<p>Setting up for <code class="language-plaintext highlighter-rouge">execute_indirect</code> is fairly straightforward. It starts with a buffer created on the CPU containing the draw arguments and buffer indices for every entity we want to draw. We then create an unordered access buffer into which the draw arguments for only the visible entities are copied dynamically, after culling them on the GPU. Here it uses an <code class="language-plaintext highlighter-rouge">AppendStructuredBuffer</code> in the shader; this type is not present in the Metal shading language or GLSL, so in future I will have to implement some system to get the same behaviour. Essentially it consists of a buffer for data with space reserved to store an atomic counter, which is incremented as we append items into the buffer. We can pass this counter to the <code class="language-plaintext highlighter-rouge">execute_indirect</code> call so it knows how many entities to draw.</p>
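<p>The append-and-count mechanics can be modelled on the CPU in plain Rust. This is only a sketch of the counter behaviour, not the HLSL or hotline code, and the <code class="language-plaintext highlighter-rouge">AppendBuffer</code> type and its methods are hypothetical names:</p>

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// CPU-side model of an AppendStructuredBuffer: a fixed-size data region
// plus an atomic counter that hands out the next free element index
struct AppendBuffer {
    data: Vec<AtomicU32>, // stand-in for per-entity draw arguments
    counter: AtomicU32,   // the packed atomic counter, as on the GPU
}

impl AppendBuffer {
    fn new(capacity: usize) -> Self {
        Self {
            data: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
            counter: AtomicU32::new(0),
        }
    }

    // equivalent of AppendStructuredBuffer::Append: atomically bump the
    // counter to reserve an index, then write the element there
    fn append(&self, value: u32) {
        let i = self.counter.fetch_add(1, Ordering::Relaxed) as usize;
        self.data[i].store(value, Ordering::Relaxed);
    }

    // the value handed to execute_indirect as the number of draws
    fn count(&self) -> u32 {
        self.counter.load(Ordering::Relaxed)
    }
}
```

<p>Only entities that pass the frustum test would call <code class="language-plaintext highlighter-rouge">append</code>, so the counter naturally becomes the number of indirect draws to execute, and resetting it each frame is just a matter of writing zero back into that slot.</p>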

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// command signature specifies we change vertex and index buffers and update 2 push constants</span>
<span class="k">let</span> <span class="n">command_signature</span> <span class="o">=</span> <span class="n">device</span><span class="py">.create_indirect_render_command</span><span class="p">::</span><span class="o">&lt;</span><span class="n">DrawIndirectArgs</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="nd">vec!</span><span class="p">[</span>
        <span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">{</span>
            <span class="n">argument_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">IndirectArgumentType</span><span class="p">::</span><span class="n">VertexBuffer</span><span class="p">,</span>
            <span class="n">arguments</span><span class="p">:</span> <span class="nf">Some</span><span class="p">(</span><span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectTypeArguments</span> <span class="p">{</span>
                <span class="n">buffer</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectBufferArguments</span> <span class="p">{</span>
                    <span class="n">slot</span><span class="p">:</span> <span class="mi">0</span>
                <span class="p">}</span>
            <span class="p">})</span>
        <span class="p">},</span>
        <span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">{</span>
            <span class="n">argument_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">IndirectArgumentType</span><span class="p">::</span><span class="n">IndexBuffer</span><span class="p">,</span>
            <span class="n">arguments</span><span class="p">:</span> <span class="nb">None</span>
        <span class="p">},</span>
        <span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">{</span>
            <span class="n">argument_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">IndirectArgumentType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">,</span>
            <span class="n">arguments</span><span class="p">:</span> <span class="nf">Some</span><span class="p">(</span><span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectTypeArguments</span> <span class="p">{</span>
                <span class="n">push_constants</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectPushConstantsArguments</span> <span class="p">{</span>
                    <span class="n">slot</span><span class="p">:</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">()</span><span class="py">.slot</span><span class="p">,</span>
                    <span class="n">offset</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
                    <span class="n">num_values</span><span class="p">:</span> <span class="mi">4</span>
                <span class="p">}</span>
            <span class="p">})</span>
        <span class="p">},</span>
        <span class="nn">gfx</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">{</span>
            <span class="n">argument_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">IndirectArgumentType</span><span class="p">::</span><span class="n">DrawIndexed</span><span class="p">,</span>
            <span class="n">arguments</span><span class="p">:</span> <span class="nb">None</span>
        <span class="p">}</span>
    <span class="p">],</span> 
    <span class="nf">Some</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)</span>
<span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="c1">// buffer is populated with draw call information for all entities</span>
<span class="c1">// read data from the arg_buffer in compute shader to generate the `dynamic_buffer`</span>
<span class="k">let</span> <span class="n">arg_buffer</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_buffer_with_heap</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">BufferInfo</span><span class="p">{</span>
    <span class="n">usage</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">BufferUsage</span><span class="p">::</span><span class="n">SHADER_RESOURCE</span><span class="p">,</span>
    <span class="n">cpu_access</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">CpuAccessFlags</span><span class="p">::</span><span class="n">NONE</span><span class="p">,</span>
    <span class="n">format</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">Format</span><span class="p">::</span><span class="n">Unknown</span><span class="p">,</span>
    <span class="n">stride</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">DrawIndirectArgs</span><span class="o">&gt;</span><span class="p">(),</span>
    <span class="n">initial_state</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">,</span>
    <span class="n">num_elements</span><span class="p">:</span> <span class="n">indirect_args</span><span class="nf">.len</span><span class="p">()</span>
<span class="p">},</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="nd">data!</span><span class="p">(</span><span class="o">&amp;</span><span class="n">indirect_args</span><span class="p">),</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="c1">// append buffer created to copy visible entities into</span>
<span class="c1">// dynamic buffer has a counter packed at the end</span>
<span class="k">let</span> <span class="n">dynamic_buffer</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_buffer_with_heap</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">BufferInfo</span><span class="p">{</span>
    <span class="n">usage</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">BufferUsage</span><span class="p">::</span><span class="n">INDIRECT_ARGUMENT_BUFFER</span> <span class="p">|</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">BufferUsage</span><span class="p">::</span><span class="n">UNORDERED_ACCESS</span> <span class="p">|</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">BufferUsage</span><span class="p">::</span><span class="n">APPEND_COUNTER</span><span class="p">,</span>
    <span class="n">cpu_access</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">CpuAccessFlags</span><span class="p">::</span><span class="n">NONE</span><span class="p">,</span>
    <span class="n">format</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">Format</span><span class="p">::</span><span class="n">Unknown</span><span class="p">,</span>
    <span class="n">stride</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">DrawIndirectArgs</span><span class="o">&gt;</span><span class="p">(),</span>
    <span class="n">initial_state</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">,</span>
    <span class="n">num_elements</span><span class="p">:</span> <span class="n">indirect_args</span><span class="nf">.len</span><span class="p">(),</span>
<span class="p">},</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="nd">data!</span><span class="p">[],</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="c1">// create a buffer with 0, so we can clear the counter each frame by copy buffer region</span>
<span class="k">let</span> <span class="n">counter_reset</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_buffer_with_heap</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">BufferInfo</span><span class="p">{</span>
    <span class="n">usage</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">BufferUsage</span><span class="p">::</span><span class="n">NONE</span><span class="p">,</span>
    <span class="n">cpu_access</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">CpuAccessFlags</span><span class="p">::</span><span class="n">NONE</span><span class="p">,</span>
    <span class="n">format</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">Format</span><span class="p">::</span><span class="n">Unknown</span><span class="p">,</span>
    <span class="n">stride</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="nb">u32</span><span class="o">&gt;</span><span class="p">(),</span>
    <span class="n">initial_state</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">CopySrc</span><span class="p">,</span>
    <span class="n">num_elements</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="p">},</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="nd">data!</span><span class="p">[</span><span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="mi">0</span><span class="p">)],</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="k">fn</span> <span class="nf">render</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">// reset the counter</span>
    <span class="k">let</span> <span class="n">offset</span> <span class="o">=</span> <span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="nf">.get_counter_offset</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.copy_buffer_region</span><span class="p">(</span><span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.counter_reset</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="nb">u32</span><span class="o">&gt;</span><span class="p">());</span>

    <span class="c1">// transition to `UnorderedAccess`</span>
    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.transition_barrier</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">TransitionBarrier</span> <span class="p">{</span>
        <span class="n">texture</span><span class="p">:</span> <span class="nb">None</span><span class="p">,</span>
        <span class="n">buffer</span><span class="p">:</span> <span class="nf">Some</span><span class="p">(</span><span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="p">),</span>
        <span class="n">state_before</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">CopyDst</span><span class="p">,</span>
        <span class="n">state_after</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">UnorderedAccess</span><span class="p">,</span>
    <span class="p">});</span>

    <span class="c1">// dispatch compute job to perform culling</span>
    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.dispatch</span><span class="p">(</span>
        <span class="nn">gfx</span><span class="p">::</span><span class="n">Size3</span> <span class="p">{</span>
            <span class="n">x</span><span class="p">:</span> <span class="n">indirect_draw</span><span class="py">.max_count</span> <span class="o">/</span> <span class="n">pass</span><span class="py">.numthreads.x</span><span class="p">,</span>
            <span class="n">y</span><span class="p">:</span> <span class="n">pass</span><span class="py">.numthreads.y</span><span class="p">,</span>
            <span class="n">z</span><span class="p">:</span> <span class="n">pass</span><span class="py">.numthreads.z</span>
        <span class="p">},</span>
        <span class="n">pass</span><span class="py">.numthreads</span>
    <span class="p">);</span>

    <span class="c1">// transition to `IndirectArgument`</span>
    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.transition_barrier</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">TransitionBarrier</span> <span class="p">{</span>
        <span class="n">texture</span><span class="p">:</span> <span class="nb">None</span><span class="p">,</span>
        <span class="n">buffer</span><span class="p">:</span> <span class="nf">Some</span><span class="p">(</span><span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="p">),</span>
        <span class="n">state_before</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">UnorderedAccess</span><span class="p">,</span>
        <span class="n">state_after</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ResourceState</span><span class="p">::</span><span class="n">IndirectArgument</span><span class="p">,</span>
    <span class="p">});</span>

    <span class="c1">// draw indirect</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.execute_indirect</span><span class="p">(</span>
        <span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.signature</span><span class="p">,</span>
        <span class="n">indirect_draw</span><span class="py">.max_count</span><span class="p">,</span>
        <span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span>
        <span class="nf">Some</span><span class="p">(</span><span class="o">&amp;</span><span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="p">),</span>
        <span class="n">indirect_draw</span><span class="py">.dynamic_buffer</span><span class="nf">.get_counter_offset</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the shader we use bindless lookups to obtain the entity&#8217;s extents data and the camera planes, then test against each plane of the frustum to detect whether an entity is inside it or not. If the entity is visible, its draw data is copied into the indirect argument buffer and the counter is incremented.</p>

<div class="language-hlsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">indirect_draw</span> <span class="p">{</span>
    <span class="n">buffer_view</span>         <span class="n">vb</span><span class="p">;</span>
    <span class="n">buffer_view</span>         <span class="n">ib</span><span class="p">;</span>
    <span class="n">uint4</span>               <span class="n">ids</span><span class="p">;</span>
    <span class="n">draw_indexed_args</span>   <span class="n">args</span><span class="p">;</span>
<span class="p">};</span>

<span class="c1">// potential draw calls we want to make</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">indirect_draw</span><span class="o">&gt;</span> <span class="n">input_draws</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space11</span><span class="p">);</span>

<span class="c1">// draw calls to populate during the `cs_frustum_cull` dispatch</span>
<span class="nb">AppendStructuredBuffer</span><span class="o">&lt;</span><span class="n">indirect_draw</span><span class="o">&gt;</span> <span class="n">output_draws</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">u0</span><span class="p">,</span> <span class="n">space0</span><span class="p">);</span>

<span class="n">bool</span> <span class="nf">aabb_vs_frustum</span><span class="p">(</span><span class="kt">float3</span> <span class="n">aabb_pos</span><span class="p">,</span> <span class="kt">float3</span> <span class="n">aabb_extent</span><span class="p">,</span> <span class="kt">float4</span> <span class="n">planes</span><span class="p">[</span><span class="mi">6</span><span class="p">])</span> <span class="p">{</span>
    <span class="n">bool</span> <span class="n">inside</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">int</span> <span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">p</span> <span class="o">&lt;</span> <span class="mi">6</span><span class="p">;</span> <span class="o">++</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">float3</span> <span class="n">sign_flip</span> <span class="o">=</span> <span class="nb">sign</span><span class="p">(</span><span class="n">planes</span><span class="p">[</span><span class="n">p</span><span class="p">].</span><span class="n">xyz</span><span class="p">)</span> <span class="o">*</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">float</span> <span class="n">pd</span> <span class="o">=</span> <span class="n">planes</span><span class="p">[</span><span class="n">p</span><span class="p">].</span><span class="n">w</span><span class="p">;</span>
        <span class="n">float</span> <span class="n">d2</span> <span class="o">=</span> <span class="nb">dot</span><span class="p">(</span><span class="n">aabb_pos</span> <span class="o">+</span> <span class="n">aabb_extent</span> <span class="o">*</span> <span class="n">sign_flip</span><span class="p">,</span> <span class="n">planes</span><span class="p">[</span><span class="n">p</span><span class="p">].</span><span class="n">xyz</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">d2</span> <span class="o">&gt;</span> <span class="o">-</span><span class="n">pd</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">inside</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">inside</span><span class="p">;</span>
<span class="p">}</span>

<span class="p">[</span><span class="nb">numthreads</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="kt">void</span> <span class="nf">cs_frustum_cull</span><span class="p">(</span><span class="n">uint</span> <span class="n">did</span> <span class="o">:</span> <span class="n">SV_DispatchThreadID</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// grab entity draw data</span>
    <span class="n">extent_data</span> <span class="n">extents</span> <span class="o">=</span> <span class="n">get_extent_data</span><span class="p">(</span><span class="n">did</span><span class="p">);</span>
    <span class="n">camera_data</span> <span class="n">main_camera</span> <span class="o">=</span> <span class="n">get_camera_data</span><span class="p">();</span>

    <span class="c1">// grab potential draw call</span>
    <span class="n">indirect_draw</span> <span class="n">input</span> <span class="o">=</span> <span class="n">input_draws</span><span class="p">[</span><span class="n">resources</span><span class="p">.</span><span class="n">input1</span><span class="p">.</span><span class="n">index</span><span class="p">][</span><span class="n">did</span><span class="p">];</span>

    <span class="k">if</span><span class="p">(</span><span class="n">aabb_vs_frustum</span><span class="p">(</span><span class="n">extents</span><span class="p">.</span><span class="n">pos</span><span class="p">,</span> <span class="n">extents</span><span class="p">.</span><span class="n">extent</span><span class="p">,</span> <span class="n">main_camera</span><span class="p">.</span><span class="n">planes</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">output_draws</span><span class="p">[</span><span class="n">resources</span><span class="p">.</span><span class="n">input0</span><span class="p">.</span><span class="n">index</span><span class="p">].</span><span class="n">Append</span><span class="p">(</span><span class="n">input</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The plane culling code is some old code from my C++ code base; it was implemented following this excellent <a href="https://fgiesen.wordpress.com/2010/10/17/view-frustum-culling">article</a> by <a href="@rygorous@mastodon.gamedev.place">ryg</a>.</p>
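<p>To sanity-check that logic on the CPU, the same plane test can be sketched in Rust. This is a hypothetical port, not engine code: it assumes the convention used in the shader above, where plane normals face outward and a box is culled when it lies entirely on the positive side of any plane.</p>

```rust
// CPU-side sketch of the shader's aabb_vs_frustum test. Assumes
// outward-facing plane normals stored as (nx, ny, nz, w).
fn aabb_vs_frustum(pos: [f32; 3], extent: [f32; 3], planes: &[[f32; 4]; 6]) -> bool {
    for p in planes {
        // signed distance term for the box vertex farthest in the -normal direction
        let mut d = 0.0;
        for i in 0..3 {
            let v = pos[i] - extent[i] * p[i].signum();
            d += v * p[i];
        }
        // if even this vertex is on the positive side, the whole box is outside
        if d > -p[3] {
            return false;
        }
    }
    true
}

fn main() {
    // six planes enclosing the unit cube [-1, 1]^3
    let planes = [
        [1.0, 0.0, 0.0, -1.0], [-1.0, 0.0, 0.0, -1.0],
        [0.0, 1.0, 0.0, -1.0], [0.0, -1.0, 0.0, -1.0],
        [0.0, 0.0, 1.0, -1.0], [0.0, 0.0, -1.0, -1.0],
    ];
    assert!(aabb_vs_frustum([0.0; 3], [0.5; 3], &planes)); // inside
    assert!(!aabb_vs_frustum([3.0, 0.0, 0.0], [0.5; 3], &planes)); // culled
}
```

Unlike the shader, which sets a flag and keeps looping, this version early-outs on the first separating plane; the returned boolean is the same.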

<p>Switching from regular <code class="language-plaintext highlighter-rouge">draw_indexed</code> calls to <code class="language-plaintext highlighter-rouge">draw_indirect</code> improves the CPU time significantly (80ms down to 16ms with v-sync), and the GPU time also drops due to the decreased vertex workload. I was then able to increase the number of entities and get the same performance. I did notice that the program still becomes CPU bound at higher draw or entity counts - this is because the entities&#8217; positions, world matrices, and bounds are updated on the CPU. More of this work could be offloaded to the GPU, and static vs dynamic objects could be managed differently. There also seems to be some increased CPU overhead in <code class="language-plaintext highlighter-rouge">execute_indirect</code> with larger draw counts; I want to investigate the performance difference between indirect indexed draws and instanced indexed draws as well.</p>

<p>There is still more research to do in this area. I don&#8217;t have a GPU that supports mesh shaders yet, so I am still investigating what is possible without such luxuries. I think instanced <code class="language-plaintext highlighter-rouge">execute_indirect</code> calls will be helpful, and some triangle / cluster level culling could also be added for large meshes - I can see a few possible in-roads there without the need for mesh shaders. While that stuff sits in the back of my mind, putting all of this together leads to the next section.</p>

<h2 id="bindless--gpu-driven-entity-component-system">Bindless / GPU Driven Entity Component System</h2>

<p>With the bindless setup and GPU driven examples in mind, some structure started to form for how these systems would be driven by entities in the entity component system. Across the various samples, the following data needs to be accessible on the GPU:</p>

<ul>
  <li>per-entity draw data (world matrix).</li>
  <li>per-entity bounds / extents data for GPU culling.</li>
  <li>material data (texture ids, colours, material parameters).</li>
  <li>light data (positions, colours, attenuation factors).</li>
  <li>camera data (projection matrices, view positions).</li>
</ul>

<p>This data can be stored in <code class="language-plaintext highlighter-rouge">StructuredBuffers</code> and updated each frame. The <code class="language-plaintext highlighter-rouge">pmfx</code> system creates a set of <code class="language-plaintext highlighter-rouge">WorldBuffers</code>, which is essentially a structure of arrays where each member is a <code class="language-plaintext highlighter-rouge">DynamicBuffer</code> - a structured buffer that can grow and stretch like a vector. I am using persistently mapped buffers to make updates to the GPU and multi-buffering the internals, so each frame we write to a back buffer while the front buffer can be read on the GPU. The world buffers contain the following:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">DynamicWorldBuffers</span><span class="o">&lt;</span><span class="n">D</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">Device</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="cd">/// Structured buffer containing bindless draw call information `DrawData`</span>
    <span class="k">pub</span> <span class="n">draw</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">DrawData</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Structured buffer containing entity extents `ExtentData`</span>
    <span class="k">pub</span> <span class="n">extent</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">ExtentData</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Structured buffer containing `MaterialData`</span>
    <span class="k">pub</span> <span class="n">material</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">MaterialData</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Structured buffer containing `PointLightData`</span>
    <span class="k">pub</span> <span class="n">point_light</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">PointLightData</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Structured buffer containing `SpotLightData`</span>
    <span class="k">pub</span> <span class="n">spot_light</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">SpotLightData</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Structured buffer containing `DirectionalLightData`</span>
    <span class="k">pub</span> <span class="n">directional_light</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">DirectionalLightData</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Constant buffer containing camera info</span>
    <span class="k">pub</span> <span class="n">camera</span><span class="p">:</span> <span class="n">DynamicBuffer</span><span class="o">&lt;</span><span class="n">D</span><span class="p">,</span> <span class="n">CameraData</span><span class="o">&gt;</span>
<span class="p">}</span>
</code></pre></div></div>
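<p>The multi-buffering mentioned above can be sketched CPU-side as a small ring of buffers. The names here are illustrative, not the actual hotline_rs types: each frame the CPU writes into one copy while the GPU reads the copy written the previous frame, so neither side stalls on the other.</p>

```rust
// Hypothetical sketch of N-way buffering for dynamic world buffers.
// In the real engine the copies are persistently mapped GPU buffers;
// plain Vecs stand in for them here.
struct MultiBuffered<T, const N: usize> {
    buffers: [Vec<T>; N],
    frame: usize,
}

impl<T, const N: usize> MultiBuffered<T, N> {
    fn new() -> Self {
        Self { buffers: std::array::from_fn(|_| Vec::new()), frame: 0 }
    }
    // the copy the CPU writes this frame
    fn write(&mut self) -> &mut Vec<T> {
        &mut self.buffers[self.frame % N]
    }
    // the copy the GPU reads (written the previous frame)
    fn read(&self) -> &Vec<T> {
        &self.buffers[(self.frame + N - 1) % N]
    }
    // advance at the end of the frame
    fn swap(&mut self) {
        self.frame += 1;
    }
}

fn main() {
    let mut mb: MultiBuffered<u32, 2> = MultiBuffered::new();
    mb.write().push(1);
    mb.swap();
    mb.write().push(2);
    // the GPU-visible copy still holds last frame's data
    assert_eq!(mb.read(), &vec![1]);
}
```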

<p>In the shader code we have these unbounded bindless arrays of different resource types, which all alias the same register but sit in different spaces.</p>

<div class="language-hlsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// structures of arrays for indirect / bindless lookups</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">draw_data</span><span class="o">&gt;</span> <span class="n">draws</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space0</span><span class="p">);</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">extent_data</span><span class="o">&gt;</span> <span class="n">extents</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space1</span><span class="p">);</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">material_data</span><span class="o">&gt;</span> <span class="n">materials</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space2</span><span class="p">);</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">point_light_data</span><span class="o">&gt;</span> <span class="n">point_lights</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space3</span><span class="p">);</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">spot_light_data</span><span class="o">&gt;</span> <span class="n">spot_lights</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space4</span><span class="p">);</span>
<span class="kt">StructuredBuffer</span><span class="o">&lt;</span><span class="n">directional_light_data</span><span class="o">&gt;</span> <span class="n">directional_lights</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space5</span><span class="p">);</span>

<span class="c1">// textures </span>
<span class="kt">Texture2D</span> <span class="n">textures</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space6</span><span class="p">);</span>
<span class="kt">Texture2DMS</span><span class="o">&lt;</span><span class="kt">float4</span><span class="p">,</span> <span class="mi">8</span><span class="o">&gt;</span> <span class="n">msaa8x_textures</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space7</span><span class="p">);</span>
<span class="kt">TextureCube</span> <span class="n">cubemaps</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space8</span><span class="p">);</span>
<span class="kt">Texture2DArray</span> <span class="n">texture_arrays</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space9</span><span class="p">);</span>
<span class="kt">Texture3D</span> <span class="n">volume_textures</span><span class="p">[]</span> <span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">t0</span><span class="p">,</span> <span class="n">space10</span><span class="p">);</span>
</code></pre></div></div>

<p>All resources go into the same <code class="language-plaintext highlighter-rouge">gfx::Heap</code>. I call this the <code class="language-plaintext highlighter-rouge">shader_heap</code> and it contains textures of all kinds as well as structured buffers. We can then use indices to look up the information we need on the GPU. Some things like materials have two levels of indirection (first look up the material by ID, then look up textures by the IDs the material provides). Depending on how draw calls are made this information may come from different sources, and I have explored a few different strategies which I will cover later in this post. For the simplest approach, let&#8217;s say that per draw call we use <code class="language-plaintext highlighter-rouge">PushConstants</code> to push constants that tell us the ids of each of the world buffers. The <code class="language-plaintext highlighter-rouge">WorldBuffersInfo</code> struct contains a pair of <code class="language-plaintext highlighter-rouge">uint</code>&#8217;s per buffer - one to identify the location of the buffer and one to give its length, so we can loop over <code class="language-plaintext highlighter-rouge">n</code> lights and also perform range checks. In the context of <code class="language-plaintext highlighter-rouge">execute_indirect</code>, the ids of the draw and material buffers are pushed through as part of the indirect draw arguments.</p>
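<p>The per-buffer uint pair can be sketched like this. The struct and helper names are illustrative, not the real <code class="language-plaintext highlighter-rouge">WorldBuffersInfo</code> definition; the point is the range-checked lookup that the length field enables.</p>

```rust
// Illustrative sketch of a per-buffer (id, length) pair; in the engine
// these values are pushed to the shader as 32-bit constants.
#[repr(C)]
#[derive(Clone, Copy)]
struct BufferInfo {
    id: u32,    // index of the structured buffer in the shader heap
    count: u32, // element count, enabling loops and range checks
}

// range-checked lookup, mirroring what the shader-side helpers can do;
// the heap is modelled here as a slice of Vecs
fn lookup<T>(heap: &[Vec<T>], info: BufferInfo, i: u32) -> Option<&T> {
    if i >= info.count {
        return None; // out of the advertised range
    }
    heap.get(info.id as usize)?.get(i as usize)
}

fn main() {
    // pretend shader heap with one structured buffer in it
    let heap = vec![vec![10u32, 20]];
    let info = BufferInfo { id: 0, count: 2 };
    assert_eq!(lookup(&heap, info, 1), Some(&20));
    assert_eq!(lookup(&heap, info, 2), None);
}
```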

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// bind view push constants</span>
<span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span><span class="n">slot</span><span class="py">.slot</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">camera</span><span class="py">.view_projection_matrix</span><span class="p">));</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span><span class="n">slot</span><span class="py">.slot</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">camera</span><span class="py">.view_position</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// bind the world buffer info</span>
<span class="k">let</span> <span class="n">world_buffer_info</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_world_buffer_info</span><span class="p">();</span>
<span class="k">let</span> <span class="n">slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">slot</span> <span class="p">{</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_render_constants</span><span class="p">(</span>
        <span class="n">slot</span><span class="py">.slot</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">num_32bit_constants</span><span class="p">(</span><span class="o">&amp;</span><span class="n">world_buffer_info</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">world_buffer_info</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// bind the shader resource heap</span>
<span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_heap</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pmfx</span><span class="py">.shader_heap</span><span class="p">);</span>
</code></pre></div></div>

<p>Now in any shader we can look up the <code class="language-plaintext highlighter-rouge">WorldBuffers</code> and get a particular draw, material, or light data. I made some utility functions to assist this process, which also make the lookups more readable.</p>

<div class="language-hlsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// get entity world matrix based on entity id</span>
<span class="n">draw_data</span> <span class="n">draw</span> <span class="o">=</span> <span class="n">get_draw_data</span><span class="p">(</span><span class="n">entity_input</span><span class="p">.</span><span class="n">ids</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>

<span class="c1">// get entity material based on id</span>
<span class="n">material_data</span> <span class="n">mat</span> <span class="o">=</span> <span class="n">get_material_data</span><span class="p">(</span><span class="n">entity_input</span><span class="p">.</span><span class="n">ids</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>

<span class="c1">// lookup lights and loop over</span>
<span class="n">uint</span> <span class="n">point_lights_id</span> <span class="o">=</span> <span class="n">world_buffer_info</span><span class="p">.</span><span class="n">point_light</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="n">uint</span> <span class="n">point_lights_count</span> <span class="o">=</span> <span class="n">world_buffer_info</span><span class="p">.</span><span class="n">point_light</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>

<span class="k">if</span><span class="p">(</span><span class="n">point_lights_id</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span><span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">point_lights_count</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">point_light_data</span> <span class="n">light</span> <span class="o">=</span> <span class="n">point_lights</span><span class="p">[</span><span class="n">point_lights_id</span><span class="p">][</span><span class="n">i</span><span class="p">];</span>

        <span class="c1">// ..</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="compute-passes">Compute Passes</h2>

<p>The main scene can be rendered through render systems driven by the <code class="language-plaintext highlighter-rouge">bevy_ecs</code> scheduler, and I have now added support for compute passes inside <code class="language-plaintext highlighter-rouge">.pmfx</code> configs, which can be dispatched automatically or hooked into their own function if some custom code is required. I intend to do all post-processing through compute shaders and completely abandon rasterization post-processing. If all data required by a compute shader is supplied in a <code class="language-plaintext highlighter-rouge">.pmfx</code> config, it is very quick and easy to integrate new compute passes into the frame’s render graph:</p>

<div class="language-jsonnet highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">textures</span><span class="p">:</span> <span class="p">{</span>
    <span class="nx">compute_texture3d</span><span class="p">:</span> <span class="p">{</span>
        <span class="nx">width</span><span class="p">:</span> <span class="mi">64</span><span class="p">,</span>
        <span class="nx">height</span><span class="p">:</span> <span class="mi">64</span><span class="p">,</span>
        <span class="nx">depth</span><span class="p">:</span> <span class="mi">64</span><span class="p">,</span>
        <span class="nx">usage</span><span class="p">:</span> <span class="p">[</span><span class="nx">UnorderedAccess</span><span class="p">,</span> <span class="nx">ShaderResource</span><span class="p">]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="nx">pipelines</span><span class="p">:</span> <span class="p">{</span>
    <span class="nx">compute_write_texture3d</span><span class="p">:</span> <span class="p">{</span>
        <span class="nx">cs</span><span class="p">:</span> <span class="nx">cs_write_texture3d</span>
        <span class="nx">push_constants</span><span class="p">:</span> <span class="p">[</span>
            <span class="nx">resources</span>
        <span class="p">]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="nx">render_graphs</span><span class="p">:</span> <span class="p">{</span>
    <span class="nx">compute_test</span><span class="p">(</span><span class="nx">base</span><span class="p">):</span> <span class="p">{</span>
        <span class="nx">write_texture</span><span class="p">:</span> <span class="p">{</span>
            <span class="kd">function</span><span class="p">:</span> <span class="s">"dispatch_compute"</span>
            <span class="nx">pipelines</span><span class="p">:</span> <span class="p">[</span><span class="s">"compute_write_texture3d"</span><span class="p">]</span>
            <span class="nx">uses</span><span class="p">:</span> <span class="p">[</span>
                <span class="p">[</span><span class="s">"compute_texture3d"</span><span class="p">,</span> <span class="s">"Write"</span><span class="p">]</span>
            <span class="p">]</span>
            <span class="nx">target_dimension</span><span class="p">:</span> <span class="s">"compute_texture3d"</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Bindless rendering again makes light work of the configuration because all textures and buffers we might want to use will already be allocated inside a heap, and each resource will have appropriate views set up based on usage flags supplied during creation. The only thing we need to know is the index of the resource view we wish to access a resource through. Resource usages can be specified in the <code class="language-plaintext highlighter-rouge">.pmfx</code> config, and you can also specify the target resource dimensions (the resource you are writing to) so that on the code side the thread group count can be worked out automatically.</p>

<div class="language-jsonnet highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">textures</span><span class="p">:</span> <span class="p">{</span>
    <span class="nx">gbuffer_albedo</span><span class="p">:</span> <span class="p">{</span>
        <span class="nx">ratio</span><span class="p">:</span> <span class="p">{</span>
            <span class="nx">window</span><span class="p">:</span> <span class="nx">main_dock</span>
            <span class="nx">scale</span><span class="p">:</span> <span class="mf">1.0</span>
        <span class="p">}</span>
        <span class="nb">format</span><span class="p">:</span> <span class="nx">RGBA16f</span>
        <span class="nx">usage</span><span class="p">:</span> <span class="p">[</span><span class="s">"ShaderResource"</span><span class="p">,</span> <span class="s">"RenderTarget"</span><span class="p">]</span>
        <span class="nx">samples</span><span class="p">:</span> <span class="mi">8</span>
    <span class="p">}</span>
    <span class="nx">gbuffer_normal</span><span class="p">(</span><span class="nx">gbuffer_albedo</span><span class="p">):</span> <span class="p">{}</span>
    <span class="nx">gbuffer_position</span><span class="p">(</span><span class="nx">gbuffer_albedo</span><span class="p">):</span> <span class="p">{}</span>
    <span class="nx">gbuffer_depth</span><span class="p">(</span><span class="nx">gbuffer_albedo</span><span class="p">):</span> <span class="p">{</span>
        <span class="nb">format</span><span class="p">:</span> <span class="nx">D24nS8u</span>
        <span class="nx">usage</span><span class="p">:</span> <span class="p">[</span><span class="s">"ShaderResource"</span><span class="p">,</span> <span class="s">"DepthStencil"</span><span class="p">]</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="nx">render_graphs</span><span class="p">:</span> <span class="p">{</span>
    <span class="nx">multiple_render_targets_test</span><span class="p">:</span> <span class="p">{</span>
        <span class="nx">meshes</span><span class="p">:</span> <span class="p">{</span>
            <span class="nx">view</span><span class="p">:</span> <span class="s">"heightmap_mrt_view"</span>
            <span class="nx">pipelines</span><span class="p">:</span> <span class="p">[</span>
                <span class="s">"heightmap_mrt"</span>
            <span class="p">]</span>
            <span class="kd">function</span><span class="p">:</span> <span class="s">"render_meshes_pipeline"</span>
        <span class="p">}</span>
        <span class="nx">resolve_mrt</span><span class="p">:</span> <span class="p">{</span>
            <span class="kd">function</span><span class="p">:</span> <span class="s">"dispatch_compute"</span>
            <span class="nx">pipelines</span><span class="p">:</span> <span class="p">[</span><span class="s">"heightmap_mrt_resolve"</span><span class="p">]</span>
            <span class="nx">uses</span><span class="p">:</span> <span class="p">[</span>
                <span class="p">[</span><span class="s">"staging_output"</span><span class="p">,</span> <span class="s">"Write"</span><span class="p">]</span>
                <span class="p">[</span><span class="s">"gbuffer_albedo"</span><span class="p">,</span> <span class="s">"ReadMsaa"</span><span class="p">]</span>
                <span class="p">[</span><span class="s">"gbuffer_normal"</span><span class="p">,</span> <span class="s">"ReadMsaa"</span><span class="p">]</span>
                <span class="p">[</span><span class="s">"gbuffer_position"</span><span class="p">,</span> <span class="s">"ReadMsaa"</span><span class="p">]</span>
                <span class="p">[</span><span class="s">"gbuffer_depth"</span><span class="p">,</span> <span class="s">"ReadMsaa"</span><span class="p">]</span>
            <span class="p">]</span>
            <span class="nx">target_dimension</span><span class="p">:</span> <span class="s">"staging_output"</span>
            <span class="nx">depends_on</span><span class="p">:</span> <span class="p">[</span><span class="s">"meshes"</span><span class="p">]</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This data is packed together with the resource dimensions, which are sometimes also useful to look up in the shader for sampling coordinates and so forth.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// To lookup resources in a shader, these are passed to compute shaders:</span>
<span class="cd">/// index = srv (read), uav (write)</span>
<span class="cd">/// dimension is the resource dimension where 2d textures will be (w, h, 1) and 3d will be (w, h, d)</span>
<span class="nd">#[repr(C)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">ResourceUse</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">index</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">dimension</span><span class="p">:</span> <span class="n">Vec3u</span>
<span class="p">}</span>

<span class="cd">/// Resource usage for a graph pass</span>
<span class="nd">#[derive(Serialize,</span> <span class="nd">Deserialize,</span> <span class="nd">Clone)]</span>
<span class="k">enum</span> <span class="n">ResourceUsage</span> <span class="p">{</span>
    <span class="cd">/// Write to an unordered access resource or render target resource</span>
    <span class="n">Write</span><span class="p">,</span>
    <span class="cd">/// Read from the primary (resolved) resource</span>
    <span class="n">Read</span><span class="p">,</span>
    <span class="cd">/// Read from an MSAA resource</span>
    <span class="n">ReadMsaa</span>
<span class="p">}</span>

<span class="c1">// pass the resource usage indices as push constants</span>
<span class="k">let</span> <span class="n">using_slot</span> <span class="o">=</span> <span class="n">pipeline</span><span class="nf">.get_pipeline_slot</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">DescriptorType</span><span class="p">::</span><span class="n">PushConstants</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">slot</span><span class="p">)</span> <span class="o">=</span> <span class="n">using_slot</span> <span class="p">{</span>
    <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">pass</span><span class="py">.use_indices</span><span class="nf">.len</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">num_constants</span> <span class="o">=</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">num_32bit_constants</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pass</span><span class="py">.use_indices</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
        <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.push_compute_constants</span><span class="p">(</span>
            <span class="mi">0</span><span class="p">,</span> 
            <span class="n">num_constants</span><span class="p">,</span> 
            <span class="n">i</span> <span class="k">as</span> <span class="nb">u32</span> <span class="o">*</span> <span class="n">num_constants</span><span class="p">,</span> 
            <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pass</span><span class="py">.use_indices</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
        <span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-hlsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">resource_use</span> <span class="p">{</span>
    <span class="n">uint</span>  <span class="n">index</span><span class="p">;</span>
    <span class="n">uint3</span> <span class="n">dimension</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">resource_uses</span> <span class="p">{</span>
    <span class="n">resource_use</span> <span class="n">input0</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input1</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input2</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input3</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input4</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input5</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input6</span><span class="p">;</span>
    <span class="n">resource_use</span> <span class="n">input7</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">ConstantBuffer</span><span class="o">&lt;</span><span class="n">resource_uses</span><span class="o">&gt;</span> <span class="n">resources</span><span class="o">:</span> <span class="k">register</span><span class="p">(</span><span class="n">b0</span><span class="p">);</span>

<span class="p">[</span><span class="nb">numthreads</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">)]</span>
<span class="kt">void</span> <span class="nf">cs_write_texture3d</span><span class="p">(</span><span class="n">uint3</span> <span class="n">did</span> <span class="o">:</span> <span class="n">SV_DispatchThreadID</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ..</span>

    <span class="n">rw_volume_textures</span><span class="p">[</span><span class="n">resources</span><span class="p">.</span><span class="n">input0</span><span class="p">.</span><span class="n">index</span><span class="p">][</span><span class="n">did</span><span class="p">.</span><span class="n">xyz</span><span class="p">]</span> <span class="o">=</span> <span class="kt">float4</span><span class="p">(</span><span class="n">nn</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="n">nn</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">.</span><span class="mi">9</span> <span class="o">?</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span> <span class="o">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For simple compute passes there may be no need for a user to specify any additional data, so the resource usage information is automatically bound and the compute shader is dispatched automatically. This, in addition to being able to write to multiple resources at the same time from a shader, and the opportunity to use single-pass jobs instead of the multi-pass ping-pong approaches seen in raster-based systems, is very appealing. I have yet to write any proper post-processes but the infrastructure is now in place.</p>
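<p>To illustrate the automatic dispatch described above, the thread group count can be derived from the target resource dimension and the shader’s thread group size. The following is a minimal sketch with assumed names, not hotline’s actual code:</p>

```rust
// Sketch only: derive a dispatch size from a target resource dimension and
// the shader's numthreads group size; the function name is an assumption.
fn dispatch_size_for_target(
    target_dim: (u32, u32, u32),
    numthreads: (u32, u32, u32)) -> (u32, u32, u32) {
    // round up so partial groups still cover the edges of the resource
    let groups = |dim: u32, threads: u32| (dim + threads - 1) / threads;
    (
        groups(target_dim.0, numthreads.0),
        groups(target_dim.1, numthreads.1),
        groups(target_dim.2, numthreads.2)
    )
}
```

<p>For example, a 64x64x64 target with <code class="language-plaintext highlighter-rouge">numthreads(8, 8, 8)</code> would yield an 8x8x8 dispatch.</p>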

<p>It might be necessary to drive more complex compute workflows with scene data, as is evident in the GPU-driven frustum culling example. Here, instead of having the compute shader dispatched automatically, you can supply a custom function that gets passed the aforementioned useful data (resource usage info, dimensions and so on).</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">dispatch_compute_frustum_cull</span><span class="p">(</span>
    <span class="n">pmfx</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Res</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">pass</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nn">pmfx</span><span class="p">::</span><span class="n">ComputePass</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">indirect_draw_query</span><span class="p">:</span> <span class="n">Query</span><span class="o">&lt;&amp;</span><span class="n">DrawIndirectComponent</span><span class="o">&gt;</span><span class="p">)</span> 
    <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="n">Error</span><span class="o">&gt;</span> <span class="p">{</span>

    <span class="c1">// custom code to setup compute pipelines</span>

    <span class="c1">// ..</span>

    <span class="n">pass</span><span class="py">.cmd_buf</span><span class="nf">.dispatch</span><span class="p">(</span>
        <span class="nn">gfx</span><span class="p">::</span><span class="n">Size3</span> <span class="p">{</span>
            <span class="n">x</span><span class="p">:</span> <span class="n">indirect_draw</span><span class="py">.max_count</span> <span class="o">/</span> <span class="n">pass</span><span class="py">.numthreads.x</span><span class="p">,</span>
            <span class="n">y</span><span class="p">:</span> <span class="n">pass</span><span class="py">.numthreads.y</span><span class="p">,</span>
            <span class="n">z</span><span class="p">:</span> <span class="n">pass</span><span class="py">.numthreads.z</span>
        <span class="p">},</span>
        <span class="n">pass</span><span class="py">.numthreads</span>
    <span class="p">);</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>

<p>After doing a few simple compute examples I was surprised not to see any Direct3D12 validation warnings or errors regarding resource states and prompts to insert transition barriers. At first I thought “great”, it might not be something I need to worry about, but after using the GPU-based validation I mentioned earlier to diagnose some GPU hangs, I noticed that there were some validation warnings spewing out in the console. Everything works fine so I haven’t tackled it yet, but having the validation messages to flag these issues is very useful, and with the resource usage information in compute passes and also in render passes, the <code class="language-plaintext highlighter-rouge">pmfx</code> system will be able to automatically insert barriers as I am already doing for render target to shader resource transitions, MSAA resolves and mip-map generation…</p>
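<p>To sketch the idea, a barrier can be derived by comparing a resource’s last known state against the state its declared usage requires. This is an assumption about one possible implementation, not pmfx’s actual code, and all names here are made up:</p>

```rust
// Sketch only: derive a transition barrier from a declared resource usage;
// the states and function names are assumptions, not pmfx's real API.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ResourceState {
    UnorderedAccess,
    ShaderResource
}

// map a pass usage ("Write", "Read" or "ReadMsaa") to the state it requires
fn required_state(usage: &str) -> ResourceState {
    match usage {
        "Write" => ResourceState::UnorderedAccess,
        _ => ResourceState::ShaderResource
    }
}

// return Some((before, after)) when a transition barrier is needed
fn barrier_for(prev: ResourceState, usage: &str)
    -> Option<(ResourceState, ResourceState)> {
    let next = required_state(usage);
    if prev != next { Some((prev, next)) } else { None }
}
```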

<h2 id="generate-mip-maps">Generate Mip Maps</h2>

<p>Direct3D11 and Metal provide mechanisms to generate mip-maps for textures at run time, but Direct3D12 has no such inbuilt functionality. Generating mips for textures such as render targets can be quite useful, so I added a quick utility to do this. It consists of a built-in compute shader which performs the downsample iteratively. I initially tried to make a single-pass downsample and had some reasonable results, but I put it on hold for the time being as it was taking longer than I anticipated. The internal implementation can be changed at a later time; here I just wanted to make sure the API was nice and easy to use. A <code class="language-plaintext highlighter-rouge">gfx::Texture</code> can be created with various flags; internally it may create multiple resources and resource views based on those flags. You can create a texture that allows run-time mip-map generation, and a similar process is also followed for MSAA resolves:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// create texture with usage GENERATE_MIP_MAPS</span>
<span class="k">let</span> <span class="n">info</span> <span class="o">=</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">TextureInfo</span> <span class="p">{</span>
    <span class="n">width</span><span class="p">,</span>
    <span class="n">height</span><span class="p">,</span>
    <span class="n">tex_type</span><span class="p">,</span>
    <span class="n">initial_state</span><span class="p">,</span>
    <span class="n">usage</span><span class="p">:</span> <span class="n">usage</span> <span class="p">|</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">TextureUsage</span><span class="p">::</span><span class="n">GENERATE_MIP_MAPS</span><span class="p">,</span>
    <span class="n">mip_levels</span><span class="p">,</span>
    <span class="n">depth</span><span class="p">:</span> <span class="n">pmfx_texture</span><span class="py">.depth</span><span class="p">,</span>
    <span class="n">array_layers</span><span class="p">:</span> <span class="n">pmfx_texture</span><span class="py">.array_layers</span><span class="p">,</span>
    <span class="n">samples</span><span class="p">:</span> <span class="n">pmfx_texture</span><span class="py">.samples</span><span class="p">,</span>
    <span class="n">format</span><span class="p">:</span> <span class="n">pmfx_texture</span><span class="py">.format</span><span class="p">,</span>
<span class="p">};</span>

<span class="c1">// create texture with heap</span>
<span class="k">let</span> <span class="n">tex</span> <span class="o">=</span> <span class="n">device</span><span class="py">.create_texture_with_heaps</span><span class="p">::</span><span class="o">&lt;</span><span class="nb">u8</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="o">&amp;</span><span class="n">info</span><span class="p">,</span>
    <span class="nn">gfx</span><span class="p">::</span><span class="n">TextureHeapInfo</span> <span class="p">{</span>
        <span class="n">shader</span><span class="p">:</span> <span class="n">heap</span><span class="p">,</span>
        <span class="o">..</span><span class="nn">Default</span><span class="p">::</span><span class="nf">default</span><span class="p">()</span>
    <span class="p">},</span>
    <span class="nb">None</span>
<span class="p">)</span><span class="o">?</span><span class="p">;</span>

<span class="c1">// resolve texture</span>
<span class="n">cmd_buf</span><span class="nf">.resolve_texture_subresource</span><span class="p">(</span><span class="n">tex</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

<span class="c1">// generate mips for texture</span>
<span class="n">cmd_buf</span><span class="nf">.generate_mip_maps</span><span class="p">(</span><span class="n">tex</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">device</span><span class="nf">.get_shader_heap</span><span class="p">())</span><span class="o">?</span><span class="p">;</span>
</code></pre></div></div>
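<p>For reference, the number of levels in a full mip chain follows directly from the largest dimension of the texture; this small helper is illustrative and not part of the <code class="language-plaintext highlighter-rouge">gfx</code> API:</p>

```rust
// Sketch only: number of levels in a full mip chain, halving down to 1x1;
// this helper is illustrative and not part of the gfx API.
fn full_mip_levels(width: u32, height: u32) -> u32 {
    // position of the highest set bit in the largest dimension gives the
    // chain length; e.g. 64x64 -> 7 levels (64, 32, 16, 8, 4, 2, 1)
    (32 - width.max(height).leading_zeros()).max(1)
}
```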

<h2 id="graphics-examples">Graphics Examples</h2>

<p>I have completed a relatively comprehensive set of graphics examples which demonstrate and test the implemented features of the hotline APIs, all integrated and using the entity component system kindly provided by <code class="language-plaintext highlighter-rouge">bevy_ecs</code>. Some of these examples are pretty basic and I am leaving them there for test purposes and to aid future work porting the engine to different platforms. Along the way I have been using these to explore different rendering techniques and get a rough idea of performance, and will ultimately decide on a final architecture that will be used under the hood by the <code class="language-plaintext highlighter-rouge">ecs</code>. This final architecture is starting to take shape, but I will do a quick run-down of the examples I have implemented so far. I went into more detail in a previous post about how the <code class="language-plaintext highlighter-rouge">ecs</code> works, but the gist of it is that you supply <code class="language-plaintext highlighter-rouge">setup</code>, <code class="language-plaintext highlighter-rouge">update</code>, and <code class="language-plaintext highlighter-rouge">render</code> functions that are <code class="language-plaintext highlighter-rouge">bevy_ecs</code> systems, along with a <code class="language-plaintext highlighter-rouge">render_graph</code> supplied in a <code class="language-plaintext highlighter-rouge">pmfx</code> config file. More information about each of the graphics <a href="https://github.com/polymonster/hotline#examples">examples</a> can be found in the hotline GitHub <a href="https://github.com/polymonster/hotline">repository</a>.</p>

<h2 id="next-up">Next Up</h2>

<p>I am going to continue researching GPU-driven rendering techniques and hopefully start creating some more advanced-looking demos. I hope you enjoyed this post; if you did, you can follow the social links on my website for more content on whichever platforms you use.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Implementing graphics demos and rendering techniques in the hotline Rust engine, tackling automated testing challenges and fleshing out the D3D12 gfx backend API.]]></summary></entry><entry><title type="html">Building a new graphics engine in Rust - Part 3</title><link href="https://polymonster.co.uk/blog/building-new-engine-3" rel="alternate" type="text/html" title="Building a new graphics engine in Rust - Part 3" /><published>2023-03-03T00:00:00+00:00</published><updated>2023-03-03T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/building-new-engine-3</id><content type="html" xml:base="https://polymonster.co.uk/blog/building-new-engine-3"><![CDATA[<p>Following on from the <a href="https://www.polymonster.co.uk/blog/bulding-new-engine-in-rust-2">part 2</a> post a little while ago, I have been continuing work on my graphics engine <a href="https://github.com/polymonster/hotline">hotline</a> in Rust. My recent focus has been on plugins, multi-threaded command buffer generation and hot reloading for Rust code, <code class="language-plaintext highlighter-rouge">hlsl</code> shader code and <code class="language-plaintext highlighter-rouge">pmfx</code> render configs. I have made decent progress to the point where there is something quite usable and structured in a way I am relatively happy with. This leg of the journey has been by far the most challenging though, so I wanted to write about my current progress and detail some of the issues I have faced.</p>

<p>Here are the results of my first sessions actually using the engine. I created these primitives and used the hot reloading system to iterate on them to get perfect vertices, normals, and uv-coordinates. My intention is to use this tool for graphics demos and procedural generation, so the focus is on making a live coding environment and not an interactive GUI editor. The visual <code class="language-plaintext highlighter-rouge">client</code> provides some feedback and information to the user, but it is not really an editor as such; the editing happens in source code and data files, which is reflected in the <code class="language-plaintext highlighter-rouge">client</code>. In time I may decide to add more interactive editor features but for now it’s all about coding. Here’s a demo video of some of the features:</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/jkD78gXfIe0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<p>I am using a single screen with vscode on the left and hotline on the right; by launching the hotline client executable from the vscode terminal, hot reloading errors are printed in the terminal and the line number links are clickable to jump straight to the error lines.</p>

<h2 id="recap">Recap</h2>

<p>I had previously created <code class="language-plaintext highlighter-rouge">gfx</code>, <code class="language-plaintext highlighter-rouge">os</code> and <code class="language-plaintext highlighter-rouge">av</code> abstraction APIs that currently have Windows-specific backend implementations, but the APIs are designed to easily add more platforms in the future. At this time I am trying to push as far ahead as possible on a single platform because I have already spent a lot of time working on cross-platform support in my C++ <a href="https://github.com/polymonster/pmtech">game engine</a> and on the engines I have worked on for my day job. Cross-platform maintenance can become time-consuming, so for a little while I have decided just to focus on feature development.</p>

<p>I have also been on a few side quests that have fallen under the umbrella of this graphics engine project, but I wrote about those separately. They were implementing <a href="https://www.polymonster.co.uk/blog/imgui-backend">imgui</a> with viewports and docking, and <a href="https://github.com/polymonster/maths-rs">maths-rs</a>, a linear algebra library I have been working on while away from my Windows desktop machine. A few people asked about maths-rs and why I don’t just use an existing library; I simply wanted something to work on using my laptop in the spare time I had, and a maths library was the first thing I thought of. This is a downside of having a Windows-only engine: my opportunity to work on it is limited to being chained to a machine in my house. I already had a C++ <a href="https://github.com/polymonster/maths">maths library</a> that I ported a lot of the code from, but while I was there I improved the API consistency, added more overlap, intersection and distance functions to both libraries, and added more tests to assist porting to <code class="language-plaintext highlighter-rouge">maths-rs</code>. Now I’m at the point where I can use all the libraries to start building graphics demos.</p>
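<p>As a flavour of the kind of helper this covers, a point-to-sphere distance might look like the following; the signature is illustrative and not maths-rs’s actual API:</p>

```rust
// Sketch only: distance from a point to the surface of a sphere, clamped to
// zero when the point is inside; illustrative, not the maths-rs signature.
fn distance_point_sphere(p: (f32, f32, f32), centre: (f32, f32, f32), radius: f32) -> f32 {
    // vector from the sphere centre to the point
    let d = (p.0 - centre.0, p.1 - centre.1, p.2 - centre.2);
    let len = (d.0 * d.0 + d.1 * d.1 + d.2 * d.2).sqrt();
    (len - radius).max(0.0)
}
```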

<h2 id="cratesio">Crates.io</h2>

<p>You can use <a href="https://github.com/polymonster/hotline">hotline</a> as a library and use the <code class="language-plaintext highlighter-rouge">gfx</code>, <code class="language-plaintext highlighter-rouge">os</code>, <code class="language-plaintext highlighter-rouge">av</code>, <code class="language-plaintext highlighter-rouge">imgui</code>, <code class="language-plaintext highlighter-rouge">pmfx</code>, and any other modules it provides. It is now available on <a href="https://crates.io/crates/hotline-rs">crates.io</a>. To my dismay, the crates.io registry entry for the name “hotline” was already taken as a placeholder by someone <a href="https://crates.io/search?q=hotline">else</a>. The same happened to <a href="https://crates.io/search?q=maths">maths</a>, so for both my projects on crates.io I had to call them <code class="language-plaintext highlighter-rouge">hotline-rs</code> and <code class="language-plaintext highlighter-rouge">maths-rs</code>. It’s a bit disappointing that people claim names and then haven’t produced any code yet. I’d be fine with someone claiming the name first if they actually had a decent, usable package.</p>

<p>I had some trouble with crates.io because the package size exceeded the lofty limit of 10MB! Most of my repository size was a result of some executables I am using to build data and shaders. I have these tools from prior work, so I am still using the Python-based build system (built into executables with help from <a href="https://pyinstaller.org/en/stable/">PyInstaller</a>). To reduce the repository size I moved the executables and data files for the hotline examples into their own GitHub repository, <a href="https://github.com/polymonster/hotline-data">hotline-data</a>, which is cloned inside <code class="language-plaintext highlighter-rouge">hotline</code> as part of a <code class="language-plaintext highlighter-rouge">cargo build</code>.</p>

<p>This data feature is optional and enabled by default. It means you can either bring your own build system or use the one provided, and more importantly the feature can be disabled for package builds when publishing to crates.io.</p>
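<p>For illustration, such an optional default feature could be declared in <code class="language-plaintext highlighter-rouge">Cargo.toml</code> something like this; the feature name here is hypothetical, not necessarily what hotline uses:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[features]
# enabled by default; disable with --no-default-features for package builds
default = ["build-data"]
# when enabled, the build step fetches the hotline-data repository
build-data = []
</code></pre></div></div>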

<p>I also had some compilation issues when publishing the package to crates.io, because currently Windows is the only supported platform. The core APIs are generic using compile-time traits, but samples and plugins need to instantiate a concrete type of GPU <code class="language-plaintext highlighter-rouge">Device</code> or operating system <code class="language-plaintext highlighter-rouge">Window</code>, and there are no supported macOS or Linux backends yet. In time I would like to add a stub implementation for each module. For now I worked around this by marking the entire files that require concrete types as Windows-only.</p>
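<p>As a rough sketch of this kind of gating, a module containing concrete backend types can be compiled only on Windows, with a stub taking its place elsewhere. The module and function names below are illustrative, not hotline’s actual layout:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// the concrete Direct3D12-style backend only compiles on Windows
#[cfg(target_os = "windows")]
mod d3d12_backend {
    pub fn backend_name() -&gt; &amp;'static str { "d3d12" }
}

// a stub module keeps the crate compiling on other platforms
#[cfg(not(target_os = "windows"))]
mod stub_backend {
    pub fn backend_name() -&gt; &amp;'static str { "stub" }
}

// alias whichever backend is active so the rest of the code is unchanged
#[cfg(target_os = "windows")]
use d3d12_backend as active_backend;
#[cfg(not(target_os = "windows"))]
use stub_backend as active_backend;

fn main() {
    println!("active gfx backend: {}", active_backend::backend_name());
}
</code></pre></div></div>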

<h2 id="plugin-architecture--hot-reloading">Plugin-Architecture / Hot Reloading</h2>

<p>The main work I have been focusing on for the last few weeks is making live-reloadable code work through a plugin system, where plugins can be loaded dynamically at run time with no modifications required to the <code class="language-plaintext highlighter-rouge">client</code> executable. The <code class="language-plaintext highlighter-rouge">client</code> provides a very thin wrapper around a main loop: it creates some core resources such as <code class="language-plaintext highlighter-rouge">os::App</code> and <code class="language-plaintext highlighter-rouge">gfx::Device</code>, provides a core loop that submits command lists and swaps buffers, and makes it easy to hook in your own update or render logic. With the <code class="language-plaintext highlighter-rouge">client</code> running, <code class="language-plaintext highlighter-rouge">plugins</code> can be dynamically loaded from <code class="language-plaintext highlighter-rouge">dylibs</code>, and code changes are detected, causing the library to be rebuilt and reloaded while the client is still running. I am using <a href="https://crates.io/crates/hot-lib-reloader">hot-lib-reloader</a> to assist with the lib reloading, although I need to bypass some of its cool features, like the <code class="language-plaintext highlighter-rouge">hot_functions_from_file</code> macro, because I wanted to remove the dependency on the <code class="language-plaintext highlighter-rouge">client</code> knowing about the plugins.</p>

<p>Creating a new plugin is quite easy: first you need a dynamic library crate type. The <a href="https://github.com/polymonster/hotline/tree/master/plugins">plugins</a> directory in hotline has a few different plugins that can be used as examples, but you basically just need a <code class="language-plaintext highlighter-rouge">Cargo.toml</code> like this:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[package]</span>
<span class="py">name</span> <span class="p">=</span> <span class="s">"ecs"</span>
<span class="py">version</span> <span class="p">=</span> <span class="s">"0.1.0"</span>
<span class="py">edition</span> <span class="p">=</span> <span class="s">"2021"</span>

<span class="nn">[lib]</span>
<span class="py">crate-type</span> <span class="p">=</span> <span class="p">[</span><span class="s">"rlib"</span><span class="p">,</span> <span class="s">"dylib"</span><span class="p">]</span>

<span class="nn">[dependencies]</span>
<span class="nn">hotline-rs</span> <span class="o">=</span> <span class="p">{</span> <span class="py">path</span> <span class="p">=</span> <span class="s">"../.."</span> <span class="p">}</span>
</code></pre></div></div>

<p>Inside a dynamic library plugin you can choose to get hooked into a few core function calls from the client each frame by implementing the <code class="language-plaintext highlighter-rouge">Plugin</code> trait:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>

<span class="k">pub</span> <span class="k">struct</span> <span class="n">EmptyPlugin</span><span class="p">;</span>

<span class="k">impl</span> <span class="n">Plugin</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">EmptyPlugin</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">create</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="k">Self</span> <span class="p">{</span>
        <span class="n">EmptyPlugin</span> <span class="p">{</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">setup</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">client</span><span class="p">:</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span><span class="p">)</span> 
        <span class="k">-&gt;</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"plugin setup"</span><span class="p">);</span>
        <span class="n">client</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">update</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">client</span><span class="p">:</span> <span class="nn">client</span><span class="p">::</span><span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span><span class="p">)</span>
        <span class="k">-&gt;</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"plugin update"</span><span class="p">);</span>
        <span class="n">client</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">unload</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">client</span><span class="p">:</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span><span class="p">)</span>
        <span class="k">-&gt;</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"plugin unload"</span><span class="p">);</span>
        <span class="n">client</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">ui</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">client</span><span class="p">:</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span><span class="p">)</span>
    <span class="k">-&gt;</span> <span class="n">Client</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="p">,</span> <span class="nn">os_platform</span><span class="p">::</span><span class="n">App</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"plugin ui"</span><span class="p">);</span>
        <span class="n">client</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="nd">hotline_plugin!</span><span class="p">[</span><span class="n">EmptyPlugin</span><span class="p">];</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">hotline_plugin!</code> macro creates a c-abi wrapper around the <code class="language-plaintext highlighter-rouge">Plugin</code> trait. I initially tried to use a <code class="language-plaintext highlighter-rouge">Box&lt;dyn Plugin&gt;</code> which was returned from the plugin library to the main client executable so the trait functions could be called, but when trying to look up the functions in the <code class="language-plaintext highlighter-rouge">vtable</code>, the memory seemed to be garbage. After some investigation this seems to be because Rust does not have a stable ABI, so I created the macro to work around it: the plugin is allocated on the heap, an FFI pointer is passed back to the client, and that pointer is then passed into the c-abi functions the macro generates. I expected that, with both sides built by the same compiler, I wouldn’t need the wrapper API, but I was unable to get it working.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[macro_export]</span>
<span class="nd">macro_rules!</span> <span class="n">hotline_plugin</span> <span class="p">{</span>
    <span class="p">(</span><span class="nv">$input:ident</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>
        
        <span class="c1">// c-abi wrapper for `Plugin::create`</span>
        <span class="nd">#[no_mangle]</span>
        <span class="k">pub</span> <span class="k">fn</span> <span class="nf">create</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="o">*</span><span class="k">mut</span> <span class="nn">core</span><span class="p">::</span><span class="nn">ffi</span><span class="p">::</span><span class="nb">c_void</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">ptr</span> <span class="o">=</span> <span class="nn">new_plugin</span><span class="p">::</span><span class="o">&lt;</span><span class="nv">$input</span><span class="o">&gt;</span><span class="p">()</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nn">core</span><span class="p">::</span><span class="nn">ffi</span><span class="p">::</span><span class="nb">c_void</span><span class="p">;</span>
            <span class="k">unsafe</span> <span class="p">{</span>
                <span class="k">let</span> <span class="n">plugin</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nn">transmute</span><span class="p">::</span><span class="o">&lt;*</span><span class="k">mut</span> <span class="nn">core</span><span class="p">::</span><span class="nn">ffi</span><span class="p">::</span><span class="nb">c_void</span><span class="p">,</span> <span class="o">*</span><span class="k">mut</span> <span class="nv">$input</span><span class="o">&gt;</span><span class="p">(</span><span class="n">ptr</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">plugin</span> <span class="o">=</span> <span class="n">plugin</span><span class="nf">.as_mut</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
                <span class="o">*</span><span class="n">plugin</span> <span class="o">=</span> <span class="nv">$input</span><span class="p">::</span><span class="nf">create</span><span class="p">();</span>
            <span class="p">}</span>
            <span class="n">ptr</span>
        <span class="p">}</span>

        <span class="c1">// ..</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
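<p>The heap-allocate / FFI-pointer round trip the macro relies on can be sketched with plain <code class="language-plaintext highlighter-rouge">std</code> as follows; <code class="language-plaintext highlighter-rouge">EmptyPlugin</code> here is a simplified stand-in type rather than hotline’s generated code:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct EmptyPlugin { frames: u32 }

// c-abi entry point: box the plugin and hand an opaque pointer across the boundary
#[no_mangle]
pub extern "C" fn create() -&gt; *mut core::ffi::c_void {
    Box::into_raw(Box::new(EmptyPlugin { frames: 0 })) as *mut core::ffi::c_void
}

// c-abi wrapper: cast the opaque pointer back to the concrete type and call into it
#[no_mangle]
pub extern "C" fn update(ptr: *mut core::ffi::c_void) {
    let plugin = unsafe { &amp;mut *(ptr as *mut EmptyPlugin) };
    plugin.frames += 1;
}

// c-abi destructor: reconstruct the Box so the allocation is freed
#[no_mangle]
pub extern "C" fn destroy(ptr: *mut core::ffi::c_void) {
    unsafe { drop(Box::from_raw(ptr as *mut EmptyPlugin)) };
}

fn main() {
    let plugin = create();
    update(plugin);
    destroy(plugin);
}
</code></pre></div></div>

<p>Because only an opaque <code class="language-plaintext highlighter-rouge">c_void</code> pointer and c-abi functions cross the library boundary, no Rust layout or vtable assumptions are made between the client and the plugin.</p>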

<h2 id="ecs-plugin">ECS Plugin</h2>

<p>Not all plugins have to implement the <code class="language-plaintext highlighter-rouge">Plugin</code> trait; plugins can extend others in custom ways. I started on a basic <code class="language-plaintext highlighter-rouge">ecs</code> that uses <a href="https://docs.rs/bevy_ecs/latest/bevy_ecs/">bevy_ecs</a> and the bevy <code class="language-plaintext highlighter-rouge">Scheduler</code> to distribute work onto different threads. The reason for this <code class="language-plaintext highlighter-rouge">plugin-ception</code> kind of approach is to be able to edit the core <code class="language-plaintext highlighter-rouge">ecs</code>, as well as extension plugins, while the client is running, and also to keep the whole thing as flexible as possible, allowing totally different types of plugins to be implemented and worked on with hot reloading.</p>

<p>To load functions from other libraries you can access the <code class="language-plaintext highlighter-rouge">libs</code> currently loaded in the hotline <code class="language-plaintext highlighter-rouge">client</code>. These are just wrappers around <a href="https://docs.rs/libloading/latest/libloading/">libloading</a> that allow you to retrieve a <code class="language-plaintext highlighter-rouge">Symbol&lt;T&gt;</code> by name.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// Finds available demo names from inside ecs compatible plugins by calling the function `get_demos_&lt;lib_name&gt;` in each loaded lib</span>
<span class="k">fn</span> <span class="nf">get_demo_list</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">client</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">PlatformClient</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">demos</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">lib_name</span><span class="p">,</span> <span class="n">lib</span><span class="p">)</span> <span class="k">in</span> <span class="o">&amp;</span><span class="n">client</span><span class="py">.libs</span> <span class="p">{</span>
        <span class="k">unsafe</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">function_name</span> <span class="o">=</span> <span class="nd">format!</span><span class="p">(</span><span class="s">"get_demos_{}"</span><span class="p">,</span> <span class="n">lib_name</span><span class="p">)</span><span class="nf">.to_string</span><span class="p">();</span>
            <span class="k">let</span> <span class="n">list</span> <span class="o">=</span> <span class="n">lib</span><span class="py">.get_symbol</span><span class="p">::</span><span class="o">&lt;</span><span class="k">unsafe</span> <span class="k">extern</span> <span class="k">fn</span><span class="p">()</span> <span class="k">-&gt;</span>  <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">function_name</span><span class="nf">.as_bytes</span><span class="p">());</span>
            <span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">list_fn</span><span class="p">)</span> <span class="o">=</span> <span class="n">list</span> <span class="p">{</span>
                <span class="k">let</span> <span class="k">mut</span> <span class="n">lib_demos</span> <span class="o">=</span> <span class="nf">list_fn</span><span class="p">();</span>
                <span class="n">demos</span><span class="nf">.append</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">lib_demos</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">demos</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The core <code class="language-plaintext highlighter-rouge">ecs</code> provides some functionality to create <code class="language-plaintext highlighter-rouge">setup</code>, <code class="language-plaintext highlighter-rouge">update</code> or <code class="language-plaintext highlighter-rouge">render</code> systems. You can add your own system functions inside different <code class="language-plaintext highlighter-rouge">plugins</code> and have the <code class="language-plaintext highlighter-rouge">ecs</code> plugin locate these systems to build schedules for different <code class="language-plaintext highlighter-rouge">demos</code>. All system stages get dispatched concurrently on different threads, so in time it’s likely more stages will be added alongside <code class="language-plaintext highlighter-rouge">setup</code>, <code class="language-plaintext highlighter-rouge">update</code> and <code class="language-plaintext highlighter-rouge">render</code>. Defining custom systems is quite straightforward; these are just <code class="language-plaintext highlighter-rouge">bevy_ecs</code> systems:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// update system which takes hotline resources `main_window`, `pmfx` and `app`</span>
<span class="nd">#[no_mangle]</span>
<span class="k">fn</span> <span class="nf">update_cameras</span><span class="p">(</span>
    <span class="n">app</span><span class="p">:</span> <span class="n">Res</span><span class="o">&lt;</span><span class="n">AppRes</span><span class="o">&gt;</span><span class="p">,</span> 
    <span class="n">main_window</span><span class="p">:</span> <span class="n">Res</span><span class="o">&lt;</span><span class="n">MainWindowRes</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="k">mut</span> <span class="n">pmfx</span><span class="p">:</span> <span class="n">ResMut</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="k">mut</span> <span class="n">query</span><span class="p">:</span> <span class="n">Query</span><span class="o">&lt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">Name</span><span class="p">,</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">Position</span><span class="p">,</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">Rotation</span><span class="p">,</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">ViewProjectionMatrix</span><span class="p">),</span> <span class="n">With</span><span class="o">&lt;</span><span class="n">Camera</span><span class="o">&gt;&gt;</span><span class="p">)</span> <span class="p">{</span>    
    <span class="k">let</span> <span class="n">app</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">app</span><span class="na">.0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="k">mut</span> <span class="n">position</span><span class="p">,</span> <span class="k">mut</span> <span class="n">rotation</span><span class="p">,</span> <span class="k">mut</span> <span class="n">view_proj</span><span class="p">)</span> <span class="k">in</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">query</span> <span class="p">{</span>
        <span class="c1">// ..</span>
    <span class="p">}</span>

    <span class="c1">// ..</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Rendering systems get generated from render graphs specified through the <code class="language-plaintext highlighter-rouge">pmfx</code> system, and hook themselves into a <code class="language-plaintext highlighter-rouge">bevy_ecs</code> system function call.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// render system which takes hotline resource `pmfx` and a `pmfx::View`</span>
<span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">render_meshes</span><span class="p">(</span>
    <span class="n">pmfx</span><span class="p">:</span> <span class="o">&amp;</span><span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Res</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">view</span><span class="p">:</span> <span class="o">&amp;</span><span class="nn">pmfx</span><span class="p">::</span><span class="n">View</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">mesh_draw_query</span><span class="p">:</span> <span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Query</span><span class="o">&lt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">WorldMatrix</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">MeshComponent</span><span class="p">)</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="n">Error</span><span class="o">&gt;</span> <span class="p">{</span>
    
    <span class="c1">// ..</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In order to dynamically locate and call these functions we need to supply a bit of boilerplate to look up the functions by name.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// Register demo names for this plugin which is called `ecs_demos`</span>
<span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">get_demos_ecs_demos</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="nd">demos!</span><span class="p">[</span>
        <span class="s">"primitives"</span><span class="p">,</span>
        <span class="s">"draw_indexed"</span><span class="p">,</span>
        <span class="s">"draw_indexed_push_constants"</span><span class="p">,</span>

        <span class="c1">// ..</span>
    <span class="p">]</span>
<span class="p">}</span>

<span class="cd">/// Register plugin system functions</span>
<span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">get_system_ecs_demos</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span> <span class="n">view_name</span><span class="p">:</span> <span class="nb">String</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Option</span><span class="o">&lt;</span><span class="n">SystemDescriptor</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">match</span> <span class="n">name</span><span class="nf">.as_str</span><span class="p">()</span> <span class="p">{</span>
        <span class="c1">// setup functions</span>
        <span class="s">"setup_draw_indexed"</span> <span class="k">=&gt;</span> <span class="nd">system_func!</span><span class="p">[</span><span class="n">setup_draw_indexed</span><span class="p">],</span>
        <span class="s">"setup_primitives"</span> <span class="k">=&gt;</span> <span class="nd">system_func!</span><span class="p">[</span><span class="n">setup_primitives</span><span class="p">],</span>
        <span class="s">"setup_draw_indexed_push_constants"</span> <span class="k">=&gt;</span> <span class="nd">system_func!</span><span class="p">[</span><span class="n">setup_draw_indexed_push_constants</span><span class="p">],</span>

        <span class="c1">// render functions</span>
        <span class="s">"render_meshes"</span> <span class="k">=&gt;</span> <span class="nd">render_func!</span><span class="p">[</span><span class="n">render_meshes</span><span class="p">,</span> <span class="n">view_name</span><span class="p">],</span>

        <span class="c1">// I had to add this `std::hint::black_box`!</span>
        <span class="n">_</span> <span class="k">=&gt;</span> <span class="nn">std</span><span class="p">::</span><span class="nn">hint</span><span class="p">::</span><span class="nf">black_box</span><span class="p">(</span><span class="nb">None</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I hope to use <code class="language-plaintext highlighter-rouge">#[derive()]</code> macros to reduce the need for this boilerplate code, but I haven’t looked into it in much detail yet. I also had to add <code class="language-plaintext highlighter-rouge">std::hint::black_box</code> around the <code class="language-plaintext highlighter-rouge">None</code> case in the <code class="language-plaintext highlighter-rouge">get_system_</code> functions. I am getting away with calling these functions without a c-abi wrapper here, so that might be the reason. Everything is working for the time being, but I am prepared to address this if need be.</p>
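<p>A stripped-down sketch of the lookup pattern, using only <code class="language-plaintext highlighter-rouge">std</code>; the function name is illustrative, and the real functions return <code class="language-plaintext highlighter-rouge">Option&lt;SystemDescriptor&gt;</code> rather than a string:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::hint::black_box;

// stand-in for a `get_system_` style lookup exported from a plugin dylib
#[no_mangle]
pub fn get_system_sketch(name: &amp;str) -&gt; Option&lt;&amp;'static str&gt; {
    match name {
        "setup_primitives" =&gt; Some("setup_primitives"),
        // black_box prevents the compiler from optimising this arm in a way
        // that appeared to break the dynamically loaded symbol
        _ =&gt; black_box(None)
    }
}

fn main() {
    println!("{:?}", get_system_sketch("setup_primitives"));
}
</code></pre></div></div>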

<h2 id="pmfx">Pmfx</h2>

<p>Another core engine feature I have been working on is <code class="language-plaintext highlighter-rouge">pmfx</code>, a high-level, platform-agnostic graphics API that builds on top of the lower-level <code class="language-plaintext highlighter-rouge">gfx</code> API. The idea here is that the <code class="language-plaintext highlighter-rouge">gfx</code> backends are fairly dumb wrapper APIs, and <code class="language-plaintext highlighter-rouge">pmfx</code> brings that low-level functionality together in a way which is shared amongst different platforms. <code class="language-plaintext highlighter-rouge">pmfx</code> is also a data-driven rendering system where render pipelines, passes, views, and graphs can be specified in <a href="https://github.com/polymonster/jsn">jsn</a> config files to make light work of configuring rendering. This is not new code, and it’s something I have worked on and used in other code bases, but it is currently undergoing an overhaul to bring it more in line with modern graphics API architectures. The main <a href="https://github.com/polymonster/pmfx-shader">pmfx-shader</a> repository contains the data side of all of this.</p>

<p>So how does it work? You can write regular <code class="language-plaintext highlighter-rouge">hlsl</code> shaders and then supply <code class="language-plaintext highlighter-rouge">pmfx</code> files, which are used to create <code class="language-plaintext highlighter-rouge">pipelines</code>, <code class="language-plaintext highlighter-rouge">views</code>, <code class="language-plaintext highlighter-rouge">textures</code> (and render targets) and more. <code class="language-plaintext highlighter-rouge">views</code> are like render passes but with a bit more detail, such as a function that can be dispatched into a render pass with a camera for example.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>textures: {
    main_colour: {
        ratio: {
            window: "main_window",
            scale: 1.0
        }
        format: "RGBA8n"
        usage: ["ShaderResource", "RenderTarget"]
        samples: 8
    }
    main_depth(main_colour): {
        format: "D24nS8u"
        usage: ["ShaderResource", "DepthStencil"]
        samples: 8
    }
}
views: {
    main_view: {
        render_target: [
            "main_colour"
        ]
        clear_colour: [0.45, 0.55, 0.60, 1.0]
        depth_stencil: [
            "main_depth"
        ]
        clear_depth: 1.0
        viewport: [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
        camera: "main_camera"
    }
    main_view_no_clear(main_view): {
        clear_colour: null
        clear_depth: null
    }
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">pmfx</code> config files supply useful defaults to minimise the number of members that need initialising to set up render state, and <code class="language-plaintext highlighter-rouge">pmfx</code> can parse <code class="language-plaintext highlighter-rouge">hlsl</code> files, with extra context provided through <code class="language-plaintext highlighter-rouge">pipelines</code>, to generate shader reflection info, descriptor layouts, and more, with further output yet to come.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pipelines: {
    mesh_debug: {
        vs: vs_mesh
        ps: ps_checkerboard
        push_constants: [
            "view_push_constants"
            "draw_push_constants"
        ]
        depth_stencil_state: depth_test_less
        raster_state: cull_back
        topology: "TriangleList"
    }
}
</code></pre></div></div>

<p>You can supply render graphs which are built at run time, with automatic resource transitions and barriers inserted based on dependencies. This is still in its early stages because my use cases are currently quite simple, but in time I expect it to grow a lot more:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>render_graphs: {
    mesh_debug: {
        grid: {
            view: "main_view"
            pipelines: ["imdraw_3d"]
            function: "render_grid"
        }
        meshes: {
            view: "main_view_no_clear"
            pipelines: ["mesh_debug"]
            function: "render_meshes"
            depends_on: ["grid"]
        }
        wireframe: {
            view: "main_view_no_clear"
            pipelines: ["wireframe_overlay"]
            function: "render_meshes"
            depends_on: ["meshes", "grid"]
        }
    }
}
</code></pre></div></div>

<p>You can take a look at a simple example of <a href="https://github.com/polymonster/hotline-data/blob/master/src/shaders">pmfx</a> supplied with the hotline repository. From this file a reflection <a href="https://github.com/polymonster/pmfx-shader/blob/master/examples/outputs/v2_info.json">info file</a> is generated, and the <code class="language-plaintext highlighter-rouge">hlsl</code> source is compiled into byte code with <code class="language-plaintext highlighter-rouge">DXC</code>. You can supply compile-time flags that are evaluated to generate shader permutations. Shaders which share the same source code, even though their permutation flags may differ, are hashed and re-used so that as few shaders as possible are generated and compiled. <code class="language-plaintext highlighter-rouge">pmfx</code> also carefully tracks all shaders and render states so only minimal changes get reloaded.</p>

<h3 id="pmfx-rust">Pmfx Rust</h3>

<p>The Rust side of <code class="language-plaintext highlighter-rouge">pmfx</code> uses <a href="https://docs.rs/serde/latest/serde/">serde</a> to serialise and deserialise json into <code class="language-plaintext highlighter-rouge">hotline_rs::gfx</code> structures so they can be passed straight to the <code class="language-plaintext highlighter-rouge">gfx</code> API. <code class="language-plaintext highlighter-rouge">pmfx</code> tracks the source files that shaders and <code class="language-plaintext highlighter-rouge">pmfx</code> render configs depend on, and triggers re-builds when changes are detected.</p>
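
<p>The file-tracking side can be illustrated with a small std-only sketch; <code class="language-plaintext highlighter-rouge">needs_rebuild</code> is a hypothetical helper for illustration, not hotline’s actual API:</p>

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Returns true when `path` has been modified since `last_seen`.
/// Missing files report no change rather than panicking.
fn needs_rebuild(path: &Path, last_seen: SystemTime) -> bool {
    fs::metadata(path)
        .and_then(|m| m.modified())
        .map(|mtime| mtime > last_seen)
        .unwrap_or(false)
}
```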

<p>All of the render config states and objects are exported along with hashes, and these hashes can be checked against the live resources in use inside the <code class="language-plaintext highlighter-rouge">client</code>; only changed resources get re-compiled and reloaded. These checks to minimise shader rebuilds and reloads should help mitigate compilation costs where many shader permutations can cause a combinatorial explosion.</p>
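
<p>The shape of that hash comparison can be sketched like this, using <code class="language-plaintext highlighter-rouge">DefaultHasher</code> purely for illustration (hotline exports its own hashes at build time):</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a serialised render-state description.
fn state_hash(serialised: &str) -> u64 {
    let mut h = DefaultHasher::new();
    serialised.hash(&mut h);
    h.finish()
}

/// A resource only needs reloading when its new description hashes
/// differently from the hash recorded for the live resource.
fn needs_reload(live_hash: u64, new_desc: &str) -> bool {
    state_hash(new_desc) != live_hash
}
```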

<p>The <code class="language-plaintext highlighter-rouge">pmfx</code> API can be used to load <code class="language-plaintext highlighter-rouge">pipelines</code> and <code class="language-plaintext highlighter-rouge">render_graphs</code> and then those resources can be found by name. Ownership of the resources remains with <code class="language-plaintext highlighter-rouge">pmfx</code> itself and render systems can borrow the resources for a short time on the stack to pass them into command buffers.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// load and create resources</span>
<span class="k">let</span> <span class="n">pmfx_bindless</span> <span class="o">=</span> <span class="n">asset_path</span><span class="nf">.join</span><span class="p">(</span><span class="s">"data/shaders/bindless"</span><span class="p">);</span>
<span class="n">pmfx</span><span class="nf">.load</span><span class="p">(</span><span class="n">pmfx_bindless</span><span class="nf">.to_str</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">())</span><span class="o">?</span><span class="p">;</span>
<span class="n">pmfx</span><span class="nf">.create_pipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="n">dev</span><span class="p">,</span> <span class="s">"compute_rw"</span><span class="p">,</span> <span class="n">swap_chain</span><span class="nf">.get_backbuffer_pass</span><span class="p">())</span><span class="o">?</span><span class="p">;</span>
<span class="n">pmfx</span><span class="nf">.create_pipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="n">dev</span><span class="p">,</span> <span class="s">"bindless"</span><span class="p">,</span> <span class="n">swap_chain</span><span class="nf">.get_backbuffer_pass</span><span class="p">())</span><span class="o">?</span><span class="p">;</span>

<span class="c1">// borrow resources (we need to get a pipeline built for a compatible render pass)</span>
<span class="k">let</span> <span class="n">fmt</span> <span class="o">=</span> <span class="n">swap_chain</span><span class="nf">.get_backbuffer_pass</span><span class="p">()</span><span class="nf">.get_format_hash</span><span class="p">();</span>
<span class="k">let</span> <span class="n">pso_pmfx</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_render_pipeline_for_format</span><span class="p">(</span><span class="s">"bindless"</span><span class="p">,</span> <span class="n">fmt</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
<span class="k">let</span> <span class="n">pso_compute</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_compute_pipeline</span><span class="p">(</span><span class="s">"compute_rw"</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>

<span class="c1">// use resource in command buffers</span>
<span class="n">cmdbuffer</span><span class="nf">.set_compute_pipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pso_compute</span><span class="p">);</span>

<span class="c1">// ..</span>

<span class="n">cmdbuffer</span><span class="nf">.set_render_pipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pso_pmfx</span><span class="p">);</span>
</code></pre></div></div>

<p>Views are a <code class="language-plaintext highlighter-rouge">pmfx</code> feature that starts to lean into <code class="language-plaintext highlighter-rouge">bevy_ecs</code>, which contains entities such as <code class="language-plaintext highlighter-rouge">cameras</code> and <code class="language-plaintext highlighter-rouge">meshes</code> that can be used to render world views. A view contains a command buffer that can be generated each frame, a camera (view constants) that can be bound for the pass, and a render pass to render into. This is all passed to a <code class="language-plaintext highlighter-rouge">bevy_ecs</code> system function, which is dispatched on the CPU concurrently with any other render systems. Each view has its own command buffer and the jobs are otherwise read-only, so they can be safely dispatched on different threads at the same time. You can build command buffers and make draw calls like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">render_meshes</span><span class="p">(</span>
    <span class="n">pmfx</span><span class="p">:</span> <span class="o">&amp;</span><span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Res</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">view</span><span class="p">:</span> <span class="o">&amp;</span><span class="nn">pmfx</span><span class="p">::</span><span class="n">View</span><span class="o">&lt;</span><span class="nn">gfx_platform</span><span class="p">::</span><span class="n">Device</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">mesh_draw_query</span><span class="p">:</span> <span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Query</span><span class="o">&lt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">WorldMatrix</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">MeshComponent</span><span class="p">)</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="nn">hotline_rs</span><span class="p">::</span><span class="n">Error</span><span class="o">&gt;</span> <span class="p">{</span>
        
    <span class="k">let</span> <span class="n">pmfx</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">pmfx</span><span class="na">.0</span><span class="p">;</span>

    <span class="k">let</span> <span class="n">fmt</span> <span class="o">=</span> <span class="n">view</span><span class="py">.pass</span><span class="nf">.get_format_hash</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">mesh_debug</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_render_pipeline_for_format</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.view_pipeline</span><span class="p">,</span> <span class="n">fmt</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">camera</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_camera_constants</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.camera</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

    <span class="c1">// setup pass</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.begin_render_pass</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.pass</span><span class="p">);</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_viewport</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.viewport</span><span class="p">);</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_scissor_rect</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.scissor_rect</span><span class="p">);</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_render_pipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mesh_debug</span><span class="p">);</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_constants</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">16</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nn">gfx</span><span class="p">::</span><span class="nf">as_u8_slice</span><span class="p">(</span><span class="n">camera</span><span class="p">));</span>

    <span class="k">for</span> <span class="p">(</span><span class="n">world_matrix</span><span class="p">,</span> <span class="n">mesh</span><span class="p">)</span> <span class="k">in</span> <span class="o">&amp;</span><span class="n">mesh_draw_query</span> <span class="p">{</span>
        <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.push_constants</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">world_matrix</span><span class="na">.0</span><span class="p">);</span>
        <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_index_buffer</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mesh</span><span class="na">.0</span><span class="py">.ib</span><span class="p">);</span>
        <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.set_vertex_buffer</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mesh</span><span class="na">.0</span><span class="py">.vb</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
        <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.draw_indexed_instanced</span><span class="p">(</span><span class="n">mesh</span><span class="na">.0</span><span class="py">.num_indices</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// end / transition / execute</span>
    <span class="n">view</span><span class="py">.cmd_buf</span><span class="nf">.end_render_pass</span><span class="p">();</span>

    <span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>
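
<p>The reason this concurrent dispatch is safe can be shown with plain <code class="language-plaintext highlighter-rouge">std::thread</code>: each view owns its own command list and only reads shared data. <code class="language-plaintext highlighter-rouge">Cmd</code>, <code class="language-plaintext highlighter-rouge">record_view</code> and <code class="language-plaintext highlighter-rouge">dispatch_views</code> are hypothetical stand-ins, not hotline’s API:</p>

```rust
use std::thread;

/// Stand-in for recorded GPU commands.
#[derive(Debug, PartialEq)]
pub enum Cmd {
    SetPipeline(&'static str),
    Draw(u32),
}

/// Record one view's command buffer from read-only shared data.
pub fn record_view(pipeline: &'static str, meshes: &'static [u32]) -> Vec<Cmd> {
    let mut cmds = vec![Cmd::SetPipeline(pipeline)];
    for &num_indices in meshes {
        cmds.push(Cmd::Draw(num_indices));
    }
    cmds
}

/// Dispatch each view on its own thread; safe because every thread owns
/// its own buffer and only reads the shared mesh list.
pub fn dispatch_views(views: &[&'static str], meshes: &'static [u32]) -> Vec<Vec<Cmd>> {
    let handles: Vec<_> = views
        .iter()
        .map(|&p| thread::spawn(move || record_view(p, meshes)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```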

<p>Resource transitions are an important part of modern graphics APIs and I am aiming to make them as smooth as possible. In the <code class="language-plaintext highlighter-rouge">pmfx</code> file you can provide <code class="language-plaintext highlighter-rouge">render_graphs</code>, which automatically insert transitions based on state tracking. As mentioned above, all of the view render functions are dispatched concurrently on the CPU, but on the GPU they are executed in a specific order based on the render graph’s dependencies, with appropriate transitions inserted in between. This is still quite bare bones because I am not doing anything overly complicated yet, but I expect this aspect of <code class="language-plaintext highlighter-rouge">pmfx</code> to require a lot more attention as the project progresses. <code class="language-plaintext highlighter-rouge">pmfx</code> also provides the ability to insert resolves for MSAA resources, which are like a special kind of transition.</p>
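
<p>Ordering passes by their <code class="language-plaintext highlighter-rouge">depends_on</code> entries boils down to a topological sort; a minimal, simplified sketch of the idea (not hotline’s implementation):</p>

```rust
use std::collections::HashMap;

/// Order passes so that every pass runs after all of its `depends_on`
/// entries (Kahn-style topological sort over pass names).
pub fn schedule(passes: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    // indegree = number of unscheduled dependencies per pass
    let mut indegree: HashMap<&str, usize> =
        passes.keys().map(|&k| (k, passes[k].len())).collect();
    let mut ordered: Vec<String> = Vec::new();
    while ordered.len() < passes.len() {
        // pick any unscheduled pass whose dependencies are all scheduled
        let next = *indegree
            .iter()
            .filter(|(k, _)| !ordered.contains(&k.to_string()))
            .find(|(_, &d)| d == 0)
            .expect("cycle in render graph")
            .0;
        ordered.push(next.to_string());
        // scheduling `next` satisfies one dependency of each dependant
        for (&k, deps) in passes {
            if deps.contains(&next) {
                *indegree.get_mut(k).unwrap() -= 1;
            }
        }
    }
    ordered
}
```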

<h2 id="challenges">Challenges</h2>

<p>This leg of the project has been by far the most challenging. I started to hit more difficulties with memory ownership than I had up to this point, and the addition of <code class="language-plaintext highlighter-rouge">bevy_ecs</code> and multi-threading introduced new scenarios to handle. Mostly the difficulties are to do with ownership, borrowing and mutability. Sometimes the borrow checker can be brutal, and a small refactoring task can send you down a wormhole you didn’t expect.</p>

<h3 id="refactoring">Refactoring</h3>

<p>Refactoring in general I have found more difficult at times in Rust than in any other language. I tend to start things quite quickly and get something working; this typically means creating separate objects on the stack inside <code class="language-plaintext highlighter-rouge">main</code>, which keeps the data laid out in a way that avoids overlapping mutability and borrowing issues.</p>

<p>When performing what initially seems like a simple refactor, to bring the code more in line with your mental model, you can hit a load of borrow checker errors and the task turns out to be more challenging than you thought. The mutual exclusion property of mutable references, or simply trying to move something to a thread, can trip you up; some types can’t be <code class="language-plaintext highlighter-rouge">Send</code>, which means you have to re-think how your data is grouped together or how it is synchronised across threads.</p>
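
<p>A tiny example of the <code class="language-plaintext highlighter-rouge">Send</code> case: a grouping built around <code class="language-plaintext highlighter-rouge">Rc</code> compiles fine single-threaded but refuses to cross a <code class="language-plaintext highlighter-rouge">thread::spawn</code> boundary, and switching to <code class="language-plaintext highlighter-rouge">Arc</code> is the usual fix. The function below is purely illustrative:</p>

```rust
use std::sync::Arc;
use std::thread;

// An Rc-based version of `shared` would be rejected here, because
// `Rc<T>` is !Send:
//   let shared = std::rc::Rc::new(data);
//   thread::spawn(move || shared.len()); // error: cannot be sent between threads

/// Swapping to Arc makes the same grouping Send, at the cost of
/// atomic reference counting.
pub fn shared_len_on_thread(data: Vec<i32>) -> usize {
    let shared = Arc::new(data);
    let for_thread = Arc::clone(&shared);
    thread::spawn(move || for_thread.len()).join().unwrap()
}
```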

<h3 id="ownership">Ownership</h3>

<p>Until this point I had mostly been dealing with objects on the stack that had been quite easy to either move or pass as reference through the call stack. Here are some examples of more complicated memory ownership:</p>

<h4 id="basic-lifetimes">Basic Lifetimes</h4>

<p>I have a couple of places where I am using lifetimes, but I have tried to steer away from them as much as possible. The main place I use them is when passing <code class="language-plaintext highlighter-rouge">info</code> structures to create resources from a backend module.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// Information to create a pipeline through `Device::create_render_pipeline`. where the shaders will be visible in the current stack</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">RenderPipelineInfo</span><span class="o">&lt;</span><span class="nv">'stack</span><span class="p">,</span> <span class="n">D</span><span class="p">:</span> <span class="n">Device</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="cd">/// Vertex Shader</span>
    <span class="k">pub</span> <span class="n">vs</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="nv">'stack</span> <span class="nn">D</span><span class="p">::</span><span class="n">Shader</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="cd">/// Fragment Shader</span>
    <span class="k">pub</span> <span class="n">fs</span><span class="p">:</span> <span class="nb">Option</span><span class="o">&lt;&amp;</span><span class="nv">'stack</span> <span class="nn">D</span><span class="p">::</span><span class="n">Shader</span><span class="o">&gt;</span><span class="p">,</span>

    <span class="c1">// ..</span>
<span class="p">}</span>

<span class="cd">/// The shader lifetime lasts long enough to pass to `Device::create_render_pipeline`</span>
<span class="k">let</span> <span class="n">vsc_info</span> <span class="o">=</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">ShaderInfo</span> <span class="p">{</span>
    <span class="n">shader_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ShaderType</span><span class="p">::</span><span class="n">Vertex</span><span class="p">,</span>
    <span class="n">compile_info</span><span class="p">:</span> <span class="nb">None</span>
<span class="p">};</span>
<span class="k">let</span> <span class="n">vs</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_shader</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vsc_info</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">vsc_data</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

<span class="k">let</span> <span class="n">psc_info</span> <span class="o">=</span> <span class="nn">gfx</span><span class="p">::</span><span class="n">ShaderInfo</span> <span class="p">{</span>
    <span class="n">shader_type</span><span class="p">:</span> <span class="nn">gfx</span><span class="p">::</span><span class="nn">ShaderType</span><span class="p">::</span><span class="n">Fragment</span><span class="p">,</span>
    <span class="n">compile_info</span><span class="p">:</span> <span class="nb">None</span>
<span class="p">};</span>
<span class="k">let</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_shader</span><span class="p">(</span><span class="o">&amp;</span><span class="n">psc_info</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">psc_data</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

<span class="k">let</span> <span class="n">pso</span> <span class="o">=</span> <span class="n">device</span><span class="nf">.create_render_pipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="nn">gfx</span><span class="p">::</span><span class="n">RenderPipelineInfo</span> <span class="p">{</span>
    <span class="n">vs</span><span class="p">:</span> <span class="nf">Some</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vs</span><span class="p">),</span>
    <span class="n">fs</span><span class="p">:</span> <span class="nf">Some</span><span class="p">(</span><span class="o">&amp;</span><span class="n">fs</span><span class="p">),</span>

    <span class="c1">// ..</span>
<span class="p">})</span><span class="o">?</span><span class="p">;</span>
</code></pre></div></div>

<p>I have considered moving the objects that need lifetimes out of the structure and passing them to the function instead, so that lifetimes are not needed, but that means losing the ability to provide <code class="language-plaintext highlighter-rouge">defaults</code>, so I’m not sure. There is also a situation to handle when passing a resource into a command buffer for use by the GPU, because the resource can be dropped before it is used, but I will cover that later.</p>
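
<p>One common way to handle that situation (an assumption here, not necessarily what hotline ends up doing) is for the command buffer to retain a reference-counted clone of anything it binds, releasing it once the GPU has finished. <code class="language-plaintext highlighter-rouge">Texture</code> and <code class="language-plaintext highlighter-rouge">CmdBuf</code> below are hypothetical stand-ins:</p>

```rust
use std::sync::Arc;

/// Hypothetical GPU resource; not hotline's actual gfx types.
pub struct Texture {
    pub id: u32,
}

/// A command buffer that clones the Arc of anything it binds, so the
/// resource stays alive even if the caller drops its handle before
/// the GPU has executed the commands.
#[derive(Default)]
pub struct CmdBuf {
    pub in_flight: Vec<Arc<Texture>>,
}

impl CmdBuf {
    pub fn bind_texture(&mut self, tex: &Arc<Texture>) {
        self.in_flight.push(Arc::clone(tex));
    }

    /// Call once the GPU fence for this buffer has signalled.
    pub fn reset(&mut self) {
        self.in_flight.clear();
    }
}
```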

<h4 id="overlapping-mutability">Overlapping Mutability</h4>

<p>Overlapping mutability has been tricky to get around at times. That is, taking 2 mutable references to data that overlaps. It’s interesting to have to tackle this because it’s not something you need to think about in C or C++, yet it is happening all the time: <a href="https://en.wikipedia.org/wiki/Load-Hit-Store">load-hit-stores</a> occur when memory is aliased by two pointers passed as function arguments, because they cannot be guaranteed to be different at compile time. Using <code class="language-plaintext highlighter-rouge">restrict</code> was something I did in the past when thinking about performance, but it’s not really something I think all that much about these days because the memory aliasing concept is quite abstracted. Rust forbids aliased mutable references outright, and as a result you can end up in difficult situations when grouping data in different ways.</p>
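
<p>Rust does allow simultaneous mutable borrows when it can prove they are disjoint, which is often the escape hatch: borrow individual fields rather than the whole struct, or use <code class="language-plaintext highlighter-rouge">split_at_mut</code> for slices. A hypothetical <code class="language-plaintext highlighter-rouge">Ctx</code> type for illustration:</p>

```rust
/// Two mutable borrows of the same struct are rejected, but borrows of
/// disjoint fields are fine: the compiler can prove they don't alias.
pub struct Ctx {
    pub frame: u64,
    pub log: Vec<String>,
}

pub fn tick(c: &mut Ctx) {
    // field-level split: one &mut each to `frame` and `log` at once
    let frame = &mut c.frame;
    let log = &mut c.log;
    *frame += 1;
    log.push(format!("frame {}", frame));
}

/// For slices, `split_at_mut` hands out two non-overlapping &mut halves.
pub fn mark_halves(data: &mut [i32]) {
    let mid = data.len() / 2;
    let (a, b) = data.split_at_mut(mid);
    a.iter_mut().for_each(|x| *x = 0);
    b.iter_mut().for_each(|x| *x = -1);
}
```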

<p>It’s a natural instinct to want to bundle things together; perhaps a C background with context passing has got me leaning this way. But in <code class="language-plaintext highlighter-rouge">hotline</code> one of the more difficult scenarios I hit was when creating the <code class="language-plaintext highlighter-rouge">Client</code>. It felt natural to me that the <code class="language-plaintext highlighter-rouge">Client</code> could bundle together some common core functionality and be passed around between plugins. But trouble arises when you need to borrow 2 members at the same time.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// call plugin ui functions</span>
<span class="k">for</span> <span class="n">plugin</span> <span class="k">in</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.plugins</span> <span class="p">{</span>
    <span class="c1">// ..</span>

    <span class="k">self</span> <span class="o">=</span> <span class="nf">ui_fn</span><span class="p">(</span><span class="k">self</span><span class="p">,</span> <span class="n">plugin</span><span class="py">.instance</span><span class="p">,</span> <span class="n">imgui_ctx</span><span class="p">);</span> <span class="c1">// cannot move self because it is borrowed as mutable (`for plugin in &amp;mut self.plugins`)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was able to work around this particular instance by moving the plugins into another vector, but I then had to separate out which members were part of the <code class="language-plaintext highlighter-rouge">Plugin</code> so that the <code class="language-plaintext highlighter-rouge">libs</code> could still be accessed inside plugin functions, allowing them to find functions to call.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// take the plugin mem so we can decouple the shared mutability between client and plugins</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">plugins</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">take</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.plugins</span><span class="p">);</span>

<span class="c1">// call plugin ui functions</span>
<span class="k">for</span> <span class="n">plugin</span> <span class="k">in</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">plugins</span> <span class="p">{</span>
    <span class="c1">// ..</span>

    <span class="k">self</span> <span class="o">=</span> <span class="nf">ui_fn</span><span class="p">(</span><span class="k">self</span><span class="p">,</span> <span class="n">plugin</span><span class="py">.instance</span><span class="p">,</span> <span class="n">imgui_ctx</span><span class="p">);</span> <span class="c1">//now we can move self </span>
<span class="p">}</span>
</code></pre></div></div>

<p>This illustrates to me how data ownership and grouping is quite a different beast in Rust to what I am used to.</p>

<h4 id="iterator-consumers">Iterator Consumers</h4>

<p>To separate out mutability I found myself breaking algorithms apart: finding the data that needs to be mutated in one pass, gathering the results, then iterating over those results to mutate in a second pass.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// iterate over `pmfx_tracking` to check for changes, and reload data</span>
<span class="k">for</span> <span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">tracking</span><span class="p">)</span> <span class="k">in</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.pmfx_tracking</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">mtime</span> <span class="o">=</span> <span class="nn">fs</span><span class="p">::</span><span class="nf">metadata</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tracking</span><span class="py">.filepath</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.modified</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="k">if</span> <span class="n">mtime</span> <span class="o">&gt;</span> <span class="n">tracking</span><span class="py">.modified_time</span> <span class="p">{</span>
        
        <span class="c1">// perform a reload</span>
        <span class="k">self</span><span class="py">.shaders</span><span class="nf">.remove</span><span class="p">(</span><span class="n">shader</span><span class="p">);</span> <span class="cd">//!! this is not possible as `self` is already borrowed (`for (_, tracking) in &amp;mut self.pmfx_tracking`)</span>
        <span class="c1">// ..</span>

        <span class="c1">// update modified time</span>
        <span class="n">tracking</span><span class="py">.modified_time</span> <span class="o">=</span> <span class="nn">fs</span><span class="p">::</span><span class="nf">metadata</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tracking</span><span class="py">.filepath</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.modified</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>My instinct initially went for imperative style loops, but since then I have started to adopt the iterator patterns using <code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">map</code>, <code class="language-plaintext highlighter-rouge">fold</code>, and <code class="language-plaintext highlighter-rouge">collect</code>. This means a mutable collection built up by insertion can be replaced with an immutable collection produced directly.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// first collect paths that need reloading</span>
<span class="k">let</span> <span class="n">reload_paths</span> <span class="o">=</span> <span class="k">self</span><span class="py">.pmfx_tracking</span><span class="nf">.iter_mut</span><span class="p">()</span><span class="nf">.filter</span><span class="p">(|(</span><span class="n">_</span><span class="p">,</span> <span class="n">tracking</span><span class="p">)|</span> <span class="p">{</span>
    <span class="nn">fs</span><span class="p">::</span><span class="nf">metadata</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tracking</span><span class="py">.filepath</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.modified</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span> <span class="o">&gt;</span> <span class="n">tracking</span><span class="py">.modified_time</span>
<span class="p">})</span><span class="nf">.map</span><span class="p">(|</span><span class="n">tracking</span><span class="p">|</span> <span class="p">{</span>
    <span class="n">tracking</span><span class="na">.1</span><span class="py">.filepath</span><span class="nf">.to_string_lossy</span><span class="p">()</span><span class="nf">.to_string</span><span class="p">()</span>
<span class="p">})</span><span class="py">.collect</span><span class="p">::</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;&gt;</span><span class="p">();</span>

<span class="c1">// iterate over the paths we want to reload</span>
<span class="k">for</span> <span class="n">reload_filepath</span> <span class="k">in</span> <span class="n">reload_paths</span> <span class="p">{</span>
    <span class="k">if</span> <span class="o">!</span><span class="n">reload_filepath</span><span class="nf">.is_empty</span><span class="p">()</span> <span class="p">{</span>

        <span class="c1">// repeat similarly inside, collecting resources that need updating first</span>

        <span class="c1">// find textures that need reloading</span>
        <span class="k">let</span> <span class="n">reload_textures</span> <span class="o">=</span> <span class="k">self</span><span class="py">.textures</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.filter</span><span class="p">(|(</span><span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">)|</span> <span class="p">{</span>
            <span class="k">self</span><span class="py">.pmfx.textures</span><span class="nf">.get</span><span class="p">(</span><span class="o">*</span><span class="n">k</span><span class="p">)</span><span class="nf">.map_or_else</span><span class="p">(||</span> <span class="k">false</span><span class="p">,</span> <span class="p">|</span><span class="n">src</span><span class="p">|</span> <span class="p">{</span>
                <span class="n">src</span><span class="py">.hash</span> <span class="o">!=</span> <span class="n">v</span><span class="na">.0</span>
            <span class="p">})</span>
        <span class="p">})</span><span class="nf">.map</span><span class="p">(|(</span><span class="n">k</span><span class="p">,</span> <span class="n">_</span><span class="p">)|</span> <span class="p">{</span>
            <span class="n">k</span><span class="nf">.to_string</span><span class="p">()</span>
        <span class="p">})</span><span class="py">.collect</span><span class="p">::</span><span class="o">&lt;</span><span class="n">HashSet</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;&gt;</span><span class="p">();</span>

        <span class="c1">// ..</span>

        <span class="c1">// reloading outside of any iterator tied to self (self here is mutable)</span>
        <span class="k">self</span><span class="nf">.recreate_textures</span><span class="p">(</span><span class="n">device</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">reload_textures</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I have started to think in this more functional style a little, but it still feels slightly alien to me when I could achieve the same thing with a simple loop.</p>
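<p>As a sketch of what I mean, the same reload set could be gathered with a plain loop. The types here are simplified stand-ins for the texture bookkeeping (a map of texture name to hash on each side), not hotline's actual structures:</p>

```rust
use std::collections::{HashMap, HashSet};

// `tracked` maps a texture name to its last-known hash, `sources` maps the
// same name to the current hash from the config (illustrative stand-ins)
fn find_reloads(
    tracked: &HashMap<String, u64>,
    sources: &HashMap<String, u64>,
) -> HashSet<String> {
    let mut reload_textures = HashSet::new();
    for (name, hash) in tracked {
        // reload when a source entry exists and its hash has changed
        if let Some(src_hash) = sources.get(name) {
            if src_hash != hash {
                reload_textures.insert(name.to_string());
            }
        }
    }
    reload_textures
}
```

Both forms do the same work; the iterator version just expresses the filter and the collection in one expression.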

<h4 id="moves">Moves</h4>

<p>For the plugin libs, and for <code class="language-plaintext highlighter-rouge">bevy_ecs</code> in particular, I need to pass the hotline modules as resources to the <code class="language-plaintext highlighter-rouge">ecs</code> systems. In the end I settled on moving the entire hotline <code class="language-plaintext highlighter-rouge">Client</code> into the plugin functions, then into the ecs <code class="language-plaintext highlighter-rouge">World</code> and back out again. The core hotline modules also need to be wrapped up as a <code class="language-plaintext highlighter-rouge">Resource</code> for <code class="language-plaintext highlighter-rouge">bevy_ecs</code>.</p>

<p>This feels quite nice in a way: each plugin has full ownership of <code class="language-plaintext highlighter-rouge">hotline</code> and can do what it likes, which makes it possible to do things like asynchronous system updates through the bevy <code class="language-plaintext highlighter-rouge">Scheduler</code> and let it have full control.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// move hotline resource into world</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="n">session_info</span><span class="p">);</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="nf">DeviceRes</span><span class="p">(</span><span class="n">client</span><span class="py">.device</span><span class="p">));</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="nf">AppRes</span><span class="p">(</span><span class="n">client</span><span class="py">.app</span><span class="p">));</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="nf">MainWindowRes</span><span class="p">(</span><span class="n">client</span><span class="py">.main_window</span><span class="p">));</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="nf">PmfxRes</span><span class="p">(</span><span class="n">client</span><span class="py">.pmfx</span><span class="p">));</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="nf">ImDrawRes</span><span class="p">(</span><span class="n">client</span><span class="py">.imdraw</span><span class="p">));</span>
<span class="k">self</span><span class="py">.world</span><span class="nf">.insert_resource</span><span class="p">(</span><span class="nf">UserConfigRes</span><span class="p">(</span><span class="n">client</span><span class="py">.user_config</span><span class="p">));</span>

<span class="c1">// update systems</span>
<span class="k">self</span><span class="py">.schedule</span><span class="nf">.run</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.world</span><span class="p">);</span>

<span class="c1">// move resources back out</span>
<span class="n">client</span><span class="py">.device</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">DeviceRes</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="na">.0</span><span class="p">;</span>
<span class="n">client</span><span class="py">.app</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">AppRes</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="na">.0</span><span class="p">;</span>
<span class="n">client</span><span class="py">.main_window</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">MainWindowRes</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="na">.0</span><span class="p">;</span>
<span class="n">client</span><span class="py">.pmfx</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="na">.0</span><span class="p">;</span>
<span class="n">client</span><span class="py">.imdraw</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">ImDrawRes</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="na">.0</span><span class="p">;</span>
<span class="n">client</span><span class="py">.user_config</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">UserConfigRes</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="na">.0</span><span class="p">;</span>
<span class="k">self</span><span class="py">.session_info</span> <span class="o">=</span> <span class="k">self</span><span class="py">.world.remove_resource</span><span class="p">::</span><span class="o">&lt;</span><span class="n">SessionInfo</span><span class="o">&gt;</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>
</code></pre></div></div>

<p>It requires a small amount of hokey-cokey to do so, which is also a little strange, but I kind of like it. I had to break out the different modules inside the client to avoid overlapping mutability when using the <code class="language-plaintext highlighter-rouge">SwapChain</code>, <code class="language-plaintext highlighter-rouge">Device</code> and <code class="language-plaintext highlighter-rouge">CmdBuf</code>.</p>
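<p>Breaking the modules out works because the borrow checker allows disjoint mutable borrows of individual fields. A minimal sketch with hypothetical stand-in types (not hotline's real fields):</p>

```rust
// simplified stand-ins for the hotline modules
struct Device { frame: u64 }
struct SwapChain { backbuffer: u32 }
struct CmdBuf { commands: Vec<String> }

struct Client {
    device: Device,
    swap_chain: SwapChain,
    cmd_buf: CmdBuf,
}

fn render(client: &mut Client) {
    // borrowing separate fields mutably at the same time is fine;
    // routing everything through methods that each take &mut self
    // on a monolithic type would conflict
    let device = &mut client.device;
    let swap = &mut client.swap_chain;
    let cmds = &mut client.cmd_buf;

    device.frame += 1;
    swap.backbuffer = (device.frame % 2) as u32;
    cmds.commands.push(format!("present backbuffer {}", swap.backbuffer));
}
```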

<h4 id="arcs">Arcs</h4>

<p>I am aware I could wrap everything in an <code class="language-plaintext highlighter-rouge">Arc</code> to get interior mutability, which might remove the need for the moves, but I decided to use them only where necessary: I use <code class="language-plaintext highlighter-rouge">Arc</code> and <code class="language-plaintext highlighter-rouge">Mutex</code> anywhere inter-thread synchronisation is needed. I quite like lockless data structures in C, and I will take a look at <a href="https://tokio.rs">tokio</a> when I get a chance, but for now I went with a heavy-handed approach in a few places just to get the program structured how I would like. I have a <code class="language-plaintext highlighter-rouge">Reloader</code> and <code class="language-plaintext highlighter-rouge">ReloadResponder</code> setup that watches files, flags when changes have occurred, triggers a rebuild and reloads. The <code class="language-plaintext highlighter-rouge">ReloadResponder</code> is also the only place using <code class="language-plaintext highlighter-rouge">dyn</code> dispatch; there is still more work to do in that area, as I struggled trying to achieve the kind of polymorphic behaviour I would implement in C++.</p>
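<p>The general shape looks something like the following; this is a hedged sketch, and the real trait and members in hotline differ:</p>

```rust
use std::sync::{Arc, Mutex};

// illustrative responder trait; the real hotline trait has more to it
trait ReloadResponder: Send {
    fn build(&mut self) -> bool;
}

struct ShaderResponder {
    builds: u32,
}

impl ReloadResponder for ShaderResponder {
    fn build(&mut self) -> bool {
        self.builds += 1;
        true
    }
}

struct Reloader {
    // dyn dispatch lets one Reloader drive different responder types, and
    // Arc<Mutex<..>> shares the responder with a file-watcher thread
    responder: Arc<Mutex<dyn ReloadResponder>>,
}

impl Reloader {
    fn file_changed(&self) -> bool {
        self.responder.lock().unwrap().build()
    }
}
```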

<p>Another place using an <code class="language-plaintext highlighter-rouge">Arc</code> is in <code class="language-plaintext highlighter-rouge">pmfx</code>, because <code class="language-plaintext highlighter-rouge">Views</code> need to be mutable in render functions and they live inside a <code class="language-plaintext highlighter-rouge">HashMap</code>, so it's not possible to borrow mutably from <code class="language-plaintext highlighter-rouge">pmfx</code> itself without interior mutability. This led me to an <code class="language-plaintext highlighter-rouge">Arc</code>. A move should also be viable, because only a single render system will ever write to a single view, so requiring an <code class="language-plaintext highlighter-rouge">Arc</code> does feel unnecessary here, but working with <code class="language-plaintext highlighter-rouge">bevy_ecs</code> systems forced my hand slightly. A <code class="language-plaintext highlighter-rouge">RefCell</code> might also be a better option in this scenario.</p>
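<p>The shape of the problem is roughly this (a simplified sketch; pmfx's real <code class="language-plaintext highlighter-rouge">View</code> type holds much more): the map is accessed through a shared borrow, and cloning the <code class="language-plaintext highlighter-rouge">Arc</code> out of it gives one render system mutable access to one view through the <code class="language-plaintext highlighter-rouge">Mutex</code>:</p>

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// simplified stand-in for a pmfx view
struct View {
    draw_calls: u32,
}

struct Pmfx {
    views: HashMap<String, Arc<Mutex<View>>>,
}

impl Pmfx {
    // note &self, not &mut self: handing out a cloned Arc avoids needing
    // a mutable borrow of the whole Pmfx just to mutate one view
    fn get_view(&self, name: &str) -> Option<Arc<Mutex<View>>> {
        self.views.get(name).cloned()
    }
}
```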

<h4 id="in-flight-gpu-resources">In-Flight GPU Resources</h4>

<p>Within a multi-buffered GPU rendering system, while the CPU is building command buffers for the current frame, a previous frame is being executed concurrently on the GPU. This introduces an issue: if we <code class="language-plaintext highlighter-rouge">drop</code> a resource on the CPU side while it is still in use on the GPU, we get a D3D12 validation error, which can lead to a device removal. I first encountered this issue with textures used for videos in the <code class="language-plaintext highlighter-rouge">av</code> API, so I added a <code class="language-plaintext highlighter-rouge">destroy_texture</code> function that passes ownership of the texture to the GPU <code class="language-plaintext highlighter-rouge">Device</code>, and a <code class="language-plaintext highlighter-rouge">cleanup_resources</code> function checks that resources are no longer in use before <code class="language-plaintext highlighter-rouge">dropping</code> them. This goes a little against Rust's memory model, with the need to explicitly <code class="language-plaintext highlighter-rouge">drop</code> at the right time.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// swap the texture to None, and pass ownership of texture to the device. Where it will be cleaned up safely</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">none_tex</span> <span class="o">=</span> <span class="nb">None</span><span class="p">;</span>
<span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">swap</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">none_tex</span><span class="p">,</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.texture</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">(</span><span class="n">tex</span><span class="p">)</span> <span class="o">=</span> <span class="n">none_tex</span> <span class="p">{</span>
    <span class="n">device</span><span class="nf">.destroy_texture</span><span class="p">(</span><span class="n">tex</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
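<p>The idea behind <code class="language-plaintext highlighter-rouge">destroy_texture</code> / <code class="language-plaintext highlighter-rouge">cleanup_resources</code> can be sketched like this. It is a simplified model, assuming a fixed number of buffered frames and a frame counter standing in for the real fence tracking:</p>

```rust
// number of frames the GPU may still be reading a resource (assumed)
const NUM_BUFFERED_FRAMES: u64 = 2;

struct Texture {
    id: u32,
}

struct Device {
    frame: u64,
    // textures waiting to be dropped, tagged with the frame they were retired
    zombie_textures: Vec<(u64, Texture)>,
}

impl Device {
    fn destroy_texture(&mut self, tex: Texture) {
        // takes ownership; the actual drop is deferred
        self.zombie_textures.push((self.frame, tex));
    }

    fn cleanup_resources(&mut self) {
        let frame = self.frame;
        // drop only textures whose last possible GPU use has completed;
        // retain keeps the ones that might still be in flight
        self.zombie_textures
            .retain(|(retired, _)| frame < *retired + NUM_BUFFERED_FRAMES);
    }
}
```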

<p>In some places where full reloads take place there is a useful function on the <code class="language-plaintext highlighter-rouge">SwapChain</code> which waits for the last submitted frame to complete on the GPU; after that, any <code class="language-plaintext highlighter-rouge">drops</code> are guaranteed to be safe before any new frames are submitted.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// check if we have any reloads available</span>
<span class="k">if</span> <span class="k">self</span><span class="py">.reloader</span><span class="nf">.check_for_reload</span><span class="p">()</span> <span class="o">==</span> <span class="nn">ReloadState</span><span class="p">::</span><span class="n">Available</span> <span class="p">{</span>
    <span class="c1">// wait for last GPU frame so we can drop the resources</span>
    <span class="n">swap_chain</span><span class="nf">.wait_for_last_frame</span><span class="p">();</span>
    <span class="k">self</span><span class="nf">.reload</span><span class="p">(</span><span class="n">device</span><span class="p">);</span>
    <span class="k">self</span><span class="py">.reloader</span><span class="nf">.complete_reload</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This solution is much nicer because the <code class="language-plaintext highlighter-rouge">drop</code> can just happen naturally. It might not be possible or desired to perform this hard sync with the GPU, so in future I expect to have to use the <code class="language-plaintext highlighter-rouge">destroy</code> functions more (and add them for different resource types).</p>

<h3 id="build-times--linker-issues--debugging">Build Times / Linker Issues / Debugging</h3>

<p>Build times are currently the biggest problem; a <code class="language-plaintext highlighter-rouge">plugin</code> takes around 6 seconds to build, with a little extra to complete the reload, so live code editing does not feel hugely responsive. Reloading shaders or render configs is very fast though, which balances it out a bit if you work across code and shaders, and is all the more reason to use more GPU-driven techniques / compute. Build times in full debug builds are much slower, but because of an issue with more than 65535 symbols being exported from a plugin, which is not supported by the MSVC toolchain, I am forced to switch to <code class="language-plaintext highlighter-rouge">O1</code> optimization for debug, and that has similar performance to release.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>= note: LINK : fatal error LNK1189: library limit of 65535 objects exceeded
</code></pre></div></div>
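<p>The workaround is a profile override in <code class="language-plaintext highlighter-rouge">Cargo.toml</code> along these lines (the exact settings may vary; the key part is raising the optimization level of the dev profile so enough code is inlined or eliminated to stay under the symbol limit):</p>

```toml
# raise opt-level in debug builds to keep the exported symbol count of the
# plugin dlls under the MSVC 65535 object limit
[profile.dev]
opt-level = 1
```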

<h4 id="profiling-build-times">Profiling Build Times</h4>

<p>This <a href="https://fasterthanli.me/articles/why-is-my-rust-build-so-slow">post</a> has lots of detailed info about build times. I tried to profile the build with the <code class="language-plaintext highlighter-rouge">cargo build -Z timings</code> option, but it is only available on the <code class="language-plaintext highlighter-rouge">nightly</code> channel. Switching to nightly made it possible to run with the flag, but I couldn't see any output <code class="language-plaintext highlighter-rouge">cargo-timings</code> files; perhaps <code class="language-plaintext highlighter-rouge">-Z timings</code> is not available on Windows? As a result I am shooting in the dark a little here, but I have done some exploration to figure out what might work best.</p>

<h4 id="experimenting-with-build-times">Experimenting With Build Times</h4>

<p>I tried to separate the plugins out over more libs so that the core hotline lib did not have to depend on <code class="language-plaintext highlighter-rouge">bevy_ecs</code>. This didn't make much of a positive difference because it required an additional plugin for shared code. This attempt ended up with 5 build artefacts in total and 13 second build times.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">hotline_rs.dll</code></li>
  <li><code class="language-plaintext highlighter-rouge">client.exe</code></li>
  <li><code class="language-plaintext highlighter-rouge">ecs_base.dll</code></li>
  <li><code class="language-plaintext highlighter-rouge">ecs.dll</code></li>
  <li><code class="language-plaintext highlighter-rouge">ecs_demos.dll</code></li>
</ul>

<p>Each library or executable that requires building adds a noticeable constant cost, which seems to be link time. So reducing the number of libs actually improved build times, and adding more only increased them. I moved the <code class="language-plaintext highlighter-rouge">ecs_base</code> plugin into <code class="language-plaintext highlighter-rouge">hotline</code>, which reduces the number of build artefacts and brings me to 6 second build times. If I build a single lib and executable the build time is around 10 seconds, so the live building is an improvement, if still not where I would like it to be; maybe this can improve in time.</p>

<p>For an end user the desired result is that they would not need to modify the <code class="language-plaintext highlighter-rouge">client</code>, the <code class="language-plaintext highlighter-rouge">hotline</code> lib or the core <code class="language-plaintext highlighter-rouge">ecs</code>, and would only work inside a plugin such as <code class="language-plaintext highlighter-rouge">ecs_demos</code>. This has about a 6 second build time, which is not too bad; however, when working on the core engine itself, care needs to be taken to keep the plugins and the core libs in sync, which means building more artefacts. Building from clean and switching between release and debug also added a cost, so keeping the number of libs down was the way to go.</p>

<h4 id="avoiding-unnecessary-builds">Avoiding Unnecessary Builds</h4>

<p>Because the build times are fairly long I had to ensure that builds were not being triggered when they did not need to be. With shaders inside the main repository, <code class="language-plaintext highlighter-rouge">cargo build</code> thinks that <code class="language-plaintext highlighter-rouge">hotline</code> needs rebuilding when shaders have changed, even though this should not affect any of the libs or executables. This is particularly painful because when modifying a shader the client is able to rebuild and reload the shaders and associated pipelines very quickly, but the next time code is modified in a plugin it causes the plugin to rebuild the main <code class="language-plaintext highlighter-rouge">lib</code> as well, pushing the total build time to about 10 seconds. If the shaders are excluded from the package in <code class="language-plaintext highlighter-rouge">Cargo.toml</code> then the issue does not occur, but the data is necessary to ship to users.</p>

<p>Due to other constraints I ended up moving the data into a separate repository, which mitigates the issue of rebuilding the main library. For now the problem is kept at bay, but it is more work to maintain changes in the data repository, so I need to find a longer-term solution. Even editing the <code class="language-plaintext highlighter-rouge">todo.txt</code> file I have inside the repository causes a cargo build to take a few seconds. I know you can exclude directories from the package, which resolves the issue, but I would also like these things to publish to crates.io.</p>
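<p>For reference, the <code class="language-plaintext highlighter-rouge">Cargo.toml</code> exclusion mentioned above is just a package field along these lines (the paths here are illustrative):</p>

```toml
[package]
# keep data out of cargo's change tracking so editing shaders does not
# dirty the lib; note this also removes the files from the published crate
exclude = ["data/", "todo.txt"]
```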

<h4 id="working-in-plugin-environment">Working in Plugin Environment</h4>

<p>Debugging in general is more difficult in the <code class="language-plaintext highlighter-rouge">plugin</code> environment; if you are attached to the debugger, <code class="language-plaintext highlighter-rouge">plugin</code> rebuilds will fail because the <code class="language-plaintext highlighter-rouge">.pdb</code> is locked. So sometimes it means resorting to <code class="language-plaintext highlighter-rouge">println!</code> debugging when you need to debug the hot reload process itself.</p>

<p>I added support for serialisation of the basic program state, which is synchronised between release and debug builds. It keeps the camera position and the currently selected demos, so even from a full restart you are right back where you left off. This makes those times where something goes terribly wrong, or the times you need to edit the core engine, just a little bit easier.</p>
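<p>As an illustration of the round trip (the real hotline state holds more fields and is serialised differently), a hand-rolled version might look like:</p>

```rust
use std::fmt::Write;

// minimal stand-in for the saved session state
#[derive(Default, PartialEq, Debug, Clone)]
struct SessionInfo {
    active_demo: String,
    camera_pos: [f32; 3],
}

// tiny hand-rolled round trip; real code would likely use serde
fn save(info: &SessionInfo) -> String {
    let mut s = String::new();
    writeln!(s, "{}", info.active_demo).unwrap();
    writeln!(
        s,
        "{} {} {}",
        info.camera_pos[0], info.camera_pos[1], info.camera_pos[2]
    )
    .unwrap();
    s
}

fn load(s: &str) -> SessionInfo {
    let mut lines = s.lines();
    let active_demo = lines.next().unwrap_or("").to_string();
    let mut camera_pos = [0.0f32; 3];
    if let Some(line) = lines.next() {
        for (i, v) in line.split_whitespace().take(3).enumerate() {
            camera_pos[i] = v.parse().unwrap_or(0.0);
        }
    }
    SessionInfo { active_demo, camera_pos }
}
```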

<p>For the convenience that having hot reloaded plugins aims to provide, developing in that environment is quite tricky, so currently it feels like having plugins is an extra burden to carry around. It would be nice to be able to switch between statically linked and dynamically linked plugins and that would also be a good option for a final packaged build of an application, where the hot reloading would not be required. This is something I will look into when I get a chance.</p>

<h3 id="error-handling">Error Handling</h3>

<p>I have spent quite a lot of time handling errors and propagating them in a way that allows the client to continue running gracefully should something go wrong. It's quite easy to get things working quickly by using <code class="language-plaintext highlighter-rouge">unwrap</code> to panic if something is missing, fix the issue, and leave the unwrap there. That's how I tend to like working in other code bases: if something is missing, assert and then fix it before moving on. In certain situations, like a game for instance, missing data should not occur in a final build, so having all of the code to gracefully handle it always felt like extra baggage. But in situations like this code base, and in tools, you need to allow things to go wrong and interactively resolve them.</p>

<p>Luckily Rust is really good at error handling and actively encourages you to do it, even in cases where I would previously hit a panic for a missing shader or some other data with a typo or an incorrect path. When I hit the panic I would know exactly where, and returning <code class="language-plaintext highlighter-rouge">Result</code> from a function allows the use of <code class="language-plaintext highlighter-rouge">?</code>, which significantly improves the readability of the code by reducing the need to unwrap. Here's a bloated, messy initial setup, partly down to not being sure how to handle errors in <code class="language-plaintext highlighter-rouge">bevy_ecs</code> systems.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[no_mangle]</span>
 <span class="k">pub</span> <span class="k">fn</span> <span class="nf">render_meshes</span><span class="p">(</span>
     <span class="n">pmfx</span><span class="p">:</span> <span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Res</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">,</span>
     <span class="n">view_name</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
     <span class="n">mesh_draw_query</span><span class="p">:</span> <span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Query</span><span class="o">&lt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">WorldMatrix</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">MeshComponent</span><span class="p">)</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span>

    <span class="c1">// this is just code needed to get gfx resources and unwrap them to use in command buffer generation</span>
    <span class="k">let</span> <span class="n">arc_view</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_view</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view_name</span><span class="p">);</span>
    <span class="k">if</span> <span class="n">arc_view</span><span class="nf">.is_none</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">let</span> <span class="n">arc_view</span> <span class="o">=</span> <span class="n">arc_view</span><span class="nf">.unwrap</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">view</span> <span class="o">=</span> <span class="n">arc_view</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="k">let</span> <span class="n">fmt</span> <span class="o">=</span> <span class="n">view</span><span class="py">.pass</span><span class="nf">.get_format_hash</span><span class="p">();</span>

    <span class="k">let</span> <span class="n">mesh_debug</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_render_pipeline_for_format</span><span class="p">(</span><span class="s">"mesh_debug"</span><span class="p">,</span> <span class="n">fmt</span><span class="p">);</span>
    <span class="k">if</span> <span class="n">mesh_debug</span><span class="nf">.is_none</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">let</span> <span class="n">mesh_debug</span> <span class="o">=</span> <span class="n">mesh_debug</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="k">let</span> <span class="n">camera</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_camera_constants</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.camera</span><span class="p">);</span>
    <span class="k">if</span> <span class="n">camera</span><span class="nf">.is_none</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">let</span> <span class="n">camera</span> <span class="o">=</span> <span class="n">camera</span><span class="nf">.unwrap</span><span class="p">();</span>

    <span class="c1">// ..</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With proper result propagation it looks much better. I was also able to unwrap and pass the <code class="language-plaintext highlighter-rouge">View</code> into the function instead of fetching it by name, because the function is now called from a closure, which gives a bit more control.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[no_mangle]</span>
 <span class="k">pub</span> <span class="k">fn</span> <span class="nf">render_meshes</span><span class="p">(</span>
     <span class="n">pmfx</span><span class="p">:</span> <span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Res</span><span class="o">&lt;</span><span class="n">PmfxRes</span><span class="o">&gt;</span><span class="p">,</span>
     <span class="n">view</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">View</span><span class="p">,</span>
     <span class="n">mesh_draw_query</span><span class="p">:</span> <span class="nn">bevy_ecs</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="n">Query</span><span class="o">&lt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">WorldMatrix</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">MeshComponent</span><span class="p">)</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Result</span><span class="o">&lt;</span><span class="p">(),</span> <span class="n">Error</span><span class="o">&gt;</span> <span class="p">{</span>

    <span class="k">let</span> <span class="n">fmt</span> <span class="o">=</span> <span class="n">view</span><span class="py">.pass</span><span class="nf">.get_format_hash</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">mesh_debug</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_render_pipeline_for_format</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.view_pipeline</span><span class="p">,</span> <span class="n">fmt</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">camera</span> <span class="o">=</span> <span class="n">pmfx</span><span class="nf">.get_camera_constants</span><span class="p">(</span><span class="o">&amp;</span><span class="n">view</span><span class="py">.camera</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>

    <span class="c1">// ..</span>
<span class="p">}</span>
</code></pre></div></div>
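<p>The closure idea can be sketched roughly like this (the names and error types here are illustrative, not hotline's actual API): the dispatch layer resolves the view by name once and handles the error, so the render function itself just takes the view and can use <code class="language-plaintext highlighter-rouge">?</code> internally:</p>

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// simplified stand-in for a pmfx view
struct View {
    camera: String,
}

// the render system only ever sees a resolved view
fn render_meshes(view: &View) -> Result<(), String> {
    if view.camera.is_empty() {
        return Err("missing camera".to_string());
    }
    // .. build command buffers for the view
    Ok(())
}

// closure-style dispatch: fetch the view by name once, lock it, then
// hand it to the system
fn dispatch(
    views: &HashMap<String, Arc<Mutex<View>>>,
    name: &str,
) -> Result<(), String> {
    let view = views.get(name).ok_or(format!("no view: {}", name))?;
    let view = view.lock().map_err(|_| "poisoned view lock".to_string())?;
    render_meshes(&view)
}
```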

<p>There are still lots of combinations and things to test when it comes to error handling, so I have added some initial tests to catch things that might go wrong. I foresee this as ongoing work, and I need to get in the habit of thinking about it earlier on instead of quickly getting something working and refactoring later.</p>
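<p>For example, a test of this kind just asserts that a missing lookup comes back as an <code class="language-plaintext highlighter-rouge">Err</code> rather than panicking. The lookup function here is a simplified stand-in for the pmfx queries, not the real API:</p>

```rust
use std::collections::HashMap;

// simplified stand-in: pipelines keyed by (name, render target format hash)
fn get_render_pipeline_for_format(
    pipelines: &HashMap<(String, u64), u32>,
    name: &str,
    fmt: u64,
) -> Result<u32, String> {
    pipelines
        .get(&(name.to_string(), fmt))
        .copied()
        .ok_or(format!("missing pipeline: {} for format {:#x}", name, fmt))
}
```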

<h3 id="whats-next">What’s Next?</h3>

<p>I’m pretty happy with the overall program structure, and the stability has been great. I think that’s a good affirmation that all of the hard work playing ball with the borrow checker pays off in the long run. There is still a lot to think about in terms of memory ownership; it is the area I am least certain about in anything I have worked on for a long time. Next up I will start adding lighting and shadows: first I need to add these concepts into the <code class="language-plaintext highlighter-rouge">ecs</code>, and then I plan to work on clustered lighting and virtual shadow maps. There are a few bits of <code class="language-plaintext highlighter-rouge">pmfx</code> I need to add and hook up to make that possible, but in general the graphics side of the engine is really coming along.</p>

<p>I posted about this both on <a href="https://twitter.com/polymonster">twitter</a> and <a href="https://mastodon.gamedev.place/@polymonster">mastodon</a>, I was keen to move to mastodon but still finding much more engagement on twitter. Give me a follow if you’re interested and check out the <a href="https://github.com/polymonster/hotline">GitHub</a> or <a href="https://crates.io/crates/hotline-rs">crates.io</a> page.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The most challenging leg of the hotline Rust graphics engine: plugins, multi-threaded command buffer generation, and hot reloading for Rust code, HLSL shaders, and render configs.]]></summary></entry><entry><title type="html">Building a gamedev maths library in Rust from scratch</title><link href="https://polymonster.co.uk/blog/maths-rs" rel="alternate" type="text/html" title="Building a gamedev maths library in Rust from scratch" /><published>2022-08-27T00:00:00+00:00</published><updated>2022-08-27T00:00:00+00:00</updated><id>https://polymonster.co.uk/blog/maths-rs</id><content type="html" xml:base="https://polymonster.co.uk/blog/maths-rs"><![CDATA[<p>I have just finished up a linear algebra maths library in Rust and it’s available on <a href="https://crates.io/crates/maths-rs">crates.io</a>. It contains the usual implementation of vectors, matrices and quaternions but also tons of useful intersection, distance functions, point tests, graphs, utility functions and ergonomic decisions to hopefully make this fun and nimble to use for gamedev and graphics coding. I have been spending small chunks of time writing functions and tests over the summer, it has been quite enjoyable and I have learned a lot more about the Rust programming language, especially going into more detail with traits and trait bounds than I have previously, and also my first real work with macros.</p>

<p>There are already many other Rust maths libraries available on GitHub or Crates.io. This website <a href="https://arewegameyet.rs/ecosystem/math/">are we game yet</a> has a list of gamedev libraries for Rust. I tried a few of them, such as <a href="https://crates.io/crates/cgmath">cgmath</a> and <a href="https://crates.io/crates/nalgebra">nalgebra</a> in my graphics engine for a short while but I ended up wanting more and being interested in how I would implement one myself. A lot of maths libraries out there; for all the languages you can think of, usually implement vectors, matrices and quaternions. What is less common is a comprehensive collection of intersection tests, distance and utility functions… in fact SIMD support is probably more common in a maths library than a ray triangle intersection function. I had already been through this process and, over a number of years, had accumulated a decent set of functionality in my C++ <a href="https://github.com/polymonster/maths">library</a>, so I decided to essentially port that functionality to Rust. My initial plan was to use an existing Rust library for the vector, matrix and quaternion implementations and then just implement the intersection and utility functions, but after a while I wanted to make changes to try and get my Rust library stylistically closer to the C++ one. At first I wasn’t sure if what I wanted would be possible, but in the end I am happy with the results. You can take a look at the full <a href="https://crates.io/crates/maths-rs">documentation</a> or <a href="https://github.com/polymonster/maths-rs">readme</a>, which give a detailed overview of the feature set.</p>

<h2 id="c-maths-library">C++ Maths Library</h2>

<p>As a gamedev with a strong focus on graphics, a maths library is an essential tool to have in your toolbox. Since I started coding and working on games I have slowly built up a maths library in C++; at this point it has had many changes and I have accumulated a lot of functions over the years from different sources: books, websites, and blogs. One of the biggest influences was this blog <a href="https://www.reedbeta.com/blog/on-vector-math-libraries/">post</a> by Nathan Reed. He outlined how to make a vector library that is templated by both type and size to allow n-dimensional vectors while only needing to implement one set of functions for the entire thing. Prior to this I had never been a huge fan of templates, due to the escalation of complexity once you start nesting and combining them, and the unreadable, hard-to-follow C++ standard library in general. But here was a really concrete example of something a little more complex than a container class which added value. I adopted this approach and started to enjoy template meta-programming. I went to great lengths to implement these <a href="https://github.com/polymonster/maths/blob/master/swizzle.h">swizzles</a>, which look and feel just like writing maths code in a shader.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vec4f</span> <span class="n">swizz</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">wzyx</span><span class="p">;</span>       <span class="c1">// construct from swizzle</span>
<span class="n">swizz</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">xxxx</span><span class="p">;</span>             <span class="c1">// assign from swizzle</span>
<span class="n">swizz</span><span class="p">.</span><span class="n">wyxz</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">xxyy</span><span class="p">;</span>        <span class="c1">// assign swizzle to swizzle</span>
<span class="n">vec2f</span> <span class="n">v2</span> <span class="o">=</span> <span class="n">swizz</span><span class="p">.</span><span class="n">yz</span><span class="p">;</span>        <span class="c1">// construct truncated</span>
<span class="n">swizz</span><span class="p">.</span><span class="n">wx</span> <span class="o">=</span> <span class="n">v</span><span class="p">.</span><span class="n">xy</span><span class="p">;</span>            <span class="c1">// assign truncated</span>
<span class="n">swizz</span><span class="p">.</span><span class="n">xyz</span> <span class="o">*=</span> <span class="n">swizz</span><span class="p">.</span><span class="n">www</span><span class="p">;</span>     <span class="c1">// arithmetic on swizzles</span>
<span class="n">v2</span> <span class="o">=</span> <span class="n">swizz</span><span class="p">.</span><span class="n">xy</span> <span class="o">*</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">;</span>        <span class="c1">// swizzle / scalar arithmetic</span>
</code></pre></div></div>

<h2 id="shader-code">Shader Code</h2>

<p>For writing maths code, shaders are somewhat of a gold standard to me; functions such as <code class="language-plaintext highlighter-rouge">dot</code> and <code class="language-plaintext highlighter-rouge">cross</code> are built in and the maths code is just part of the language. With my C++ library I wanted the same feeling, so you can do stuff like this, with function overloads and the ability to call the same functions on different sized vectors and scalars:</p>

<div class="language-hlsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">float</span> <span class="n">m</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">f</span><span class="p">);</span> <span class="c1">// min of float</span>
<span class="kt">float3</span> <span class="n">m3</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">v</span><span class="p">);</span> <span class="c1">// min of float3</span>
<span class="n">float</span> <span class="n">dp3</span> <span class="o">=</span> <span class="nb">dot</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">v</span><span class="p">);</span> <span class="c1">// dot on float3</span>
<span class="n">float</span> <span class="n">dp4</span> <span class="o">=</span> <span class="nb">dot</span><span class="p">(</span><span class="n">v4</span><span class="p">,</span> <span class="n">v4</span><span class="p">);</span> <span class="c1">// dot on float4</span>
<span class="c1">// and so on..</span>
</code></pre></div></div>

<p>Writing graphics algorithms, gameplay code, or procedural generation code can build up quickly and become quite verbose. This is why I like commonly used functions such as <code class="language-plaintext highlighter-rouge">dot</code> or <code class="language-plaintext highlighter-rouge">cross</code> to just be in scope. I have seen maths libraries which end up with <code class="language-plaintext highlighter-rouge">vector.dotProduct(other)</code> all over the place and find this sort of thing hard to read and follow. A hill I will die on (and I know it’s an unpopular opinion in some circles, but here we go anyway) is that single letter / short variable names are OK. They can help with readability by making the code more compact, and with comments you can make the overall algorithm more readable, keeping the focus on the operations instead of the variable names. Having a million <code class="language-plaintext highlighter-rouge">myVector.dotProduct(someOtherVector)</code> calls generates so much noise when the equivalent could be <code class="language-plaintext highlighter-rouge">dot(v, o)</code>… A lot of mathematical notation is just single Greek symbols, so it comes with the territory; a lot of the stuff on <a href="https://www.shadertoy.com/view/4dfXDn">shadertoy</a> is full of single letter variables too… if you hate that kind of thing maybe don’t check shadertoy, you might have a heart attack.</p>

<h2 id="what-is-possible-in-rust">What is possible in Rust?</h2>

<p>I set out to see if it was possible to get what I wanted out of Rust. The aim was to get something as close as possible to my C++ library that looked and felt like a shader language. This would make it easier to port code between my C++ code base, shaders and Rust, while also making something that is ergonomic and fun to use. At this point I want to stress that most of this work was to get something that looks and feels how I wanted it to. I am not worrying about SIMD from the start; I will look into it in the future, but I was happy with a scalar implementation to begin with, and my C++ library is also a simple scalar implementation. I do have an interest in performance so I wanted to keep an eye on it, but the primary focus was ergonomics. For really heavy computations I would be inclined to use compute shaders or write specialised SIMD routines; this library is aimed more toward game mechanics and procedural generation.</p>

<p>The other maths libraries around take different approaches to the internals of a vector struct. Some of them implement concrete <code class="language-plaintext highlighter-rouge">Vec2</code>, <code class="language-plaintext highlighter-rouge">Vec3</code> and <code class="language-plaintext highlighter-rouge">Vec4</code> types, while others go for an entirely n-dimensional approach for wider linear algebra. For the kind of thing I am working on I want to be able to access <code class="language-plaintext highlighter-rouge">.x</code> or <code class="language-plaintext highlighter-rouge">.y</code> members; I tend to do a lot of this for gameplay code, flattening movement onto an xz-plane by setting <code class="language-plaintext highlighter-rouge">vec.y = 0.0</code>, so this made the n-dimensional approach less appealing. I only need 2, 3 and 4 dimensional vectors, so having 3 implementations isn’t too bad, but it is a bit of repetition and I wanted to try and consolidate it like I did with C++ templates. It is also possible to provide the n-dimensional-style <code class="language-plaintext highlighter-rouge">Index</code> operator <code class="language-plaintext highlighter-rouge">[i]</code> to get the best of both worlds.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// this allows for generic sized vectors of type `T`, but no access to members such as v.x, v.y, v.z.</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">VecN</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="k">const</span> <span class="n">N</span><span class="p">:</span> <span class="nb">usize</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="n">v</span><span class="p">:</span> <span class="p">[</span><span class="n">T</span><span class="p">;</span> <span class="n">N</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In C++ it’s possible to use an un-named union to make a vector which is both an array and a struct of named members. This way, for 1-4 dimensional vectors, you can access the <code class="language-plaintext highlighter-rouge">.xyzw</code> data members.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">Vec</span><span class="o">&lt;</span><span class="mi">3</span><span class="p">,</span> <span class="n">T</span><span class="o">&gt;</span>
<span class="p">{</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="n">T</span> <span class="n">v</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span>
        <span class="k">struct</span>
        <span class="p">{</span>
            <span class="n">T</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">;</span>
        <span class="p">};</span>
        <span class="k">struct</span>
        <span class="p">{</span>
            <span class="n">T</span> <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">;</span>
        <span class="p">};</span>
        <span class="n">swizzle_v3</span><span class="p">;</span>
    <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Rust has a feature <a href="https://internals.rust-lang.org/t/pre-rfc-anonymous-struct-and-union-types/3894">proposal</a> for anonymous unions, so in the future this could become possible and allow direct access to the data array or to members via a union. But for the time being it is not possible, so we can use the <code class="language-plaintext highlighter-rouge">Index</code> operator instead.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// this allows fixed sized vectors to have access to members such as v.x, v.y, v.z</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">Vec3</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
  <span class="n">x</span><span class="p">:</span> <span class="n">T</span><span class="p">,</span>
  <span class="n">y</span><span class="p">:</span> <span class="n">T</span><span class="p">,</span>
  <span class="n">z</span><span class="p">:</span> <span class="n">T</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">Index</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Vec3</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">type</span> <span class="n">Output</span> <span class="o">=</span> <span class="n">T</span><span class="p">;</span>
    <span class="k">fn</span> <span class="nf">index</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">i</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="n">T</span> <span class="p">{</span>
        <span class="k">match</span> <span class="n">i</span> <span class="p">{</span>
            <span class="mi">0</span> <span class="k">=&gt;</span> <span class="o">&amp;</span><span class="k">self</span><span class="py">.x</span><span class="p">,</span>
            <span class="mi">1</span> <span class="k">=&gt;</span> <span class="o">&amp;</span><span class="k">self</span><span class="py">.y</span><span class="p">,</span>
            <span class="mi">2</span> <span class="k">=&gt;</span> <span class="o">&amp;</span><span class="k">self</span><span class="py">.z</span><span class="p">,</span>
            <span class="n">_</span> <span class="k">=&gt;</span> <span class="o">&amp;</span><span class="k">self</span><span class="py">.z</span> <span class="c1">// clamp out of bounds access? </span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">test</span><span class="p">(</span><span class="n">v</span><span class="p">:</span> <span class="n">Vec3</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// access like array</span>
  <span class="k">let</span> <span class="n">fx</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
  <span class="c1">// access as member</span>
  <span class="k">let</span> <span class="n">fx</span> <span class="o">=</span> <span class="n">v</span><span class="py">.x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Debug performance is important and I wanted to try and keep the codegen as simple and as lean as possible. Coming from a C++ background I have a natural intuition for compiler code generation in different scenarios. I set up a small <a href="https://godbolt.org/z/YsP8393xa">example</a> on godbolt to illustrate this with 2 dot product functions: one which operates directly on floats and another which uses a template function. Before writing this I expected both to come out the same, because all of the template work is done at compile time. The un-optimised version is not too far from optimisation level 1.</p>

<p>I was not prepared for what I found when I did the same in Rust; again, here is an <a href="https://rust.godbolt.org/z/qMdTv6v1E">example</a> on godbolt. Even switching to very basic implementations, I found un-optimised Rust code generation when using generics to be significantly more bloated than a more direct implementation, even though all of the generics are handled at compile time. In the Rust version you can switch the compiler optimisation level to 1 and see that the resulting code generation for the two dot product functions is identical… In the C++ example both result in the same code generation regardless of optimisation level.</p>
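<p>A minimal pair along these lines can be pasted into godbolt to reproduce the comparison (this is a sketch of the setup, not the exact code from the linked example; the trait bounds here stand in for the library’s <code class="language-plaintext highlighter-rouge">Number</code> trait):</p>

```rust
use core::ops::{Add, Mul};

// concrete implementation: operates directly on f32 components
fn dot_concrete(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

// generic implementation: the same maths, monomorphised from a generic;
// this is where the extra un-optimised codegen shows up
fn dot_generic<T: Copy + Add<Output = T> + Mul<Output = T>>(a: [T; 3], b: [T; 3]) -> T {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn main() {
    let v = [1.0_f32, 2.0, 3.0];
    println!("{}", dot_concrete(v, v)); // 14
    println!("{}", dot_generic(v, v));  // same result, different debug codegen
}
```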

<p>I was a little disappointed by the complexity of the code generation in debug builds; I am yet to build anything large scale, so time will tell how much it matters. In the end I decided to continue on the generic path and accept the need for optimisation; I will profile the code in some real-world scenarios when I get the chance.</p>

<h2 id="macros-and-generics-vs-c-templates">Macros and Generics vs C++ Templates</h2>

<p>I discovered that macros could be used to generate the concrete implementations for fixed sized vectors. This acted a bit like C++ templates, which was somewhat of a surprise. Until this point I had assumed Rust generics were the equivalent of C++ templates, but in reality I came to understand that C++ templates are quite different. A C++ template won’t compile until it is instantiated, which allows you to write any old code inside a templated function:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">TestStruct</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">member</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="kt">float</span> <span class="n">function</span><span class="p">(</span><span class="n">T</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">.</span><span class="n">member</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here the template accesses a struct member <code class="language-plaintext highlighter-rouge">s.member</code>. As long as every <code class="language-plaintext highlighter-rouge">T</code> you instantiate has a member called <code class="language-plaintext highlighter-rouge">member</code> whose type is a <code class="language-plaintext highlighter-rouge">float</code>, everything will be OK. If you have a struct that does not have a <code class="language-plaintext highlighter-rouge">member</code> then you will get a compile error when you try to instantiate the template with it, but until that time comes you can implement the template function however you like.</p>

<p>Rust generics don’t allow this: you cannot access raw data members, and you need to create traits with associated methods or functions and supply trait bounds to access them in generic functions. This prevents you from writing generic functions that do not satisfy their trait bounds in the first place. Rust declarative macros work a bit more like C++ templates; you can write any old code inside the macro and you only find out whether it compiles when you invoke the macro.</p>
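<p>To make this concrete, the C++ example above can only be expressed in Rust generics by routing the member access through a trait; a minimal sketch (the <code class="language-plaintext highlighter-rouge">HasMember</code> trait name is made up for illustration):</p>

```rust
// the trait encodes the C++ template's implicit assumption that
// "T has a float member"; without it the generic function cannot compile
trait HasMember {
    fn member(&self) -> f32;
}

struct TestStruct {
    member: f32,
}

impl HasMember for TestStruct {
    fn member(&self) -> f32 {
        self.member
    }
}

// the trait bound is checked when the function is defined,
// not when it is instantiated as in C++
fn function<T: HasMember>(s: T) -> f32 {
    s.member()
}

fn main() {
    println!("{}", function(TestStruct { member: 1.0 })); // 1
}
```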

<p>This was my first time using macros; at first they were quite tricky to get my head around, but I eventually got used to the syntax. I tried to use them as much as possible, but found that things got quite complicated when I needed nested repetitions, so I opted for some manual work instead of packing everything inside macros:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// macro implementation of vec struct for Vec2, Vec3 and Vec4</span>
<span class="nd">macro_rules!</span> <span class="n">vec_impl</span> <span class="p">{</span>
    <span class="p">(</span><span class="nv">$VecN:ident</span> <span class="p">{</span> <span class="nv">$</span><span class="p">(</span><span class="nv">$field:ident</span><span class="p">,</span> <span class="nv">$field_index:expr</span><span class="p">),</span><span class="o">*</span> <span class="p">},</span> <span class="nv">$len:expr</span><span class="p">,</span> <span class="nv">$module:ident</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>
        <span class="nd">#[derive(Debug,</span> <span class="nd">Copy,</span> <span class="nd">Clone)]</span>
        <span class="nd">#[repr(C)]</span>
        <span class="k">pub</span> <span class="k">struct</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
            <span class="nv">$</span><span class="p">(</span><span class="k">pub</span> <span class="nv">$field</span><span class="p">:</span> <span class="n">T</span><span class="p">,)</span><span class="o">+</span>
        <span class="p">}</span>

        <span class="c1">//... more macro code in / implementations in here</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="nd">vec_impl!</span><span class="p">(</span><span class="n">Vec2</span> <span class="p">{</span> <span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">1</span> <span class="p">},</span> <span class="mi">2</span><span class="p">,</span> <span class="n">v2</span><span class="p">);</span>
<span class="nd">vec_impl!</span><span class="p">(</span><span class="n">Vec3</span> <span class="p">{</span> <span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="mi">2</span> <span class="p">},</span> <span class="mi">3</span><span class="p">,</span> <span class="n">v3</span><span class="p">);</span>
<span class="nd">vec_impl!</span><span class="p">(</span><span class="n">Vec4</span> <span class="p">{</span> <span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="mi">3</span> <span class="p">},</span> <span class="mi">4</span><span class="p">,</span> <span class="n">v4</span><span class="p">);</span>

<span class="c1">// manual implementation of dot products for Vec2, Vec3 and Vec4</span>
<span class="cd">/// trait for dot product</span>
<span class="k">pub</span> <span class="k">trait</span> <span class="n">Dot</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="cd">/// vector dot-product</span>
    <span class="k">fn</span> <span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="k">Self</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="k">Self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">T</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">Dot</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Vec2</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">where</span> <span class="n">T</span><span class="p">:</span> <span class="n">Number</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="k">Self</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="k">Self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">T</span> <span class="p">{</span>
        <span class="n">a</span><span class="py">.x</span> <span class="o">*</span> <span class="n">b</span><span class="py">.x</span> <span class="o">+</span> <span class="n">a</span><span class="py">.y</span> <span class="o">*</span> <span class="n">b</span><span class="py">.y</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">Dot</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Vec3</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">where</span> <span class="n">T</span><span class="p">:</span> <span class="n">Number</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="k">Self</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="k">Self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">T</span> <span class="p">{</span>
        <span class="n">a</span><span class="py">.x</span> <span class="o">*</span> <span class="n">b</span><span class="py">.x</span> <span class="o">+</span> <span class="n">a</span><span class="py">.y</span> <span class="o">*</span> <span class="n">b</span><span class="py">.y</span> <span class="o">+</span> <span class="n">a</span><span class="py">.z</span> <span class="o">*</span> <span class="n">b</span><span class="py">.z</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="n">Dot</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Vec4</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">where</span> <span class="n">T</span><span class="p">:</span> <span class="n">Number</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="k">Self</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="k">Self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">T</span> <span class="p">{</span>
        <span class="n">a</span><span class="py">.x</span> <span class="o">*</span> <span class="n">b</span><span class="py">.x</span> <span class="o">+</span> <span class="n">a</span><span class="py">.y</span> <span class="o">*</span> <span class="n">b</span><span class="py">.y</span> <span class="o">+</span> <span class="n">a</span><span class="py">.z</span> <span class="o">*</span> <span class="n">b</span><span class="py">.z</span> <span class="o">+</span> <span class="n">a</span><span class="py">.w</span> <span class="o">*</span> <span class="n">b</span><span class="py">.w</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I ran into trouble with repetitions and horizontal operations on a vector. Having to add <code class="language-plaintext highlighter-rouge">+</code> to chain together repetitions means that on something like a dot product you get <code class="language-plaintext highlighter-rouge">v.x + v.y + v.z +</code>; the trailing <code class="language-plaintext highlighter-rouge">+</code> kills compilation. For some other things, such as struct initialization, this is OK because Rust allows a trailing <code class="language-plaintext highlighter-rouge">,</code>. I had a look into the <code class="language-plaintext highlighter-rouge">?</code> operator to run a repetition only once but couldn’t get it to work; maybe I am doing something wrong. I will revisit this at some point to try and get a better result. For the <code class="language-plaintext highlighter-rouge">Eq</code> op I had the same issue, wanting to chain horizontal checks with <code class="language-plaintext highlighter-rouge">&amp;&amp;</code>; here, inside the macro, I just whack a <code class="language-plaintext highlighter-rouge">true</code> on the end after the repetition closes:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">macro_rules!</span> <span class="n">vec_impl</span> <span class="p">{</span>
    <span class="p">(</span><span class="nv">$VecN:ident</span> <span class="p">{</span> <span class="nv">$</span><span class="p">(</span><span class="nv">$field:ident</span><span class="p">,</span> <span class="nv">$field_index:expr</span><span class="p">),</span><span class="o">*</span> <span class="p">},</span> <span class="nv">$len:expr</span><span class="p">,</span> <span class="nv">$module:ident</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>
      <span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="nb">Eq</span> <span class="k">for</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">where</span> <span class="n">T</span><span class="p">:</span> <span class="nb">Eq</span>  <span class="p">{}</span>
      <span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="nb">PartialEq</span> <span class="k">for</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="k">where</span> <span class="n">T</span><span class="p">:</span> <span class="nb">PartialEq</span>  <span class="p">{</span>
          <span class="k">fn</span> <span class="nf">eq</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">other</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">Self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
              <span class="nv">$</span><span class="p">(</span><span class="k">self</span>.<span class="nv">$field</span> <span class="o">==</span> <span class="n">other</span>.<span class="nv">$field</span> <span class="o">&amp;&amp;</span><span class="p">)</span><span class="o">+</span>
              <span class="kc">true</span> <span class="c1">// redundant check just to get it to compile</span>
          <span class="p">}</span>
      <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For the dot product the same could be achieved by adding a <code class="language-plaintext highlighter-rouge">T::zero()</code> at the start or the end to make the repetition work. Because it’s a constant zero it should be optimised away, but maybe not in debug? I did some tests to see: indeed, it ended up generating 42 more assembly instructions in an un-optimised build, and even a few extra instructions at optimisation level 1. In the end I decided to create and implement a <code class="language-plaintext highlighter-rouge">Dot</code> trait to avoid this, so I did not have to rely on compiler optimisation; at this point, with the amount of code that gets generated from the trait implementation, it did feel a little like pissing in the wind, but it’s better than nothing?</p>
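<p>For reference, the zero-seeded version of the repetition I tested looked something like this (a sketch; the <code class="language-plaintext highlighter-rouge">Number</code> trait here is cut down to just what the example needs):</p>

```rust
// cut-down stand-in for the library's numeric trait
trait Number: Copy + std::ops::Add<Output = Self> + std::ops::Mul<Output = Self> {
    fn zero() -> Self;
}

impl Number for f32 {
    fn zero() -> Self { 0.0 }
}

macro_rules! dot_impl {
    ($VecN:ident { $($field:ident),* }) => {
        pub struct $VecN<T> {
            $(pub $field: T,)*
        }

        impl<T: Number> $VecN<T> {
            // the trailing `+` left by the repetition is soaked up by T::zero()
            pub fn dot(a: Self, b: Self) -> T {
                $(a.$field * b.$field +)* T::zero()
            }
        }
    }
}

dot_impl!(Vec3 { x, y, z });

fn main() {
    let a = Vec3 { x: 1.0_f32, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 4.0_f32, y: 5.0, z: 6.0 };
    println!("{}", Vec3::dot(a, b)); // 32
}
```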

<h2 id="associated-methods-associated-functions-and-generic-functions">Associated Methods, Associated Functions and Generic Functions</h2>

<p>Another hurdle was to get around Rust’s lack of function overloading. The other libraries I trialled implement dot like so: <code class="language-plaintext highlighter-rouge">v.dot(x)</code>, i.e. as an associated method <code class="language-plaintext highlighter-rouge">dot(self, other: Self)</code>. While it’s a small difference, it is not what I wanted, which is <code class="language-plaintext highlighter-rouge">dot(v, x)</code>, implemented as an associated function <code class="language-plaintext highlighter-rouge">dot(a: Self, b: Self)</code>. The main problem then becomes how to call <code class="language-plaintext highlighter-rouge">dot(x, v)</code> without having to qualify it with the vector type as <code class="language-plaintext highlighter-rouge">Vec3::dot(x, v)</code>. The <code class="language-plaintext highlighter-rouge">Vec3::</code> adds to code verbosity and I was keen to eliminate it.</p>

<p>Generics come to the rescue here by allowing a generic <code class="language-plaintext highlighter-rouge">dot</code> to be implemented that can take any width of vector. For any types implementing the <code class="language-plaintext highlighter-rouge">Dot</code> trait (which is all of the vector sizes I care about) we can make a single function which uses the <code class="language-plaintext highlighter-rouge">Dot</code> as a trait bound.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// returns the vector dot product between a . b</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">dot</span><span class="o">&lt;</span><span class="n">T</span><span class="p">:</span> <span class="n">Number</span><span class="p">,</span> <span class="n">V</span><span class="p">:</span> <span class="n">VecN</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="n">V</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">V</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">T</span> <span class="p">{</span>
    <span class="nn">V</span><span class="p">::</span><span class="nf">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It then dawned on me that I could share traits between vectors and scalar numerical types, so that I could implement common functions such as <code class="language-plaintext highlighter-rouge">min, max, clamp</code> etc. In order to do this the number of traits exploded a little, but the payoff was tons of useful generic functions that can be called on any type (trait bounds permitting), giving the same look and feel as a shader language.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// returns the maximum of a and b</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">max</span><span class="o">&lt;</span><span class="n">T</span><span class="p">:</span> <span class="n">Number</span><span class="p">,</span> <span class="n">V</span><span class="p">:</span> <span class="n">NumberOps</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="n">a</span><span class="p">:</span> <span class="n">V</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">V</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">V</span> <span class="p">{</span>
    <span class="nn">V</span><span class="p">::</span><span class="nf">max</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>At first I was slightly confused as to why Rust doesn’t provide its own numerical traits. I noticed a few crates which implement traits for numerical types and operations, but I wasn’t sure which ones to use or why. There was also historical mention of numerical traits in Rust; I later discovered they were removed from the standard library. To navigate this minefield I decided to implement my own traits covering exactly what I needed: a <code class="language-plaintext highlighter-rouge">Base</code> trait (implemented by both vectors and scalars), <code class="language-plaintext highlighter-rouge">Number</code> for floats and ints, <code class="language-plaintext highlighter-rouge">Float</code> for floats, and <code class="language-plaintext highlighter-rouge">SignedNumber</code> for signed types (signed integers and floats), along with traits for the operations that can be performed on those types: <code class="language-plaintext highlighter-rouge">NumberOps</code>, <code class="language-plaintext highlighter-rouge">SignedNumberOps</code> and <code class="language-plaintext highlighter-rouge">FloatOps</code>. The base type traits are really just aggregations of arithmetic ops so they can be used as trait bounds inside other traits. The <code class="language-plaintext highlighter-rouge">Ops</code> traits supply operations such as <code class="language-plaintext highlighter-rouge">floor</code>, <code class="language-plaintext highlighter-rouge">ceil</code> and <code class="language-plaintext highlighter-rouge">round</code> on floats, and <code class="language-plaintext highlighter-rouge">min</code>, <code class="language-plaintext highlighter-rouge">max</code> and <code class="language-plaintext highlighter-rouge">clamp</code> on numbers.
The vectors also implement <code class="language-plaintext highlighter-rouge">NumberOps</code>, <code class="language-plaintext highlighter-rouge">FloatOps</code> and <code class="language-plaintext highlighter-rouge">SignedNumberOps</code> where the base <code class="language-plaintext highlighter-rouge">T</code> used in <code class="language-plaintext highlighter-rouge">Vec&lt;T&gt;</code> supports the operations.</p>

<p>This is where I became very familiar with <code class="language-plaintext highlighter-rouge">where</code> clauses. It’s really quite cool to implement <code class="language-plaintext highlighter-rouge">NumberOps</code> for <code class="language-plaintext highlighter-rouge">Vec&lt;T&gt;</code> where <code class="language-plaintext highlighter-rouge">T: NumberOps</code> or <code class="language-plaintext highlighter-rouge">FloatOps</code> for <code class="language-plaintext highlighter-rouge">Vec&lt;T&gt; where T: FloatOps</code>. So we are saying for any vector of <code class="language-plaintext highlighter-rouge">i32</code> we get number ops and for any vector of <code class="language-plaintext highlighter-rouge">float</code> we get both number ops and float ops. After implementing the various combinations of traits this gives the flexibility to supply trait bounds to generic functions, which allows me to use scalar or vector types!</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">f</span> <span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="nf">min</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">v</span> <span class="o">=</span> <span class="nf">min</span><span class="p">(</span><span class="nf">vec2f</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span> <span class="nf">vec2f</span><span class="p">(</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">));</span>
</code></pre></div></div>
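<p>A fuller self-contained sketch of this pattern might look like the following. The trait and type names here are hypothetical, cut-down stand-ins (a single-parameter <code class="language-plaintext highlighter-rouge">NumberOps</code> rather than the library’s real generic signatures), just to show how one free function can serve scalars and vectors alike:</p>

```rust
// Hypothetical minimal Vec2, for illustration only
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec2<T> {
    pub x: T,
    pub y: T,
}

// One trait shared by scalars and vectors
pub trait NumberOps {
    fn min(a: Self, b: Self) -> Self;
}

// scalars implement it directly
impl NumberOps for f32 {
    fn min(a: Self, b: Self) -> Self {
        if a < b { a } else { b }
    }
}

// vectors get it whenever their element type has it
impl<T: Copy + NumberOps> NumberOps for Vec2<T> {
    fn min(a: Self, b: Self) -> Self {
        Vec2 {
            x: T::min(a.x, b.x),
            y: T::min(a.y, b.y),
        }
    }
}

/// one generic free function covers both scalar and vector calls
pub fn min<V: NumberOps>(a: V, b: V) -> V {
    V::min(a, b)
}
```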

<h2 id="from-with-tuples">From With Tuples</h2>

<p>I found the <code class="language-plaintext highlighter-rouge">From</code> trait to be quite useful. We can implement <code class="language-plaintext highlighter-rouge">From</code> multiple times with different generic arguments <code class="language-plaintext highlighter-rouge">From&lt;T&gt;</code>, so I used <code class="language-plaintext highlighter-rouge">From</code> to construct different sized vectors from one another, truncating when assigning to smaller sizes or extending with zeros for larger sizes. One thing that isn’t possible, though, is for the conversion to take multiple values: the trait expects a single parameter passed into the <code class="language-plaintext highlighter-rouge">fn from(other: T)</code> function.</p>

<p>Tuples provide almost the same functionality; when this idea occurred to me it felt like a cool hack! All it adds is the need for an extra pair of parentheses to pass multiple values through <code class="language-plaintext highlighter-rouge">From</code>. By using tuples in <code class="language-plaintext highlighter-rouge">From</code> implementations I was able to create various combinations of constructors for vectors of different sizes, combined together or with scalar values.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">v4</span> <span class="o">=</span> <span class="nn">Vec4f</span><span class="p">::</span><span class="nf">from</span><span class="p">((</span><span class="n">v2</span><span class="p">,</span> <span class="n">v2</span><span class="p">));</span> <span class="c1">// vec4 from 2x v2's</span>
<span class="k">let</span> <span class="n">v3</span> <span class="o">=</span> <span class="nn">Vec3f</span><span class="p">::</span><span class="nf">from</span><span class="p">((</span><span class="n">v2</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">));</span> <span class="c1">// vec3 from 1x v2 and 1x scalar</span>
<span class="k">let</span> <span class="n">v2</span> <span class="o">=</span> <span class="nn">Vec2f</span><span class="p">::</span><span class="nf">from</span><span class="p">((</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">));</span> <span class="c1">// vec2 from 2x scalars</span>
<span class="k">let</span> <span class="n">v4</span> <span class="o">=</span> <span class="nn">Vec4f</span><span class="p">::</span><span class="nf">from</span><span class="p">((</span><span class="n">v2</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">));</span> <span class="c1">// vec4 from 1x v2 and 2x scalars</span>
<span class="k">let</span> <span class="n">v4</span> <span class="o">=</span> <span class="nn">Vec4f</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">v2</span><span class="p">);</span> <span class="c1">// vec4 from vec2 (splat 0's)</span>
<span class="k">let</span> <span class="n">v2</span> <span class="o">=</span> <span class="nn">Vec2f</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">v4</span><span class="p">);</span> <span class="c1">// vec2 from vec4 (truncate)</span>

<span class="c1">// construct rows from tuples</span>
<span class="k">let</span> <span class="n">m3v</span> <span class="o">=</span> <span class="nn">Mat3f</span><span class="p">::</span><span class="nf">from</span><span class="p">((</span>
    <span class="nf">vec3f</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">),</span>
    <span class="nf">vec3f</span><span class="p">(</span><span class="mf">4.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">),</span>
    <span class="nf">vec3f</span><span class="p">(</span><span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">,</span> <span class="mf">9.0</span><span class="p">)</span>
<span class="p">));</span>
</code></pre></div></div>

<p>Tuples also worked well to construct matrices from rows of vectors, or scalar values.</p>
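<p>A sketch of how one such tuple conversion might be implemented, using hypothetical cut-down types rather than the library’s actual code:</p>

```rust
// Hypothetical minimal vector types, for illustration only
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec2<T> {
    pub x: T,
    pub y: T,
}

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec3<T> {
    pub x: T,
    pub y: T,
    pub z: T,
}

// the single `from` parameter is a tuple, so the call site only needs
// one extra pair of parentheses: Vec3::from((v2, 1.0))
impl<T: Copy> From<(Vec2<T>, T)> for Vec3<T> {
    fn from(other: (Vec2<T>, T)) -> Self {
        Vec3 {
            x: other.0.x,
            y: other.0.y,
            z: other.1,
        }
    }
}
```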

<h2 id="test-driven-development">Test Driven Development</h2>

<p>After ironing out the main structure of the API, with all of the numerical traits, operations, and vector and scalar combinations, I started on the grunt work of implementing functions. A maths library needs quite a lot of them, but one nice thing is that they are very small, very unit-testable pieces of work. I went pure TDD in a lot of cases, writing the test first and then implementing the functionality. Not strictly in all cases (I did write some functions first and then the tests later, soz… sue me!), but due to the nature of the code, iterating with tests made this process really enjoyable. In total there are currently 110 tests covering most areas; there are still a few missing pieces I will be adding over time, and I hope to get some code coverage tools working to aid that process. You can take a look at the current tests <a href="https://github.com/polymonster/maths-rs/blob/master/tests/tests.rs">here</a>.</p>

<p>I am lucky enough to have a few different work laptops, one of which is an M1 MacBook Air that I had previously used only for building and testing compatibility of the M1 and x86 builds. As I was going away a few times over the summer I decided to bring this laptop with me, since it is small and lightweight compared to my MacBook Pro 16”. It was a revelation: I was able to code in all sorts of places, on planes, on trains and even by the pool! Combined with the easily unit-testable nature of maths code, this made light work of the whole thing. On holiday I spent just a few minutes here and there implementing another couple of tests and another couple of functions, and over a 9-week period the whole thing came together, chipped away at a small piece at a time.</p>

<p>For a lot of the tests I wrote some simple examples by hand; this included all of the vector and matrix arithmetic, constructors and so forth, which are fairly easy to write down on paper. For the intersection tests things get a bit more interesting. It’s easy to come up with a few trivial examples (such as the intersection of 2 axis-aligned lines), which I added, but for the rest I already had a pretty comprehensive set generated from this visual <a href="https://www.polymonster.co.uk/pmtech/examples/maths_functions.html">demo</a> of my C++ library. I made interactive 3D samples of all of the available intersection tests, verified their correctness visually, and then used code generation to produce the tests: a Python script converts the C++ test code into Rust test code.</p>

<p>A great benefit of having tests is the ability to refactor. As things progressed I saw new opportunities to make things more generic, which required refactoring some traits. The tests also helped me keep the API consistent after spotting small issues, such as <code class="language-plaintext highlighter-rouge">point_inside_aabb</code> having its arguments ordered as <code class="language-plaintext highlighter-rouge">(aabb_min: V, aabb_max: V, p: V)</code>, which doesn’t read well because the arguments are in the opposite order to the function name. Some of these inconsistencies came from my C++ maths library, which is in use in a few projects and therefore harder to refactor; it was nice to unify all of these details here.</p>
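<p>For illustration, a reordered signature that reads in the same direction as its name might look like this (a hypothetical 2D, <code class="language-plaintext highlighter-rouge">f32</code>-only sketch, not the library’s generic version):</p>

```rust
// Hypothetical minimal 2D vector, for illustration only
#[derive(Clone, Copy)]
pub struct Vec2f {
    pub x: f32,
    pub y: f32,
}

/// "point inside aabb" reads left to right: the point first,
/// then the box it is tested against
pub fn point_inside_aabb(p: Vec2f, aabb_min: Vec2f, aabb_max: Vec2f) -> bool {
    p.x >= aabb_min.x && p.x <= aabb_max.x &&
    p.y >= aabb_min.y && p.y <= aabb_max.y
}
```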

<h2 id="features">Features</h2>

<h3 id="swizzles">Swizzles</h3>

<p>Vector swizzling is a handy feature in shader languages, and getting support for it into this Rust library feels like coming full circle. I wrote my first Rust program <a href="https://github.com/polymonster/permute">permute</a> whilst on holiday in 2019; the program can output all permutation combinations of some given inputs in a given output format. I used it at the time to generate C++ template code for vector swizzles in my C++ maths <a href="https://github.com/polymonster/maths/blob/master/swizzle.h">library</a>.</p>

<p>I adapted the source slightly to output swizzles for Rust. I couldn’t quite get the swizzles to the same shader-style degree as the C++ implementation; it might be possible in the future with support for unnamed unions, but for the time being I generated traits and functions to return swizzled vectors of various sizes, along with a collection of <code class="language-plaintext highlighter-rouge">set</code> methods.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// swizzling</span>
<span class="k">let</span> <span class="n">wxyz</span> <span class="o">=</span> <span class="n">v4</span><span class="nf">.wxyz</span><span class="p">();</span> <span class="c1">// swizzle</span>
<span class="k">let</span> <span class="n">xyz</span> <span class="o">=</span> <span class="n">v4</span><span class="nf">.xyz</span><span class="p">();</span> <span class="c1">// truncate</span>
<span class="k">let</span> <span class="n">xxx</span> <span class="o">=</span> <span class="n">v4</span><span class="nf">.xxx</span><span class="p">();</span> <span class="c1">// and so on..</span>
<span class="k">let</span> <span class="n">xy</span> <span class="o">=</span> <span class="n">v3</span><span class="nf">.yx</span><span class="p">();</span> <span class="c1">// ..</span>

<span class="c1">// mutable swizzles</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">x</span> <span class="o">=</span> <span class="nn">Vec4f</span><span class="p">::</span><span class="nf">zero</span><span class="p">();</span>
<span class="n">x</span><span class="nf">.set_xwyz</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="c1">// set swizzle</span>
<span class="n">x</span><span class="nf">.set_xy</span><span class="p">(</span><span class="n">v</span><span class="nf">.yx</span><span class="p">());</span> <span class="c1">// assign truncated</span>
<span class="n">x</span><span class="nf">.set_yzx</span><span class="p">(</span><span class="n">v</span><span class="nf">.zzz</span><span class="p">());</span> <span class="c1">// etc.. </span>
</code></pre></div></div>
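<p>A sketch of the kind of code the generator emits, as a hypothetical hand-written subset (the real library generates every permutation for every vector size):</p>

```rust
// Hypothetical minimal vector types, for illustration only
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec2<T> {
    pub x: T,
    pub y: T,
}

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec3<T> {
    pub x: T,
    pub y: T,
    pub z: T,
}

impl<T: Copy> Vec3<T> {
    // read-only swizzles return a new (possibly smaller) vector
    pub fn yx(self) -> Vec2<T> {
        Vec2 { x: self.y, y: self.x }
    }

    // set_* methods write a swizzled subset of components in place
    pub fn set_xy(&mut self, v: Vec2<T>) {
        self.x = v.x;
        self.y = v.y;
    }
}
```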

<h3 id="left-hand-sided-scalar---vector-arithmetic">Left-Hand Sided Scalar - Vector Arithmetic</h3>

<p>In order to support left-hand-side scalar multiplication with vectors (as supported in shader languages) I had to implement the arithmetic ops on foreign types to make this sort of thing possible:</p>

<div class="language-hlsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// multiplying a scalar by vector results in a vector</span>
<span class="kt">float3</span> <span class="n">v</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">*</span> <span class="nf">float3</span><span class="p">(</span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>

<p>Initially I tried to do this once for all vectors of type <code class="language-plaintext highlighter-rouge">&lt;T&gt;</code>, but this is not permitted and results in the following error:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">impl</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span> <span class="nb">Add</span><span class="o">&lt;</span><span class="n">Vec2</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;&gt;</span> <span class="k">for</span> <span class="n">T</span> <span class="p">{</span>
  <span class="c1">// ^ type parameter `T` must be covered by another type when it appears before the first local type (`Vec2&lt;T&gt;`)</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I had to implement the ops for each primitive type I wanted, which meant committing to concrete vector types; a macro let me stamp out the implementations without repeating myself.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cd">/// macro to stamp out all arithmetic ops for lhs scalars</span>
<span class="nd">macro_rules!</span> <span class="n">vec_scalar_lhs</span> <span class="p">{</span>
    <span class="p">(</span><span class="nv">$VecN:ident</span> <span class="p">{</span> <span class="nv">$</span><span class="p">(</span><span class="nv">$field:ident</span><span class="p">),</span><span class="o">+</span> <span class="p">},</span> <span class="nv">$t:ident</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>
        <span class="k">impl</span> <span class="nb">Add</span><span class="o">&lt;</span><span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;&gt;</span> <span class="k">for</span> <span class="nv">$t</span> <span class="p">{</span>
            <span class="k">type</span> <span class="n">Output</span> <span class="o">=</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;</span><span class="p">;</span>
            <span class="k">fn</span> <span class="nf">add</span><span class="p">(</span><span class="k">self</span><span class="p">,</span> <span class="n">other</span><span class="p">:</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;</span> <span class="p">{</span>
                <span class="nv">$VecN</span> <span class="p">{</span>
                    <span class="nv">$</span><span class="p">(</span><span class="nv">$field</span><span class="p">:</span> <span class="k">self</span> <span class="o">+</span> <span class="n">other</span>.<span class="nv">$field</span><span class="p">,)</span><span class="o">+</span>
                <span class="p">}</span>
            <span class="p">}</span>
        <span class="p">}</span>

        <span class="c1">// other ops go here...</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
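<p>As a minimal concrete sketch (hypothetical <code class="language-plaintext highlighter-rouge">Vec2</code> type, no macro), the compiler accepts the impl once the scalar is a concrete primitive rather than an uncovered type parameter:</p>

```rust
use std::ops::Add;

// Hypothetical minimal vector type, for illustration only
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec2<T> {
    pub x: T,
    pub y: T,
}

// allowed by the orphan rules: f32 is concrete and Vec2<f32> is a
// fully concrete local type, so no type parameter appears uncovered
impl Add<Vec2<f32>> for f32 {
    type Output = Vec2<f32>;
    fn add(self, other: Vec2<f32>) -> Vec2<f32> {
        Vec2 {
            x: self + other.x,
            y: self + other.y,
        }
    }
}
```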

<h3 id="shorthand-constructors">Shorthand Constructors</h3>

<p>I also added shorthand constructors which look like <code class="language-plaintext highlighter-rouge">glsl</code>. Again I needed to stamp out concrete implementations, so I committed to the following types for both the constructors and for left-hand-side scalar arithmetic:</p>

<p><code class="language-plaintext highlighter-rouge">vecf</code> = 32-bit float
<code class="language-plaintext highlighter-rouge">vecd</code> = 64-bit float
<code class="language-plaintext highlighter-rouge">veci</code> = 32-bit signed integer
<code class="language-plaintext highlighter-rouge">vecu</code> = 32-bit unsigned integer</p>
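<p>A sketch of one such shorthand, with hypothetical cut-down definitions (the real library stamps these out for every size and primitive type with a macro):</p>

```rust
// Hypothetical minimal vector type, for illustration only
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Vec3<T> {
    pub x: T,
    pub y: T,
    pub z: T,
}

/// 32-bit float alias, matching the `vecf` family above
pub type Vec3f = Vec3<f32>;

/// glsl-style shorthand constructor
pub fn vec3f(x: f32, y: f32, z: f32) -> Vec3f {
    Vec3 { x, y, z }
}
```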

<h3 id="from-primitive-casts">From (primitive casts)</h3>

<p>I also added a macro which creates <code class="language-plaintext highlighter-rouge">From</code> implementations between these primitive types so you can cast, for example, between a vector of ints and a vector of floats.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">macro_rules!</span> <span class="n">vec_cast</span> <span class="p">{</span>
    <span class="p">(</span><span class="nv">$VecN:ident</span> <span class="p">{</span> <span class="nv">$</span><span class="p">(</span><span class="nv">$field:ident</span><span class="p">),</span><span class="o">+</span> <span class="p">},</span> <span class="nv">$t:ident</span><span class="p">,</span> <span class="nv">$u:ident</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>
        <span class="k">impl</span> <span class="nb">From</span><span class="o">&lt;</span><span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$u</span><span class="o">&gt;&gt;</span> <span class="k">for</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;</span> <span class="p">{</span>
            <span class="k">fn</span> <span class="nf">from</span><span class="p">(</span><span class="n">other</span><span class="p">:</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$u</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;</span> <span class="p">{</span>
                <span class="nv">$VecN</span> <span class="p">{</span>
                    <span class="nv">$</span><span class="p">(</span><span class="nv">$field</span><span class="p">:</span> <span class="n">other</span>.<span class="nv">$field</span> <span class="k">as</span> <span class="nv">$t</span><span class="p">,)</span><span class="o">+</span>
                <span class="p">}</span>
            <span class="p">}</span>
        <span class="p">}</span>

        <span class="k">impl</span> <span class="nb">From</span><span class="o">&lt;</span><span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;&gt;</span> <span class="k">for</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$u</span><span class="o">&gt;</span> <span class="p">{</span>
            <span class="k">fn</span> <span class="nf">from</span><span class="p">(</span><span class="n">other</span><span class="p">:</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$t</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nv">$VecN</span><span class="o">&lt;</span><span class="nv">$u</span><span class="o">&gt;</span> <span class="p">{</span>
                <span class="nv">$VecN</span> <span class="p">{</span>
                    <span class="nv">$</span><span class="p">(</span><span class="nv">$field</span><span class="p">:</span> <span class="n">other</span>.<span class="nv">$field</span> <span class="k">as</span> <span class="nv">$u</span><span class="p">,)</span><span class="o">+</span>
                <span class="p">}</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Some people may not require all of these types, so I have exposed the macros that create the constructors and arithmetic operation implementations for primitive types, and added a feature in the <code class="language-plaintext highlighter-rouge">Cargo.toml</code> so these can be disabled if not wanted.</p>

<h2 id="wrapping-it-all-up">Wrapping it all up</h2>

<p>During this process I discovered many small inconsistencies and gaps in my C++ library, so I took the opportunity to note them down and will revisit them when I get a chance.</p>

<p>There is still some more work to do to complete the project. I intend to use the library to create a graphical demo in Rust with my in-progress graphics library, showcasing the maths library’s features visually, much like my C++ library’s live <a href="https://www.polymonster.co.uk/pmtech/examples/maths_functions.html">demo</a>. That was going to take a while, so I decided to publish the project now and add the graphical demo later, so that I could write up this blog post while thoughts were fresh in my mind.</p>

<p>The final step was to publish on crates.io, and the process is very simple… I added metadata to the <code class="language-plaintext highlighter-rouge">Cargo.toml</code> file, added a readme with some examples, added document comments, fixed all <code class="language-plaintext highlighter-rouge">cargo clippy</code> warnings and finally hit publish. The culmination of roughly 10 weeks of consistent work. It felt satisfying.</p>