I haven't had much time for further development because of a new job, but I'll find a few minutes on the weekend to finally finish the queue execution system. The server is running well with one queue per stage and N threads per queue. I broke down and started using spin locks, since their performance under load is superior to a critical section (which is itself faster than a mutex, obviously).
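For anyone curious what a spin lock looks like in practice, here is a minimal sketch in portable C++. It is not the lock the server uses; the SpinLock class and the count_with_spin_lock helper are names made up for illustration, and the yield inside the wait loop is my own assumption to keep it well behaved when there are more threads than cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal test-and-set spin lock (illustrative only, not the server's lock).
class SpinLock {
public:
    void lock() noexcept {
        // Keep retrying until this thread is the one that flips the flag
        // from clear to set. Yielding between attempts is an assumption
        // that helps when threads outnumber cores.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }

    void unlock() noexcept {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical helper: several worker threads increment a shared counter,
// each increment guarded by the spin lock. Returns the final count.
long count_with_spin_lock(int threads, int increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;  // very short critical section
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

A spin lock like this only pays off when the protected section is very short, such as pushing to or popping from a queue. If a thread holds the lock for long, the waiters burn CPU for nothing.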
The big change will be removing application-specific objects and relying more on the AOSContext to manage data. This will allow executions to be even more parallel, which is essential for a very high request volume.
The other fun thing I have been working on is an indexed blob for query strings that will allow database queries against it without having to parse the name/value pairs. I have been looking into a simple database index structure to build it on; hopefully, once finished, it will allow fast lookup of name/value pairs in a database without extensive logic.
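To make the idea concrete, here is a rough sketch of one possible layout, not the finished design. The query string is parsed once, then stored as a small sorted header of offsets followed by the packed name and value bytes, so a lookup is a binary search over the header. The Entry layout and the build_indexed_blob and find_value functions are hypothetical names for illustration. The sketch assumes C++17, host byte order, no URL decoding, and that returning any one match is acceptable for duplicate names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical layout: [u32 count][count x Entry][packed name/value bytes].
// Offsets are relative to the start of the blob; entries are sorted by name.
struct Entry {
    std::uint32_t name_off, name_len, val_off, val_len;
};

std::string build_indexed_blob(std::string_view query) {
    // Split "a=1&b=2" into name/value pairs (no URL decoding here).
    std::vector<std::pair<std::string, std::string>> pairs;
    std::size_t pos = 0;
    while (pos < query.size()) {
        std::size_t amp = query.find('&', pos);
        if (amp == std::string_view::npos) amp = query.size();
        std::string_view item = query.substr(pos, amp - pos);
        if (!item.empty()) {
            std::size_t eq = item.find('=');
            if (eq == std::string_view::npos) {
                pairs.emplace_back(std::string(item), std::string());
            } else {
                pairs.emplace_back(std::string(item.substr(0, eq)),
                                   std::string(item.substr(eq + 1)));
            }
        }
        pos = amp + 1;
    }
    std::sort(pairs.begin(), pairs.end());  // by name, then by value

    const auto count = static_cast<std::uint32_t>(pairs.size());
    const std::size_t data_off = sizeof(count) + pairs.size() * sizeof(Entry);
    std::string header(data_off, '\0');
    std::string data;
    std::memcpy(&header[0], &count, sizeof(count));
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        Entry e;
        e.name_off = static_cast<std::uint32_t>(data_off + data.size());
        e.name_len = static_cast<std::uint32_t>(pairs[i].first.size());
        data += pairs[i].first;
        e.val_off = static_cast<std::uint32_t>(data_off + data.size());
        e.val_len = static_cast<std::uint32_t>(pairs[i].second.size());
        data += pairs[i].second;
        std::memcpy(&header[sizeof(count) + i * sizeof(Entry)], &e, sizeof(e));
    }
    return header + data;
}

// Binary search over the sorted header; the pairs are never re-parsed.
std::optional<std::string> find_value(const std::string& blob,
                                      std::string_view name) {
    std::uint32_t count = 0;
    if (blob.size() < sizeof(count)) return std::nullopt;
    std::memcpy(&count, blob.data(), sizeof(count));
    std::uint32_t lo = 0, hi = count;
    while (lo < hi) {
        const std::uint32_t mid = lo + (hi - lo) / 2;
        Entry e;
        std::memcpy(&e, blob.data() + sizeof(count) + mid * sizeof(Entry),
                    sizeof(e));
        assert(static_cast<std::size_t>(e.val_off) + e.val_len <= blob.size());
        const std::string_view key(blob.data() + e.name_off, e.name_len);
        const int cmp = key.compare(name);
        if (cmp == 0) return blob.substr(e.val_off, e.val_len);
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return std::nullopt;
}
```

With this layout, build_indexed_blob("b=2&a=1") produces a blob where find_value(blob, "a") touches only the header and the matching bytes. The blob itself can be stored in a single binary column, though whether the database can search inside it server-side depends on the engine.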