Optimization and profiling

Once you have a large game, with many large 3D models, you will probably start to wonder about the speed and memory usage.

1. Watch Frames Per Second

The main tool to measure your game speed is the Frames Per Second (FPS) number. Use the TCastleControlCustom.Fps or TCastleWindowCustom.Fps to get an instance of TFramesPerSecond. Look at these two numbers: TFramesPerSecond.RealFps and TFramesPerSecond.OnlyRenderFps.

1.1. The short version: what to watch

In short, watch the Window.Fps.ToString value, by displaying it anywhere. It intelligently combines some information to show you how fast your application is.

Alternatively, display the Window.Fps.RealFps value directly. In this case, it is easiest to have TCastleControlCustom.AutoRedisplay or TCastleWindowCustom.AutoRedisplay set to true, otherwise RealFps may not indicate the potential speed of your application. It is true by default, so you're already set.

1.2. How to show the FPS value

  • If you use TCastleWindow, you can trivially turn on TCastleWindow.FpsShowOnCaption.

  • You can display FPS using TCastleLabel. See the manual page about using our user-interface classes. Just update the TCastleLabel.Caption in every OnUpdate event to show the current FPS value. As an example, see how examples/portable_game_skeleton/game.pas shows the FPS (search for Container.Fps.ToString there).

    Or you can display FPS using TCastleFont.Print in every Render event. See the manual about custom drawing. As an example, see how examples/physics/physics_3d_demo/game.pas shows the FPS (search for Container.Fps.ToString there).

    Display the value like Format('%f', [Window.Fps.RealFps]). Or, even better (and simpler), use Window.Fps.ToString. The Window.Fps.ToString shows more information, nicely formatted.

  • You can show the FPS value on some Lazarus label or form caption.

    Warning: do not change the Lazarus control too often (like every frame). Updating normal Lazarus controls all the time may slow your OpenGL context drastically. Also, do not write to the console (e.g. using Writeln) or anything else (e.g. using our WritelnLog) every frame — the very fact of doing this will slow down your application a lot.

    If you need to change some Lazarus control, or write the FPS to some log, use a timer (like TCastleTimer or Lazarus TTimer) to write it e.g. only once per second. The RealFps and OnlyRenderFps are actually just an average from the last second, so there's really no need to show them more often.
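The TCastleLabel approach described above can be sketched as follows. This is a minimal sketch, assuming the CGE 6.x CastleWindow API (TCastleWindow, TUIContainer, an OnUpdate callback taking the container). The names ShowFps, FpsLabel and WindowUpdate are made up for this illustration and do not come from the engine examples.

```pascal
program ShowFps; // hypothetical program name
{$mode objfpc}{$H+}
uses CastleWindow, CastleControls, CastleUIControls;

var
  Window: TCastleWindow;
  FpsLabel: TCastleLabel; // illustrative name

{ Called every frame. Updating a TCastleLabel every frame is fine,
  unlike updating a Lazarus control or writing to the log. }
procedure WindowUpdate(Container: TUIContainer);
begin
  FpsLabel.Caption := 'FPS: ' + Container.Fps.ToString;
end;

begin
  Window := TCastleWindow.Create(Application);

  FpsLabel := TCastleLabel.Create(Application);
  Window.Controls.InsertFront(FpsLabel);

  Window.OnUpdate := @WindowUpdate;
  Window.OpenAndRun;
end.
```

If you log the value instead of drawing it, do it from a timer firing about once per second, for the reasons given in the warning above.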

1.3. How to interpret Frames Per Second values?

There are two FPS numbers measured: "real FPS" and "only render FPS". "Only render FPS" is usually slightly larger. Larger is better, of course: it means that you have smoother animation.

Use "real FPS" to measure your overall game speed. This is the actual number of frames per second that we managed to display.

Caveats:

  • Make sure to have an animation that constantly updates your screen, or use AutoRedisplay = true (it is the default since CGE 6.0, so you're probably already set).

    Otherwise, we may not refresh the screen continuously (there is no point in redrawing if both the scene and camera are completely static; this way we let other applications work more smoothly, and we save your laptop battery). Then "real FPS" will drop to almost zero. This can be detected by looking at Window.Fps.WasSleeping. The output of Window.Fps.ToString also accounts for it, showing "no frames rendered" or "no need to render all frames".

  • If you hope to see values higher than 100 (the default LimitFPS value), then turn off the "limit FPS" feature.

    • In games using TCastleWindow (if you use a standard program template, or manually call Window.ParseParameters) you can do it just by passing --no-limit-fps command-line option.
    • Or use the view3dscene "Preferences -> Frames Per Second" menu item to set the limit to zero.
    • Or change LimitFPS global variable (if you use CastleControl unit with Lazarus) or change Application.LimitFPS (if you use CastleWindow unit) to zero. Changing them to zero disables the "limit fps" feature.

    You will also need to turn off "vertical synchronization" of the GPU to achieve arbitrarily high FPS.

  • Note that the monitor will actually drop some frames above its refresh rate, like 60 Hz. (This is relevant only if "vertical synchronization" is off.)

    This may cause you to observe that above some threshold, FPS are "easier to gain" by optimizations, which may lead you to a false judgement about which optimizations are more useful than others. To make a good judgement about what is faster / slower, compare two versions of your program when only one thing changes.
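For the "limit FPS" caveat above, a minimal sketch of disabling the limit from code when using the CastleWindow unit. It uses only the Application.LimitFPS variable named in the text; the program name is hypothetical, and vertical synchronization still has to be turned off separately (in the GPU driver settings, for example).

```pascal
program NoFpsLimit; // hypothetical program name
{$mode objfpc}{$H+}
uses CastleWindow;

var
  Window: TCastleWindow;
begin
  // Zero disables the "limit FPS" feature (the default limit is 100).
  // Do this only for measurements: an unlimited game loop uses the CPU fully.
  Application.LimitFPS := 0;

  Window := TCastleWindow.Create(Application);
  Window.OpenAndRun;
end.
```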

"Only render FPS" measures how many frames we would get if we ignored the time spent outside Render events. It's useful to compare it with "real FPS": a large difference may indicate that you can make some optimizations in CPU code (e.g. collision detection or animations) to gain overall speed. Caveats:

  • Modern GPUs work in parallel to the CPU. So "how much time CPU spent in Render" doesn't necessarily relate to "how much time GPU spent on performing your drawing commands".

    For example: if you set LimitFPS to a small value (like 10), you may observe that "only render FPS" grows very high. Why? Because when the CPU is idle (which is often the case if LimitFPS is small), the GPU has free time to finish rendering the previous frame. So the GPU does the work for free, outside of Render time, while your CPU is just waiting. On the other hand, when the CPU works on producing new frames all the time, you have to wait inside Render until the previous frame finishes.

    In other words, improvements to "only render FPS" must be taken with a grain of salt. We spend less or more time in Render event: this does not always mean that we render more efficiently.

    Still, "only render FPS" is often a useful indicator.
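To make the difference between the two numbers concrete, here is a small arithmetic sketch. The timings are hypothetical, and the program only illustrates the definitions given above; it is not how the engine measures them internally.

```pascal
program FpsArithmetic; // hypothetical program name
{$mode objfpc}{$H+}
const
  FrameTimeMs = 20.0;  // hypothetical: whole frame (update, render, waiting)
  RenderTimeMs = 5.0;  // hypothetical: time spent inside Render only
begin
  // Frames actually displayed per second: 1000 / 20 = 50.0
  Writeln('real FPS:        ', (1000.0 / FrameTimeMs):0:1);
  // Frames per second if only Render time counted: 1000 / 5 = 200.0
  Writeln('only render FPS: ', (1000.0 / RenderTimeMs):0:1);
end.
```

With these numbers, three quarters of each frame is spent outside Render, which is the situation where looking at the CPU side (collisions, animations) or at the frame limit is worthwhile.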

If you see a large "only render FPS" value, much larger than "real FPS"...

It means that Render is quick. So we probably don't need to wait for the whole previous frame to finish when starting rendering a new frame. To some extent, that's good — you're probably doing useful work in the meantime on CPU, while GPU is working. Often it means that there is something to gain optimizing the CPU side, like collisions or animations.

No guarantees: It does not mean that you can actually achieve this number of FPS as "real FPS". At some point, decreasing CPU work will just uncover that we have to wait for GPU to finish anyway. In which case, you will observe "only render FPS" to drop (which is nothing alarming, it doesn't necessarily mean that rendering is less efficient; it just means that GPU speed becomes a factor too).

When "only render FPS" is almost equal to "real FPS"...

Then we spend most time in Render. This is normal if neither rendering nor collisions are a bottleneck — then we probably just spend time in the Render waiting for vertical synchronization to happen, and you can't really achieve more than 60 real FPS in the typical case with "vertical synchronization" turned on.

However, if your "real FPS" is much lower than your refresh rate, and your "only render FPS" is equal to "real FPS", then you probably can optimize the rendering. (Make smaller models, use less demanding shader effects etc.)

2. Making your games run fast

2.1. Basic rule: use small and static geometry, as much as possible

First of all, watch the number of vertexes and faces of the models you load. Use view3dscene menu item Help -> Scene Information for this.

Graphic effects dealing with dynamic and detailed lighting, like shadows or bump mapping, have a cost. So use them only if necessary. In case of static scenes, try to "bake" such lighting effects to regular textures (use e.g. Blender Bake functionality), instead of activating a costly runtime effect.

2.2. Compile in "release" mode for speed

Both Lazarus and our build tool support the idea of "build modes".

  • When you're in the middle of development and you're testing the game for bugs, use the debug mode, which adds a lot of run-time checks to your code. This gives you a clear and nice error when you e.g. access an invalid array index. If you use our build tool, just pass the --mode=debug command-line parameter to it.

    Our vectors are also like arrays, so doing stuff like MyVector[2] := 123.0; is also checked (it's valid if MyVector is a 3D or 4D vector, invalid if it's a 2D vector). Actually, this simple case is checked at compile-time with the new vector API in Castle Game Engine 6.3, but more convoluted cases are still checked at run-time.

  • When you need the maximum speed (when you want to build a "final" version for the player, or when you check / compare / profile the speed), always use the release mode.

    The code runs much faster in release mode. The speed difference may be really noticeable. As of Castle Game Engine 6.3, the ray-tracer is 1.9 times slower in debug mode than in release mode. The speed differences of a typical game are usually not that drastic (since you don't spend 100% of your time calculating math expressions, unlike a ray-tracer), but significant differences are still expected, especially if you measure the performance of a particular calculation (not just looking at game FPS).

    So in most cases it's really important that you measure the speed only of the release build of your game, and this is the version that you want to provide to your players.
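As an illustration of the kind of run-time check mentioned above, the following plain Free Pascal program turns on range checking with the {$R+} directive. It is a sketch that does not use the engine, and all names in it are made up for the example.

```pascal
program RangeCheckDemo; // hypothetical program name
{$mode objfpc}{$H+}
{$R+} // range checking: one of the run-time checks useful while debugging
uses SysUtils;

var
  Values: array [0..1] of Single;
  Index: Integer;
begin
  // Computed at run-time, so the compiler cannot reject it at compile-time.
  Index := ParamCount + 2;
  try
    Values[Index] := 123.0; // invalid: the only valid indexes are 0 and 1
    Writeln('No range check was performed');
  except
    on E: ERangeError do
      Writeln('Caught ', E.ClassName);
  end;
end.
```

Without range checking the invalid write silently corrupts memory, which is faster but much harder to debug. This is why the checks belong in the debug build and not in the build you measure or ship.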

2.3. Backface culling

If the player can see the geometry faces only from one side, then backface culling should be on. This is the default case (X3D nodes like IndexedFaceSet have their solid field equal to TRUE by default). It avoids useless drawing of the other side of the faces.

2.4. Textures

Optimize textures to increase the speed and lower GPU memory usage:

  • Use texture compression (makes GPU memory usage more efficient). You can do it very easily by using material properties and auto-compressing the textures using our build tool.
  • Scale down textures on low-end devices (desktops and mobiles). You can do it at loading, by using material properties and auto-downscaling the textures using our build tool, see TextureLoadingScale. Or you can do it at runtime, by GLTextureScale. Both of these approaches have their strengths, and can be combined.
  • Use texture atlases (try to reuse the whole X3D Appearance across many X3D shapes, if possible). This avoids texture switching when rendering, so the scene renders faster. When exporting from Spine, be sure to use atlases.
  • Use sprite sheets (TSprite class) instead of separate images (like TGLVideo2D class). This again avoids texture switching when rendering, making the scene render faster. It also lets you easily use any texture size (not necessarily a power of two) for the frame size, and still compress the whole sprite, so it cooperates well with texture compression.
  • Don't set too high TextureProperties.anisotropicDegree if not needed. anisotropicDegree should only be set to values > 1 when it makes a visual difference in your case.

2.5. Animations

There are some TCastleScene features that are usually turned on, but in some special cases may be avoided:

  • Do not enable ProcessEvents if the scene should remain static.
  • Do not add ssDynamicCollisions to Scene.Spatial if collisions against the scene bounding box are good enough for you.
  • Do not add ssRendering to Scene.Spatial if the scene is always small on the screen, and so it's usually either completely visible or invisible. ssRendering adds frustum culling per-shape.

Various techniques to optimize animations include:

  • If your model has animations but is often not visible (outside of view frustum), then consider using Scene.AnimateOnlyWhenVisible := true (see TCastleSceneCore.AnimateOnlyWhenVisible).

  • If the model is small, and not updating its animations every frame will not be noticeable, then consider setting Scene.AnimateSkipTicks to something larger than 0 (try 1 or 2). (See TCastleSceneCore.AnimateSkipTicks.)

  • For some games, turning globally OptimizeExtensiveTransformations := true improves the speed. This works best when you animate multiple Transform nodes within every X3D scene, and some of these animated Transform nodes are children of other animated Transform nodes. A typical example is a skeleton animation, for example from Spine, with non-trivial bone hierarchy, and with multiple bones changing position and rotation every frame.

  • Watch what you're changing in the X3D nodes. Most changes, in particular the ones that can be achieved by sending X3D events (these changes are kind of "suggested by the X3D standard" to be optimized), are fast. But some changes are very slow because they cause rebuilding of scene structures, e.g. reorganizing the X3D node hierarchy. So avoid doing them during the game. To detect this, set LogSceneChanges := true and watch the log (see manual chapter "Logging") for lines saying "ChangedAll". These are costly rebuilds, avoid them during the game!
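The settings from this section can be gathered in a small helper unit. This is only a sketch: the unit and procedure names are hypothetical, the values are examples to tune per game, and it assumes that the global OptimizeExtensiveTransformations is declared in the CastleSceneCore unit.

```pascal
unit GameSceneTuning; // hypothetical unit name
{$mode objfpc}{$H+}

interface

uses CastleScene, CastleSceneCore;

{ For scenes that never animate and only need bounding-box collisions. }
procedure ConfigureStaticScene(const Scene: TCastleScene);

{ For small animated scenes that are often outside the view frustum. }
procedure ConfigureAnimatedScene(const Scene: TCastleScene);

{ Global switch, worth testing with nested animated Transform nodes. }
procedure EnableTransformOptimization;

implementation

procedure ConfigureStaticScene(const Scene: TCastleScene);
begin
  Scene.ProcessEvents := false; // the scene stays static
  Scene.Spatial := [];          // no ssDynamicCollisions, no ssRendering
end;

procedure ConfigureAnimatedScene(const Scene: TCastleScene);
begin
  Scene.ProcessEvents := true;
  Scene.AnimateOnlyWhenVisible := true; // skip animating when not visible
  Scene.AnimateSkipTicks := 1;          // example value, try 1 or 2
end;

procedure EnableTransformOptimization;
begin
  // Assumption: this global lives in CastleSceneCore.
  OptimizeExtensiveTransformations := true;
end;

end.
```

Measure the release build before and after each change, since not every setting helps every game.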

2.6. Create complex shapes, not trivial ones

Modern GPUs can "consume" a huge number of vertexes very fast, as long as they are provided to them in a single "batch" or "draw call".

In our engine, the "shape" is the unit of information we provide to GPU. It is simply a VRML/X3D shape. In most cases, it also corresponds to the 3D object you design in your 3D modeler, e.g. Blender 3D object in simple cases is exported to a single VRML/X3D shape (although it may be split into a couple of shapes if you use different materials/textures on it, as VRML/X3D is a little more limited (and also more GPU friendly)).

The general advice is to compromise:

  1. Do not make too many too trivial shapes. Do not make millions of shapes with only a few vertexes — each shape will be provided in a separate VBO to OpenGL, which isn't very efficient.

  2. Do not make too few shapes. Each shape is passed as a whole to OpenGL (splitting shape on the fly would cause unacceptable slowdown), and shapes may be culled using frustum culling or occlusion queries. By using only a few very large shapes, you make this culling worthless.

A rule of thumb is to keep your number of shapes in a scene between 100 and 1000. But that's really just a rule of thumb, different level designs will definitely have different considerations.

You can also look at the number of triangles in your shape. Only a few triangles for a shape is not optimal — we will waste resources by creating a lot of VBOs, each with only a few triangles (the engine cannot yet combine the shapes automatically). Instead, merge your shapes — to have hundreds or thousands of triangles in a single shape.

2.7. Share TCastleScene instances if possible

  • To reduce memory usage, you can use the same TCastleScene instance many times within SceneManager.Items, usually wrapped in a different TCastleTransform. The whole code is ready for such "multiple uses" of a single scene instance.

    For an example of this approach, see the frogger3d game (in particular, its main unit game.pas). The game adds hundreds of 3D objects to SceneManager.Items, but there are only three TCastleScene instances (player, cylinder and level).

    However, this optimization is suitable only if all the visible scenes (that are actually a single TCastleScene instance) are always in the same animation frame (or maybe they are not animated at all). If you want to play different animations, you have to create separate TCastleScene instances (you can create them efficiently using the TCastleScene.Clone method).

  • In some cases, combining many TCastleScene instances into one helps. To do this, load your 3D models to TX3DRootNode using Load3D, and then create a new single TX3DRootNode instance that will have many other nodes as children. That is, create one new TX3DRootNode to keep them all, and for each scene add its TX3DRootNode (wrapped in TTransformNode) to that single TX3DRootNode.

    This allows you to load multiple 3D files into a single TCastleScene, which may make stuff faster — there will be only one octree (used for collision routines and frustum culling) for the whole scene. Right now, we have an octree inside each TCastleScene, so it's not optimal to have thousands of TCastleScene instances with collision detection.
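Both approaches from this section can be sketched as below. This is a rough sketch, assuming the CGE 6.4 API (TCastleTransform with Add and Translation, Vector3 from CastleVectors, Load3D from X3DLoad, the FdChildren field of grouping nodes). The unit name, the procedure names and the placement along the X axis are made up for the illustration.

```pascal
unit GameSceneSharing; // hypothetical unit name
{$mode objfpc}{$H+}

interface

uses Classes, CastleVectors, CastleTransform, CastleScene,
  CastleSceneManager, X3DNodes, X3DLoad;

{ Show one TCastleScene instance Count times, each in its own transform. }
procedure AddSharedInstances(const SceneManager: TCastleSceneManager;
  const SharedScene: TCastleScene; const Count: Integer);

{ Load many files into a single TCastleScene (and so a single octree). }
function LoadCombinedScene(const Owner: TComponent;
  const URLs: array of string): TCastleScene;

implementation

procedure AddSharedInstances(const SceneManager: TCastleSceneManager;
  const SharedScene: TCastleScene; const Count: Integer);
var
  I: Integer;
  Transform: TCastleTransform;
begin
  for I := 0 to Count - 1 do
  begin
    Transform := TCastleTransform.Create(SceneManager);
    Transform.Translation := Vector3(I * 2.0, 0, 0); // arbitrary spacing
    Transform.Add(SharedScene); // the same instance every time
    SceneManager.Items.Add(Transform);
  end;
end;

function LoadCombinedScene(const Owner: TComponent;
  const URLs: array of string): TCastleScene;
var
  CombinedRoot: TX3DRootNode;
  Wrapper: TTransformNode;
  I: Integer;
begin
  CombinedRoot := TX3DRootNode.Create;
  for I := 0 to High(URLs) do
  begin
    // Wrap each loaded model so it can be positioned independently.
    Wrapper := TTransformNode.Create;
    Wrapper.FdChildren.Add(Load3D(URLs[I]));
    CombinedRoot.FdChildren.Add(Wrapper);
  end;
  Result := TCastleScene.Create(Owner);
  Result.Load(CombinedRoot, true); // the scene owns and frees the node graph
end;

end.
```

Remember that shared instances always show the same animation frame, so the first procedure suits static or synchronized objects only.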

2.8. Collisions

If you include ssStaticCollisions or ssDynamicCollisions inside TCastleScene.Spatial, then we build a spatial structure (octree) that performs collisions with the actual triangles of your 3D model. This results in very precise collisions, but it can eat an unnecessary amount of memory (and, sometimes, take an unnecessary amount of time) if you have a high-poly mesh. Often, many shapes don't need to have such precise collisions (e.g. a complicated 3D tree may be approximated using a simple cylinder representing the tree trunk).

Use X3D Collision node to mark some shapes as non-collidable or to provide a simpler "proxy" shape to use for collisions. Right now, using the Collision requires writing X3D code manually, but it's really trivial. You can still export your scenes from 3D software, like Blender — you only need to manually write a "wrapper" X3D file around them.

You can also build a Collision node by code. We have a helper method for this: TCollisionNode.CollideAsBox.

Another possible octree optimization is to adjust the parameters that control how the octree is created. You can set octree parameters in a VRML/X3D file or by ObjectPascal code. In practice, though, the default values are usually really good.

2.9. Avoid loading (especially from disk!) during the game

Avoid any loading (from disk to normal memory, or from normal memory to GPU memory) once the game is running. Doing this during the game will inevitably cause a small stutter, which breaks the smoothness of the gameplay. Everything necessary should be loaded at the beginning, possibly while showing some "loading..." screen to the user. Use TCastleSceneManager.PrepareResources to load everything referenced by your scenes to GPU.

Enable some (or all) of these flags to get extensive information in the log about all the loading that is happening:

  • LogTextureLoading
  • LogAllLoading
  • TextureMemoryProfiler.Enabled
  • LogRenderer (from CastleRenderer unit)

Beware: This is usually a lot of information, so you probably don't want to see it always. Dumping this information to the log will often cause a tremendous slowdown during loading stage, so do not bother to measure your loading speed when any of these flags are turned on. Use these flags only to detect if something "fishy" is happening during the gameplay.

2.10. Consider using occlusion query

The engine by default performs frustum culling, using per-shape and per-scene bounding boxes and spheres. If you add ssRendering flag to the Scene.Spatial, this will be even faster thanks to using shapes octree.

Using the hardware occlusion query is often a good idea in large city or indoor levels, where walls or large buildings can obscure a significant part of your geometry. Activate it by simply turning on the flag UseOcclusionQuery, like Scene.Attributes.UseOcclusionQuery := true. Note that our simple implementation may sometimes show a lag of 1 frame, when the object is not rendered although it should be.

You can also define custom culling methods. See the examples/3d_rendering_processing/fog_culling.lpr.
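The two built-in culling options from this section can be switched on together. A minimal sketch, using only the properties named above; the unit and procedure names are hypothetical.

```pascal
unit GameCulling; // hypothetical unit name
{$mode objfpc}{$H+}

interface

uses CastleScene, CastleSceneCore;

{ Enable per-shape frustum culling through the shapes octree,
  plus the hardware occlusion query. }
procedure EnableCulling(const Scene: TCastleScene);

implementation

procedure EnableCulling(const Scene: TCastleScene);
begin
  // Keep the flags that are already set (e.g. collision octrees).
  Scene.Spatial := Scene.Spatial + [ssRendering];
  Scene.Attributes.UseOcclusionQuery := true;
end;

end.
```

Occlusion query pays off mostly when large occluders hide many shapes, so compare FPS with and without it on your actual levels.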

2.11. Blending

We use alpha blending to render partially transparent shapes. Blending is used automatically if you have a texture with a smooth alpha channel, or if your Material.transparency is less than 1.

Note: Just because your texture has some alpha channel, it doesn't mean that we use blending. By default, the engine analyses the alpha channel contents, to determine whether it indicates alpha blending (smooth alpha channel), alpha testing (all alpha values are either "0" or "1"), or maybe it's opaque (all alpha values equal "1"). You can always explicitly specify the texture alpha channel treatment using the alphaChannel field in X3D.

Rendering with blending is a little costly in the general case. The transparent shapes have to be sorted every frame. Hints to make it faster:

  • If possible, do not use many transparent shapes. This will keep the cost of sorting minimal.

  • If possible, turn off the sorting, using Scene.Attributes.BlendingSort := bsNone. See TBlendingSort for the explanation of possible BlendingSort values. Sorting is only necessary if you may see multiple partially-transparent shapes on the same screen pixel, otherwise sorting is a waste of time.

  • Sorting is also not necessary if you use some blending modes that make the order of rendering partially-transparent shapes irrelevant. For example, blending mode with srcFactor = "src_alpha" and destFactor = "one". You can use a blendMode field in X3D to set a specific blending mode. Of course, it will look different, but maybe acceptably?

    So, consider changing the blending mode and then turning off sorting.

  • Finally, consider whether you really need transparency by blending. Maybe you can work with transparency by alpha testing? Alpha testing means that every pixel is either opaque, or completely transparent, depending on the alpha value. It's much more efficient to use, as alpha tested shapes can be rendered along with the normal, completely opaque shapes, and only the GPU cares about the actual "testing". There's no need for sorting. Also, alpha testing cooperates nicely with shadow maps.

    Whether the alpha testing looks good depends on your use-case, on your textures.

    To use alpha-testing, you can:

    1. Either make the alpha channel of your texture non-smooth, that is: every pixel should have alpha value equal to 0 or 1, never something in between. For example, in GIMP, increase the contrast (to maximum) of the alpha channel mask.
    2. Or you can force using alpha testing by using alphaChannel "TEST" in X3D.

2.12. Loading PNG using libpng

By default, our engine uses FpImage to load various image formats, including PNG. This is comfortable, as it does not require any external libraries, and thus it instantly works (and in the same way) on all platforms. So you don't need to worry about using libpngXXX.dll on Windows, or linking with libpng on Android or iOS.

However, using external libpng is often much (even 4x) faster. That is because libpng can apply various transformations during file reading (instead of processing the pixels later), and it doesn't force us to read using a 16-bit-per-channel API (like FpImage does). So if you have a lot of PNG files, and want to speed up the loading process, consider switching to using external libpng.

To use external libpng library, just define -dCASTLE_PNG_DYNAMIC when compiling the engine. E.g. define it inside CastleEngineManifest.xml as <custom_options> and use our build tool to compile your game.

When testing or distributing the game, make sure that you have libpng and zlib available.

  • On Linux, FreeBSD, Mac OS X and other desktop Unix systems it's usually installed system-wide, so you don't need to worry.
  • On Windows, the build tool will make sure to include the appropriate DLLs when you call castle-engine package .... For testing, you can copy the appropriate DLLs to your game directory yourself, or copy them somewhere on $PATH. At the bottom of the getting started page we documented from where you can take these DLLs.
  • On Android and iOS, we will still use internal FpImage for now. (Modify castleconf.inc if you want to change it.)

3. Profile (measure speed and memory usage)

You can compile your application with the build tool using --mode=valgrind to get an executable ready to be tested with the magnificent Valgrind tool.

Instructions how to use Valgrind with Castle Game Engine applications are here.

In general, you can use any FPC tool to profile your code, for memory and speed. See also FPC wiki about profiling.

4. Measure memory use and watch out for memory leaks

4.1. Detect memory leaks with HeapTrc (-gh)

To detect memory leaks, we advise you to regularly compile your code with FPC options -gl -gh. There are many ways to do this, for example you can add this to your fpc.cfg file (see FPC documentation "Configuration file" to know where you can find your fpc.cfg file):

#IFDEF DEBUG
-gh
-gl
#ENDIF

Then all the programs compiled in debug mode (with castle-engine compile --mode=debug, or with an explicit FPC option -dDEBUG) will automatically check for memory leaks.

The end result is that at the program's exit, you will get a very useful report about the allocated and not freed memory blocks, with a stack trace to the allocation call. This makes it easy to detect and fix memory leaks.

If everything is OK, the output looks like this:

Heap dump by heaptrc unit
12161 memory blocks allocated : 2290438/2327696
12161 memory blocks freed     : 2290438/2327696
0 unfreed memory blocks : 0
True heap size : 1212416
True free heap : 1212416

But when you have a memory leak, it tells you about it, and tells you where the relevant memory was allocated, like this:

Heap dump by heaptrc unit
4150 memory blocks allocated : 1114698/1119344
4099 memory blocks freed     : 1105240/1109808
51 unfreed memory blocks : 9458
True heap size : 851968
True free heap : 834400
Should be : 835904
Call trace for block $00007F9B14E42980 size 44
  $0000000000402A83 line 162 of xxx.lpr
  ...
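To see such a report yourself, you can compile a deliberately leaky program with fpc -gh -gl. The following is a minimal sketch in plain Free Pascal, not tied to the engine. The program and variable names are made up, and the exact numbers and addresses in the report will differ from the ones shown above.

```pascal
program LeakDemo; // hypothetical program name
{$mode objfpc}{$H+}
uses Classes;

var
  Leaked: TStringList;
begin
  Leaked := TStringList.Create;
  Leaked.Add('never freed');
  // Deliberately missing: Leaked.Free;
  // When compiled with -gh -gl, the heaptrc report at exit should list
  // unfreed memory blocks with a call trace pointing at the Create line.
end.
```

Adding the missing Leaked.Free call and recompiling should bring the report back to 0 unfreed memory blocks.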

Note: when you exit with Halt, you will always have some memory leaks, that's unavoidable for now. You can ignore the "Heap dump by heaptrc unit" output in this case. Same thing if your program crashes with an unhandled exception.

Note: In the future, we may add -gl -gh automatically to the options added by the build tool in the debug mode. So programs compiled with castle-engine compile --mode=debug will automatically show this output.

4.2. Other tools

We do not have any engine-specific tool to measure memory usage or detect memory problems, as there are plenty of them available with FPC+Lazarus already. To simply see the memory usage, just use process monitor that comes with your OS. See also Lazarus units like LeakInfo.

You can use full-blown memory profilers like valgrind's massif with FPC code (see section "Profiling" above on this page about valgrind).