Optimization and profiling

Table of Contents

1. Introduction
2. Watch FPS (Frames Per Second)
3. Making your games run fast
4. Profile (measure speed and memory usage)
5. Measure memory use and watch out for memory leaks
- 5.1. Detect memory leaks with HeapTrc (-gh)
- 5.2. Other tools

1. Introduction

Once you have a large application, with many large 3D / 2D models, you will probably start to wonder about the speed and memory usage.

2. Watch FPS (Frames Per Second)

The main tool to measure your game speed is the Frames Per Second (FPS) value. Use the TCastleControl.Fps or TCastleWindow.Fps to get an instance of TFramesPerSecond. It contains two useful numbers (and some extra information): TFramesPerSecond.RealFps and TFramesPerSecond.OnlyRenderFps.

2.1. How to display the FPS value

You can display the Window.Fps.ToString value in any way you like.

If you use TCastleWindow, you can trivially turn on TCastleWindow.FpsShowOnCaption.
You can display FPS using TCastleLabel. See the manual page about using our user-interface classes. Just update the TCastleLabel.Caption in every OnUpdate event to show the current FPS value. As an example, create a new project using the CGE editor: all new project templates include an FPS counter.

Or you can display FPS using TCastleFont.Print in every Render event. See the manual about custom drawing.
You can show the FPS value on some LCL label or form caption (if you use LCL forms).

Warning: do not change the Lazarus control too often (like every frame). Updating normal Lazarus controls all the time may slow your OpenGL context drastically. Also, do not write to the console (e.g. using Writeln) every frame — the very fact of doing this will slow down your application a lot.

If you need to change some Lazarus control, or write the FPS to some log, use a timer (like TCastleTimer or Lazarus TTimer) to write it e.g. only once per second. The RealFps and OnlyRenderFps are actually just an average from the last second, so there’s really no need to show them more often.
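As a minimal sketch of the advice above, the fragment below updates a TCastleLabel every frame (cheap, as it is an engine UI control) but writes to the log only once per second. The names TViewMain, LabelFps and FLogTimer are assumptions for illustration (the first two mirror typical editor templates), and a manual time accumulator is used here instead of TCastleTimer. This is a fragment, not a complete unit.

```pascal
uses CastleUIControls, CastleControls, CastleLog;

type
  TViewMain = class(TCastleView)
  published
    LabelFps: TCastleLabel; // assumed to be set from the design file
  private
    FLogTimer: Single; // hypothetical helper: seconds since the last log line
  public
    procedure Update(const SecondsPassed: Single;
      var HandleInput: Boolean); override;
  end;

procedure TViewMain.Update(const SecondsPassed: Single;
  var HandleInput: Boolean);
begin
  inherited;

  { Updating an engine label every frame is fine. }
  LabelFps.Caption := 'FPS: ' + Container.Fps.ToString;

  { Log at most once per second: the FPS values are averages
    from the last second anyway. }
  FLogTimer := FLogTimer + SecondsPassed;
  if FLogTimer >= 1.0 then
  begin
    WritelnLog('FPS: ' + Container.Fps.ToString);
    FLogTimer := 0;
  end;
end;
```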

2.2. How to interpret the FPS value

There are two FPS numbers measured: "real FPS" and "only render FPS". "Only render FPS" is usually slightly larger. Larger is better, of course: it means that you have smoother animation.

Use "real FPS" to measure your overall game speed. This is the actual number of frames per second that we managed to display.

Caveats:

Make sure to have an animation that constantly updates your screen, or use AutoRedisplay = true (it is the default, so you’re probably already set).

Otherwise, we may not refresh the screen continuously (there is no point in redrawing if both the scene and camera are completely static; this way we let other applications work more smoothly, and we save your laptop battery). Then "real FPS" will drop to almost zero. This can be detected by looking at Window.Fps.WasSleeping. The output of Window.Fps.ToString also accounts for it, showing "no frames rendered" or "no need to render all frames".
If you hope to see higher values than 120 (the default LimitFPS value) then turn off "limit FPS" feature.
- In games using TCastleWindow (if you use a standard program template, or manually call Window.ParseParameters) you can do it just by passing --no-limit-fps command-line option.
- Or use Castle Model Viewer "Preferences → Frames Per Second" menu item to set them to zero.
- Or change ApplicationProperties.LimitFPS to zero. Changing it to zero disables the "limit fps" feature.
  
  You will also need to turn off "vertical synchronization" of the GPU to achieve arbitrarily high FPS.
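The ApplicationProperties.LimitFPS option above can be sketched as follows. Where exactly you place the assignment is an assumption here (any code that runs early in your application should do); turning off vertical synchronization is a separate step not shown.

```pascal
uses CastleApplicationProperties;

{ ... somewhere early, e.g. in your application initialization ... }
ApplicationProperties.LimitFPS := 0; // 0 disables the "limit FPS" feature
```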
Note that the monitor will actually drop some frames above its refresh rate, like 60 Hz. (This is relevant only if "vertical synchronization" is off.)

This may cause you to observe that above some threshold, FPS are "easier to gain" by optimizations, which may lead you to a false judgement about which optimizations are more useful than others. To make a good judgement about what is faster / slower, compare two versions of your program when only one thing changes.

"Only render FPS" measures how many frames we would get if we ignored the time spent outside Render events. It’s useful to compare it with "real FPS": a large difference may indicate that you can make some optimizations in CPU code (e.g. collision detection or animations) to gain overall speed. Caveats:

Modern GPUs work in parallel to the CPU. So "how much time CPU spent in Render" doesn’t necessarily relate to "how much time GPU spent on performing your drawing commands".

For example: if you set LimitFPS to a small value (like 10), you may observe that "only render FPS" grows very high. Why? Because when the CPU is idle (which is often the case if LimitFPS is small), the GPU has free time to finish rendering the previous frame. So the GPU does the work for free, outside of Render time, while your CPU is busy waiting. On the other hand, when the CPU works on producing new frames all the time, you have to wait inside Render until the previous frame finishes.

In other words, improvements to "only render FPS" must be taken with a grain of salt. We spend less or more time in Render event: this does not always mean that we render more efficiently.

Still, "only render FPS" is often a useful indicator.

If you see a large "only render FPS" value, much larger than "real FPS"…

It means that Render is quick. So we probably don’t need to wait for the whole previous frame to finish when starting rendering a new frame. To some extent, that’s good — you’re probably doing useful work in the meantime on CPU, while GPU is working. Often it means that there is something to gain optimizing the CPU side, like collisions or animations.

No guarantees: It does not mean that you can actually achieve this number of FPS as "real FPS". At some point, decreasing CPU work will just uncover that we have to wait for GPU to finish anyway. In which case, you will observe "only render FPS" to drop (which is nothing alarming, it doesn’t necessarily mean that rendering is less efficient; it just means that GPU speed becomes a factor too).

When "only render FPS" is almost equal to "real FPS"…

Then we spend most time in Render. This is normal if neither rendering nor collisions are a bottleneck — then we probably just spend time in the Render waiting for vertical synchronization to happen, and you can’t really achieve more than 60 real FPS in the typical case with "vertical synchronization" turned on.

However, if your "real FPS" is much lower than your refresh rate, and your "only render FPS" is equal to "real FPS", then you probably can optimize the rendering. (Make smaller models, use less demanding shader effects etc.)

2.3. Watch also viewport statistics

Another useful statistic to display is Viewport.Statistics.ToString. This shows how many scenes, and how many shapes, have been rendered in the last frame. It can be a useful guideline for when to activate some specific optimizations discussed below. E.g. a large number of displayed shapes may indicate that dynamic batching may be useful.

3. Making your games run fast

3.1. Basic rule: use small and static geometry, as much as possible

First of all, watch the number of vertexes and faces of the models you load. Use Castle Model Viewer menu item Help → Scene Information for this.

Graphic effects dealing with dynamic and detailed lighting, like shadows or bump mapping, have a cost. So use them only if necessary. In case of static scenes, try to "bake" such lighting effects to regular textures (use e.g. Blender Bake functionality), instead of activating a costly runtime effect.

3.2. Compile in "release" mode for speed

Our editor, build tool as well as Lazarus support the concept of "build modes".

When you’re in the middle of development and you’re testing the game for bugs, use the debug mode, which adds a lot of run-time checks to your code. This allows you to get a clear and nice error when you e.g. access an invalid array index. If you use our build tool, just pass the --mode=debug command-line parameter to it.

Our vectors are also like arrays, so doing stuff like MyVector[2] := 123.0; is also checked (it’s valid if MyVector is a 3D or 4D vector, invalid if it’s a 2D vector). Actually, this simple case is checked at compile-time with the vector API since Castle Game Engine 6.3, but more convoluted cases are still checked at run-time.
When you need the maximum speed (when you want to build a "final" version for the player, or when you check / compare / profile the speed), always use the release mode.

The code runs much faster in release mode. The speed difference may be really noticeable — though it depends on the exact application.
- For example, our "toy" software ray-tracer (CastleRayTracer unit) is 1.9 times slower in debug mode vs release mode.
- Animating skin on CPU using HAnim X3D nodes for one test model (Lucy from Seamless3D) was ~52 FPS in release mode, ~20 FPS in debug mode. This is a big visual difference: release mode feels completely smooth while debug mode is noticeably choppy.
The speed difference in a typical game is usually not that drastic, since a normal game doesn’t spend 100% of its time calculating math expressions on the CPU, unlike a software ray-tracer or skinned animation on the CPU. And we usually calculate skinned animation on the GPU when it uses the Skin node or comes from glTF.

But significant differences are still expected. Especially if you measure the performance of a particular calculation (not just looking at overall game FPS).

So in most cases it’s really important that you measure the speed only of the release build of your game, and this is the version that you want to provide to your players.

3.3. Avoid rendering things that are not going to be visible

A common theme in many optimizations is culling, in which we avoid passing certain geometry to GPU, because we can quickly determine that it’s not going to be visible.

3.3.1. Frustum culling (done by default)

The engine by default performs frustum culling, using per-shape and per-scene bounding boxes and spheres. TCastleScene.ShapeFrustumCulling and TCastleScene.SceneFrustumCulling control this.

The simplest advice is to keep them enabled, let the engine do its job :) It’s almost never useful to disable these checks, unless you have a very specific case where you just know the user is going to see something in all frames, and the test really consumes time (which in practice is never true, the test is trivial).

Moreover, enable TCastleSceneCore.PreciseCollisions to have per-shape frustum culling done using the shapes octree. This makes culling faster, although it takes additional time to build and update the octree. It’s usually a good idea for large 3D models like a game level.

3.3.2. Backface culling (just make sure your models enable it)

Often the viewer can see the geometry faces only from one side, when the mesh is watertight (see also shadow volumes that require 2-manifold objects).

In such a case, backface culling should be on. This is the default case (X3D nodes like IndexedFaceSet have their solid field equal to TRUE by default). It avoids useless drawing of the other side of the faces.

When exporting 3D models from authoring software like Blender, make sure that the appropriate checkbox saying "Backface Culling" is enabled.

3.3.3. Distance culling

Sometimes you can avoid rendering objects too far from the camera using TCastleScene.DistanceCulling. See the examples/viewport_and_scenes/fog_and_distance_culling for the simplest example. This is a natural optimization when you have fog and/or a large outdoor world.

3.3.4. Occlusion culling

Using the occlusion culling is often a good idea in large city or indoor levels, where walls or large buildings can obscure a significant part of your geometry. Activate it by TCastleViewport.OcclusionCulling, see occlusion culling docs for details.
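The culling-related switches from the subsections above can be combined as in this minimal sketch. It assumes Scene (a TCastleScene) and Viewport (a TCastleViewport) already exist; the distance value is a hypothetical number to tune for your world.

```pascal
{ Fragment; Scene: TCastleScene and Viewport: TCastleViewport are assumed to exist. }

{ Frustum culling is enabled by default, so nothing to do for it.
  PreciseCollisions additionally makes per-shape frustum culling use an octree. }
Scene.PreciseCollisions := true;

{ Do not render shapes further than this distance from the camera
  (100 is a hypothetical value). }
Scene.DistanceCulling := 100.0;

{ Skip shapes hidden behind other geometry, e.g. walls or buildings. }
Viewport.OcclusionCulling := true;
```

Measure FPS before and after each change separately, as not every scene benefits from each technique.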

3.3.5. LODs

Use LOD (level of detail). This is not strictly about eliminating invisible objects from rendering, but we mention it here, as it’s related: it allows you to replace complex objects with simpler objects, depending on the camera distance. See the examples/viewport_and_scenes/level_of_detail_demo demo for the simplest example of how to set up LODs (and optionally combine them with TCastleTransformReference).

3.4. Textures

Optimize textures to increase the speed and lower GPU memory usage:

Use texture compression (makes GPU memory usage more efficient). You can do it using material properties and auto-compressing the textures using our build tool.
Scale down textures on low-end devices (desktops and mobiles). You can do it at loading using material properties and auto-downscaling the textures using our build tool, see TextureLoadingScale. Or you can do it at runtime, by GLTextureScale. Both of these approaches have their strengths, and can be combined.
Use texture atlases (try to reuse the whole X3D Appearance across many X3D shapes, if possible). This avoids texture switching when rendering, so the scene renders faster. When exporting from Spine, be sure to use atlases.
Use sprite sheets instead of separate images (like TGLVideo2D class). This again avoids texture switching when rendering, making the scene render faster. It also allows to easily use any texture size (not necessarily a power of two) for the frame size, and still compress the whole sprite, so it cooperates well with texture compression.
Don’t set too high TextureProperties.anisotropicDegree if not needed. anisotropicDegree should only be set to values > 1 when it makes a visual difference in your case.

3.5. Animations

There are some TCastleScene features that are usually turned on, but in some special cases may be avoided:

Do not enable TCastleSceneCore.ProcessEvents if the scene remains static.
Do not enable TCastleSceneCore.PreciseCollisions if you don’t need precise collisions (treating scene as a mesh, except when skinned animation is used) and simpler collisions (treating scene as bounding box) are enough.

We have an example examples/animations/optimize_animations_test demonstrating a few possible animations optimizations discussed below. Read the README there.

Various techniques to optimize animations include:

If your model has animations but is often not visible (outside of view frustum), then consider using Scene.AnimateOnlyWhenVisible := true (see TCastleSceneCore.AnimateOnlyWhenVisible).
If the model is small, and not updating its animations every frame will not be noticeable, then consider setting Scene.AnimateSkipTicks to something larger than 0, like 1 or 2 (see TCastleSceneCore.AnimateSkipTicks).

Watch out what you’re changing in the X3D nodes. Most changes, in particular the ones that can be achieved by sending X3D events (these changes are kind of "suggested by the X3D standard" to be optimized), are fast. But some changes are very slow and cause rebuilding of scene structures, e.g. reorganizing the X3D node hierarchy. So avoid doing them during the game. How to detect if a long "ChangedAll" occurs:
- Set LogChanges := true and watch the log for lines saying ChangedAll.
- Set Profiler.Enabled := true and watch the log for profiler reports of long ChangedAll calls.
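The animation settings and diagnostics above can be sketched together as below. Scene is assumed to be an existing TCastleScene; the unit names in the uses clause are stated as assumptions (LogChanges is expected in CastleSceneCore, Profiler in CastleTimeUtils).

```pascal
uses CastleSceneCore, CastleTimeUtils;

{ Fragment; Scene: TCastleScene is assumed to exist. }

{ Do not process animations while the scene is outside the view frustum. }
Scene.AnimateOnlyWhenVisible := true;

{ Update animations only every 2nd frame (skip 1 tick in between). }
Scene.AnimateSkipTicks := 1;

{ Diagnostics: report costly ChangedAll calls in the log. }
LogChanges := true;
Profiler.Enabled := true;
```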

3.6. Shading, Lighting Model

PBR (Physically Based Rendering, Lighting Model) and Phong Shading look great and modern, but they have a cost.

PBR is default when using glTF, available also in X3D if you use TPhysicalMaterialNode. Phong Shading is default across the engine for all models.

Note: When reading this section, do not confuse Phong shading with Phong lighting model.

If you’re fine with a more "retro" lighting look, you can gain some speed by:

Using Phong lighting model instead of PBR.

To do this:
- If your models are in glTF format, load with TCastleSceneLoadOptions.GltfPhongMaterials set to true.
- If your models are in X3D format, use Material node (TMaterialNode) instead of PhysicalMaterial node (TPhysicalMaterialNode).
Using Gouraud shading instead of Phong shading.

To do this set MyScene.RenderOptions.PhongShading to false for all your scenes.

3.7. Light Sources Radius

When designing lights, limit their scope or radius. When creating lights in new Blender, select "Custom Distance" at light. This limits the shapes where the light has to be taken into account.

3.8. Create complex shapes, not trivial ones

Modern GPUs can "consume" a huge number of vertexes very fast, as long as they are provided to them in a single "batch" or "draw call".

In our engine, the "shape" is the unit of information we provide to GPU. It is simply an X3D shape. In most cases, it also corresponds to the 3D object you design in your 3D modeler, e.g. a Blender 3D object in simple cases is exported to a single X3D shape (although it may be split into a couple of shapes if you use different materials/textures on it, as X3D is a little more limited (and also more GPU friendly)).

The general advice is to compromise:

Do not make too many too trivial shapes. Do not make millions of shapes with only a few vertexes — each shape will be provided in a separate VBO to OpenGL, which isn’t very efficient.
Do not make too few shapes.

Each shape is passed as a whole to GPU (splitting shape on the fly would cause unacceptable slowdown), and shapes may be culled on CPU using various culling techniques listed on this page (frustum culling, occlusion culling and more). By using only a few very large shapes, you make these culling algorithms worthless.

A rule of thumb is to keep your number of shapes in a scene between 100 and 1000. But that’s really just a rule of thumb, different level designs will definitely have different considerations.

You can also look at the number of triangles in your shape. Only a few triangles for a shape is not optimal — we will waste resources by creating a lot of VBOs, each with only a few triangles (the engine cannot yet combine the shapes automatically). Instead, merge your shapes — to have hundreds or thousands of triangles in a single shape.

3.8.1. Try dynamic batching

If you have a large number of small shapes using the same shader, consider turning on DynamicBatching. This will internally detect and merge multiple shapes into one just before passing them to the GPU. In some cases, it is a very powerful optimization, reducing the number of draw calls.

Watch the Viewport.Statistics.ToString to see whether it reduces the number of rendered shapes.

It is particularly useful e.g. to optimize Spine rendering, as 2D animated models are often composed from a number of trivial textured quads that are transformed each frame. Dynamic batching can drastically reduce the number of draw calls in this case.
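A minimal sketch of trying dynamic batching and checking its effect. It assumes that DynamicBatching is exposed as a property of your TCastleViewport instance, and that LabelStats is a hypothetical TCastleLabel you added to show the statistics.

```pascal
{ Fragment; Viewport: TCastleViewport and LabelStats: TCastleLabel are assumed to exist. }

{ Once, e.g. when the view starts: }
Viewport.DynamicBatching := true;

{ Every frame, e.g. in the Update method of your view:
  compare the number of rendered shapes with batching on and off. }
LabelStats.Caption := Viewport.Statistics.ToString;
```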

3.9. Share TCastleScenes instances if possible

3.9.1. Reuse the same TCastleScene instance many times (e.g. by TCastleTransformReference)

To reduce memory usage, you can use the same TCastleScene instance many times within Viewport.Items.

One way to do this is just to add, from Pascal code, the same TCastleScene instance many times to Viewport.Items.

See the "Multiple instances of the same scene" section of the manual "Writing code to modify scenes and transformations" for an example.
Another way to ensure such sharing (that results in the same sharing underneath) is to use TCastleTransformReference. This approach can also be used at design-time, i.e. you can set up such sharing in the CGE editor.

Examples that use it include examples/terrain (for trees) and examples/viewport_and_scenes/shadows_distance_culling.

However, this optimization is suitable only if the scene should always be in the same animation frame (or not animated at all). If you want to play different animations, you have to create separate TCastleScene instances (you can create them efficiently using the TCastleScene.Clone method).
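As a sketch of the TCastleTransformReference approach from Pascal code, the procedure below loads one scene and shows it ten times. The procedure name, the asset URL and the spacing are hypothetical.

```pascal
uses Classes, CastleVectors, CastleScene, CastleTransform, CastleViewport;

{ Hypothetical helper: show the same tree scene 10 times, loading it only once. }
procedure AddTrees(const Viewport: TCastleViewport; const AOwner: TComponent);
var
  Tree: TCastleScene;
  Ref: TCastleTransformReference;
  I: Integer;
begin
  Tree := TCastleScene.Create(AOwner);
  Tree.Load('castle-data:/tree.gltf'); // hypothetical asset

  for I := 0 to 9 do
  begin
    Ref := TCastleTransformReference.Create(AOwner);
    Ref.Reference := Tree; // all references share the one loaded scene
    Ref.Translation := Vector3(I * 5, 0, 0);
    Viewport.Items.Add(Ref);
  end;
end;
```

If some instances need to play their own animation, create independent copies for them with Tree.Clone(AOwner) instead of adding another reference.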

3.9.2. Maybe combine many small models into one TCastleScene instance

In some cases, combining many TCastleScene instances into one helps. To do this, load your 3D models to TX3DRootNode using LoadNode, and then create a new single TX3DRootNode instance that will have many other nodes as children. That is, create one new TX3DRootNode to keep them all, and for each scene add its TX3DRootNode (wrapped in TTransformNode) to that single TX3DRootNode.
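The steps above can be sketched as follows. The function name, the model URLs and the translations are hypothetical; the sketch only illustrates wrapping each loaded root in a TTransformNode and collecting them under one new root.

```pascal
uses Classes, CastleVectors, X3DNodes, X3DLoad, CastleScene;

{ Hypothetical helper: load several model files into a single TCastleScene. }
function LoadCombined(const AOwner: TComponent): TCastleScene;
const
  Urls: array [0..1] of String = (
    'castle-data:/house.gltf', // hypothetical assets
    'castle-data:/fence.gltf'
  );
var
  Root, ModelRoot: TX3DRootNode;
  Transform: TTransformNode;
  I: Integer;
begin
  Root := TX3DRootNode.Create;

  for I := Low(Urls) to High(Urls) do
  begin
    ModelRoot := LoadNode(Urls[I]);

    { Wrap each model in a transformation, to position it independently. }
    Transform := TTransformNode.Create;
    Transform.Translation := Vector3(I * 10, 0, 0);
    Transform.AddChildren(ModelRoot);

    Root.AddChildren(Transform);
  end;

  Result := TCastleScene.Create(AOwner);
  Result.Load(Root, true); // true: the scene owns (and frees) the root node
end;
```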

This allows you to load multiple 3D files into a single TCastleScene, which may make stuff faster — there will be only one octree (used for collision routines and frustum culling) for the whole scene. Right now, we have an octree inside each TCastleScene, so it’s not optimal to have thousands of TCastleScene instances with collision detection.

See the manual page Transformation hierarchy for a detailed discussion of this, and when it may be a good idea to merge scenes.

Note that we do not advise using this optimization too hastily. It sometimes makes sense, but usually having one TCastleScene for each one model (that is, not combining them) is better:

It makes code simpler. You trivially load each model by TCastleScene.Load. You don’t need to deal or understand anything about X3D nodes.
It allows to run animations in the most intuitive way: on each model, you can call TCastleScene.PlayAnimation.
The physics engine right now treats an entire TCastleTransform (like TCastleScene) as a single rigid body. You cannot combine two scenes, if you want them to be independent rigid bodies for the physics engine.

Various things discussed here are planned to be improved in the engine, to avoid leaving you with such a difficult decision. On one hand, we plan to merge the TCastleTransform and TTransformNode hierarchies, making the gain from merging scenes irrelevant. On the other hand, we plan to allow physics to treat specific shapes as rigid bodies, making it possible to apply physics on smaller units than "entire TCastleScene".

3.10. Collisions

If you enable TCastleSceneCore.PreciseCollisions, then we build a spatial structure (octree) that performs collisions with the actual triangles of your 3D model. This results in very precise collisions, but it can eat an unnecessary amount of memory (and, sometimes, take unnecessary amount of time) if you have a high-poly mesh. Often, many shapes don’t need to have such precise collisions (e.g. a complicated 3D tree may be approximated using a simple cylinder representing tree trunk).

Note that collisions with skinned animated objects automatically use their bounding box, to avoid rebuilding octrees every frame.

If you want to keep using TCastleSceneCore.PreciseCollisions but exclude a particular subset of the scene from colliding, you can use TCollisionNode, which is an X3D node. This allows you to mark some shapes as non-collidable or to provide a simpler "proxy" shape to use for collisions. Using the Collision node requires writing X3D code manually, but it’s really simple. You can still export your scenes from 3D software, like Blender; you only need to manually write a "wrapper" X3D file around them.

An example X3D file showing this technique: tree from "Wyrd Forest" game.
More examples are in vrml_2/collisions_final.wrl demo inside our demo models.

You can also build a Collision node by code. We have a helper method for this: TCollisionNode.CollideAsBox.

3.11. Avoid loading (especially from disk!) during the game

Avoid any loading (from disk to normal memory, or from normal memory to GPU memory) once the game is running. Doing this during the game will inevitably cause a small stutter, which breaks the smoothness of the gameplay. Everything necessary should be loaded at the beginning, possibly while showing some "loading…" screen to the user.

3.11.1. Prepare resources

Use TCastleViewport.PrepareResources to load everything referenced by your scenes to GPU. Be sure to pass all the TCastleScene instances to TCastleViewport.PrepareResources in the "loading" stage.
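A minimal sketch of calling this in your "loading" stage. It assumes the TCastleViewport.PrepareResources overload that accepts a single TCastleTransform (like a scene); the helper procedure itself is hypothetical.

```pascal
uses CastleScene, CastleViewport;

{ Hypothetical helper: prepare all given scenes while a "loading" screen is shown. }
procedure PrepareAll(const Viewport: TCastleViewport;
  const Scenes: array of TCastleScene);
var
  Scene: TCastleScene;
begin
  for Scene in Scenes do
    Viewport.PrepareResources(Scene); // load resources referenced by the scene to GPU
end;
```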

3.11.2. Log loading

Enable some (or all) of these flags to get extensive information in the log about all the loading that is happening:

LogTextureLoading
LogAllLoading
TextureMemoryProfiler.Enabled
TSoundEngine.LogSoundLoading
TCastleView.Log
Also enabling Profiler.Enabled and doing WritelnLog(Profiler.Summary) is a great way to be informed about most loading.

Beware: Some of these flags (in particular LogAllLoading) can produce a lot of information, and you probably don’t want to see it always. Dumping this information to the log may even cause a noticeable slowdown during loading stage, so do not bother to measure your loading speed when any of these flags are turned on and you see they produce a lot of output.

You can also use TCastleProfiler to easily get information about what was loaded, and what took most time to load.

3.12. Blending

We use alpha blending to render partially transparent shapes. Blending is used automatically if you have a texture with a smooth alpha channel, or if your Material.transparency is greater than 0 (in X3D, transparency equal to 0 means fully opaque and 1 means fully transparent).

Note: Just because your texture has some alpha channel, it doesn’t mean that we use blending. By default, the engine analyses the alpha channel contents, to determine whether it indicates alpha blending (smooth alpha channel), alpha testing (all alpha values are either "0" or "1"), or maybe it’s opaque (all alpha values equal "1"). You can always explicitly specify the texture alpha channel treatment using the alphaChannel field in X3D. You can also explicitly specify the alpha mode for a given shape, see alpha blending for details.

Rendering blending is a little costly, in a general case. The transparent shapes have to be sorted every frame. Hints to make it faster:

If possible, do not use many transparent shapes. This will keep the cost of sorting minimal.
If possible, turn off the sorting, setting TCastleViewport.BlendingSort to sortNone.

Sorting is only necessary if you may see multiple partially-transparent shapes on the same screen pixel.
You can make sorting unnecessary by using blending modes that make the order of rendering partially-transparent shapes irrelevant. For example, a blending mode with srcFactor = "src_alpha" and destFactor = "one". You can use a blendMode field in X3D to set a specific blending mode. Of course, it will look different, but maybe acceptably so?

So, consider changing the blending mode and then turning off sorting.
Finally, consider whether you really need transparency by blending. Maybe you can work with transparency by alpha testing? Alpha testing means that every pixel is either opaque, or completely transparent, depending on the alpha value. It’s much more efficient to use, as alpha tested shapes can be rendered along with the normal, completely opaque shapes, and only the GPU cares about the actual "testing". There’s no need for sorting. Also, alpha testing cooperates nicely with shadow maps.

Whether the alpha testing looks good depends on your use-case, on your textures.

To use alpha-testing, you can:
1. Either make the alpha channel of your texture non-smooth, that is: every pixel should have alpha value equal to 0 or 1, never something in between. For example, in GIMP, increase the contrast (to maximum) of the alpha channel mask.
2. Or you can force using alpha testing by using alphaChannel "TEST" in X3D.

3.13. Loading PNG using libpng

Castle Game Engine can use Libpng (faster, but requires external library) or FpImage (always possible, on all platforms) to load PNG.

FpImage does not require any external libraries, and thus it instantly works (and in the same way) on all platforms. However, external Libpng is often much (even 4x) faster. That is because Libpng allows making various transformations during file reading (instead of processing the pixels later), and it doesn’t force us to read using a 16-bit-per-channel API (like FpImage does).

We will automatically use Libpng if detected (and fall back on FpImage otherwise).

On Linux, FreeBSD, macOS and other desktop Unix systems it’s usually installed system-wide, so you don’t need to worry.
On Windows, make sure to distribute Libpng alongside your exe. Our build tool takes care of this for you: it will copy appropriate DLL files when you do castle-engine compile … or castle-engine package ….

3.14. User interface and 2D drawing

Turn on TCastleUserInterface.Culling to optimize the case when a resource-intensive control is often off-screen (and thus doesn’t need to be rendered or process other events). This also matters if the control is outside of the parent scrollable view (TCastleScrollView) or other parent with TCastleUserInterface.ClipChildren. This is very useful when creating a large number of children inside TCastleScrollView.

When rendering 2D stuff yourself using TDrawableImage, you can often make a dramatic speedup by using the overload that draws multiple images (maybe different, maybe the same image parts) by a single procedure TDrawableImage.Draw(ScreenRects, ImageRects: PFloatRectangleArray; const Count: Integer); call.

Try turning on TCastleContainer.UserInterfaceBatching. It reduces the number of draw calls needed in some cases to draw UI, which may provide a speedup. See examples/user_interface/ui_batching for an example of how to use it and measure it.

3.15. Last resort: consider switching to old rendering pipeline for really old machines

Our TGLFeatures.RequestCapabilities allows forcing rendering using an ancient fixed-function pipeline. This is a rather "nuclear" option: you give up many benefits of modern rendering (PBR, Phong shading) and instead get something that is faster on many old GPUs that have been optimized for the fixed-function pipeline.

Note that many applications will look different (and worse) because of the unavoidable difference. If you use glTF with PBR, then switching to fixed-function will disable PBR and force older Phong lighting model with Gouraud shading.

To try this, set

TGLFeatures.RequestCapabilities := rcForceFixedFunction;

before creating the window (e.g. in the initialization section of GameInitialize unit in your application).

Users can also run any application with command-line option --capabilities=force-fixed-function.

4. Profile (measure speed and memory usage)

4.1. Use TCastleProfiler to measure time of tasks and their sub-tasks

Use TCastleProfiler (through the singleton Profiler, in CastleTimeUtils unit) to easily profile the speed of various tasks. You can measure the speed of your own code, or just enable the profiler to measure the speed of various engine loading operations. The profiler automatically builds and sorts a tree of "which sub-tasks contribute to the time of each task", so you can investigate "what took most time in something else", e.g. loading which 3D model took the most time when loading a game level.

Usage:

Enable it by calling
```
Profiler.Enabled := true;
```
Usually you want to do this as early as possible, e.g. from the initialization section of your main unit like GameInitialize.
Surround the code you want to measure with Profiler.Start and Profiler.Stop calls. Like this:
```
procedure TMyClass.LoadSomething;
var
  TimeStart: TCastleProfilerTime;
begin
  TimeStart := Profiler.Start('Loading something (in TMyClass)');
  try
    // do the time-consuming loading now...
  finally
    Profiler.Stop(TimeStart);
  end;
end;
```
The engine automatically measures the speed of various loading operations, like loading images, sounds, 3D models. So you don’t actually need to add any Profiler.Start / Profiler.Stop calls if you just want to measure the speed of loading assets.
You want to output the profiling information at some point.
- For a simple output of everything captured so far, just use Profiler.Summary anytime you want. For example write it to the log file when some button is pressed:
  
  WritelnLog(Profiler.Summary);
- If you measure some specific task, you can output only this task (including sub-tasks that happened within) by passing additional argument to the Profiler.Stop. Like this:
  
  Profiler.Stop(TimeStart, true);
- If you want to measure the time of view starting, just set TCastleView.Log to true early (e.g. in initialization of GameInitialize unit) like this:
  
  TCastleView.Log := true;

Expect an output like this:

-------------------- TCastleApplication Initialization begin
2.87 [2.87] TCastleApplication Initialization
> 2.87 [2.87] - TCastleApplication.OnInitialize
> > 1.91 [1.91] - Loading "castle-data:/level/level.gltf" (TCastleSceneCore)
> > > 1.44 [1.44] - ChangedAll for Scene1 from castle-data:/level/level.gltf
> > > > 0.35 [0.35] - Creating octree for shape Circle.001/Circle.001_2/Circle.001_Primitive0/IndexedTriangleSet
> > > > 0.21 [0.21] - Creating octree for shape Plane.003/Plane.003_2/Plane.003_Primitive0/IndexedTriangleSet
> > > > 0.12 [0.12] - Creating octree for shape Stairs_2/Circle/Circle_Primitive0/IndexedTriangleSet
> > > > 0.04 [0.04] - Creating octree for shape Cube.052/Cube.069/Cube.069_Primitive0/IndexedTriangleSet
....
-------------------- TCastleApplication Initialization end

4.2. Use TCastleFrameProfiler (just press F8!) to measure what consumes your time

You can activate the inspector by pressing F8 to view the "frame profiler" easily.

We have TCastleFrameProfiler to profile the time spent in a particular frame (from one OnUpdate start to another). Use this to track short tasks that occur within a frame. The engine automatically tracks some operations there (just set FrameProfiler.Enabled := true and look in the log for results), and you can also track other operations (specific to your game). An example output looks like this:

-------------------- FrameProfiler begin
Frame time: 0.02 secs (we should have 51.22 FPS based on this):
- BeforeRender: 0%
- Render: 88% (0.02 secs, we should have 58.34 "only render FPS" based on this)
  - TCastleTransform.Render transformation: 0%
  - TCastleScene.Render: 47%
    - ShapesFilterBlending: 1%
- Update: 12%
  - TCastleSceneCore.Update: 3%
- Other:
-------------------- FrameProfiler end

This example output shows that:

The majority of the work (88%) is spent doing rendering.
One conclusion is that optimizing animations (in TCastleSceneCore.Update) will not gain you much, as they only take 3% of time.
If you would like to optimize, in this particular example you should think:
1. can I optimize rendering (TCastleScene.Render),
2. what else eats time in Render (there’s a large difference between Render and TCastleScene.Render, so what is consuming the 41%?).

4.3. Use Valgrind, incredibly powerful profiler on Linux

You can compile your application with the build tool using --mode=valgrind to get an executable ready to be tested with the magnificent Valgrind tool. Read instructions how to use Valgrind with Castle Game Engine applications.

4.4. Use profiler on Nintendo Switch

On Nintendo Switch, another profiler is available. More information is available in the Nintendo Switch-specific documentation of CGE (only for registered developers on Nintendo).

4.5. Use any other profiler for FPC

In general, you can use any FPC tool to profile your code, for memory and speed. See also FPC wiki about profiling.

5. Measure memory use and watch out for memory leaks

5.1. Detect memory leaks with HeapTrc (-gh)

We strongly advise to detect memory leaks automatically using HeapTrc (FPC) or ReportMemoryLeaksOnShutdown (Delphi). The details how to do it are here.

5.2. Other tools

We do not have any engine-specific tool to measure memory usage or detect memory problems, as there are plenty of them available with FPC+Lazarus already. To simply see the memory usage, just use process monitor that comes with your OS. See also Lazarus units like LeakInfo.

You can use full-blown memory profilers like valgrind’s massif with FPC code (see section "Profiling" above on this page about valgrind).
