CPU profiling guide

Introduction

Unity is one of the most widely used 3D graphics engines for game development. This is a general guide to CPU profiling for Unity, covering which tools are useful for profiling and how to use them to find hotspots in your code.

Key factors for CPU game profiling

1. Low performance impact

Profiling must have minimal overhead. The accuracy of performance data determines the focus of your optimization efforts, so high profiling overhead creates too much observer effect in the data.

2. GPU queue

The GPU queue and its usage are helpful in determining if a game is CPU-bound. They can also help to determine the upper bounds on how much we can expect to improve through CPU optimization alone.

3. CPU call stack

The CPU call stack is essential to determine what the CPU is doing when the GPU queue is not full. This is key to finding the hotspot functions which are blocking the GPU.

4. Hardware metrics

CPU hardware metrics are useful for finding the hottest part of a hotspot function and its root cause from a hardware perspective.

Profiling tools

Unity Profiler

The Unity Profiler is a powerful tool. It is useful when developing games, but it requires a development build. This type of build can introduce too much overhead in the app to obtain accurate performance data for the final game. For this reason, the Unity Profiler may not be suitable for precise CPU performance profiling of release builds.

Event Tracing and AMD μProf

Both Event Tracing and AMD μProf have relatively low overhead and work with release builds. Event Tracing can capture the GPU queue and CPU call stacks, and μProf can capture hardware metrics, so the next question is whether they are suitable for Unity.

Sample

Unity is different from many engines as it uses C# and not C++ as the primary programming language for development. Traditional profiling tools like Event Tracing and μProf may not suit Unity.

The Unity ArmRobot demo is a simple sample we can use to show if CPU profiling with Unity can meet the key factors mentioned in the prior section.

1. Mono build

First, build the release version of the ArticulationRobot scene with the default configuration, which is Mono, to see if it can meet the key factors.

GPU queue

Capture a trace with Event Tracing and open it in GPUView. The GPU queue is visible, and GPU usage is less than 6%. Expand the ArmRobot.exe process: the “Device Context” queue is full, which shows that VSync is on, so Mono does not affect the GPU queue.

[Figure: VSync on]

CPU call stack

  • Open the same Event Tracing files with Windows Performance Analyzer.
  • Add “https://symbolserver.unity3d.com” into the symbol paths to resolve names in UnityPlayer.dll.
  • Add the path of GameAssembly.pdb into the symbol paths to resolve names in GameAssembly.dll.
  • Expand the call stack in the “CPU Usage (Sampled)” tab. The function names in UnityPlayer.dll resolve correctly, but “?!?” is shown under the Mono JIT runtime, which is invoked by BehaviourManager.

[Figure: Mono stack]

As we can see, Mono blocks the CPU call stack as Event Tracing is not able to capture Mono IL call stacks. Even CLR tracing doesn’t work as it only works with CoreCLR, and not Mono.

Hardware metrics

As the CPU call stacks of Mono functions are inaccessible, the hardware metrics also do not help much. This is because the hotspot functions are unknown, except for those in UnityPlayer.dll.

2. IL2CPP build without VSync

Because Mono blocks the CPU call stack, the only remaining choice is IL2CPP. The IL2CPP backend converts IL code into C++ code, which may produce readable CPU call stacks. VSync is also turned off for this build.

GPU queue

Capture with Event Tracing again and open it in GPUView. Now the GPU queue is full, and the GPU usage is more than 92%.

[Figure: VSync off]
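The measured GPU usage can be turned into a rough ceiling on what CPU optimization alone could still gain. The sketch below is only an illustration, not part of the original workflow. It assumes a simple model in which the GPU does the same amount of work per frame, so a frame cannot become shorter than the time the GPU is already busy. The function name is hypothetical.

```python
def max_speedup_from_cpu_work(gpu_busy_fraction: float) -> float:
    """Rough upper bound on the frame-rate gain from CPU optimization alone.

    Assumed model (not something GPUView reports directly): GPU work per
    frame is constant, so frame time can shrink at most to the GPU-busy time.
    """
    if not 0.0 < gpu_busy_fraction <= 1.0:
        raise ValueError("gpu_busy_fraction must be in (0, 1]")
    return 1.0 / gpu_busy_fraction


# GPU usage from the capture without VSync in this guide (more than 92%).
print(f"at most {max_speedup_from_cpu_work(0.92):.2f}x")
```

Under this model, a GPU that is already 92% busy leaves room for roughly a 1.09x gain from CPU work, so further improvement would have to come from the GPU side.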

CPU call stack

  • Open the same Event Tracing files with Windows Performance Analyzer again.
  • Add “https://symbolserver.unity3d.com” into the symbol paths to resolve names in UnityPlayer.dll.
  • Add the path of GameAssembly.pdb into the symbol paths to resolve names in GameAssembly.dll.
  • Expand the call stack in the “CPU Usage (Sampled)” tab. Readable function names are now shown under the IL2CPP VM, which is invoked by BehaviourManager.
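The two symbol locations from the steps above can also be combined into a single symbol path string, for example for the _NT_SYMBOL_PATH environment variable read by Windows debugging tools. This is a minimal sketch: only the Unity symbol server URL comes from this guide, while the cache folder, build folder, and function name are hypothetical.

```python
from typing import Sequence


def build_symbol_path(cache_dir: str, symbol_server: str,
                      pdb_dirs: Sequence[str]) -> str:
    """Compose a Windows symbol path from a symbol server and local PDB folders."""
    # "srv*<cache>*<server>" downloads symbols from <server> into <cache>.
    entries = [f"srv*{cache_dir}*{symbol_server}"]
    # Plain folders are searched directly for matching PDB files.
    entries.extend(pdb_dirs)
    return ";".join(entries)


symbol_path = build_symbol_path(
    r"C:\SymCache",                      # hypothetical local cache folder
    "https://symbolserver.unity3d.com",  # Unity symbol server from this guide
    [r"C:\Builds\ArmRobot"],             # hypothetical folder with GameAssembly.pdb
)
print(symbol_path)
```

Keeping a local cache folder in the path means the UnityPlayer PDB is downloaded once and can be reused by other tools later.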

[Figure: il2cpp stack]

IL2CPP-generated names are formatted as “{component name}_{m or t}{unique number}”, which helps to find the corresponding function in the C# scripts.

For example:

  • Vector3_ToString_m2315 means Vector3::ToString(). “_m” means it is a method, and 2315 is the unique number that prevents naming conflicts.
  • ObjectU5BU5D_t4 is the name of the type System.Object[], “_t” means type and 4 is the unique number.
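When a trace contains many such names, a small helper can map them back to C# types and methods. The sketch below is an assumption built only from the two examples above: a trailing “_m” or “_t” followed by digits, and “U5B”/“U5D” read as the hex character codes for “[” and “]”. Names generated by other IL2CPP versions may not follow this pattern, and the function name is hypothetical.

```python
import re

# Trailing "_m<digits>" (method) or "_t<digits>" (type), as in the examples above.
NAME_RE = re.compile(r"^(?P<name>.+)_(?P<kind>[mt])(?P<uid>\d+)$")
# "U" plus two hex digits, assumed to encode a non-identifier character.
ESCAPE_RE = re.compile(r"U([0-9A-F]{2})")


def _decode_escape(match):
    char = chr(int(match.group(1), 16))
    # Leave the text alone if it would decode to an ordinary identifier character.
    return match.group(0) if (char.isalnum() or char == "_") else char


def parse_il2cpp_name(symbol):
    """Split an IL2CPP-generated name into kind, type, member, and unique number."""
    match = NAME_RE.match(symbol)
    if match is None:
        return None
    name = ESCAPE_RE.sub(_decode_escape, match.group("name"))
    kind = "method" if match.group("kind") == "m" else "type"
    uid = int(match.group("uid"))
    if kind == "method" and "_" in name:
        # The last underscore separates the owning type from the method name.
        owner, member = name.rsplit("_", 1)
        return {"kind": kind, "type": owner, "member": member, "id": uid}
    return {"kind": kind, "type": name, "member": None, "id": uid}


print(parse_il2cpp_name("Vector3_ToString_m2315"))
print(parse_il2cpp_name("ObjectU5BU5D_t4"))
```

Splitting on the last underscore is a simplification: a method whose own name contains an underscore would be split in the wrong place.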

So, the hotspots of BehaviourManager are ArticulationHandManualInput::Update() and RobotManualInput::Update() methods. This makes sense as this demo simply articulates a robot arm according to the user’s inputs.

Hardware metrics

If AMD μProf is not able to handle the Unity symbol server address, try the following workaround:

  • Copy the UnityPlayer_Win64_il2cpp_x64.pdb file that Windows Performance Analyzer previously downloaded from the Unity symbol server into the folder where UnityPlayer.dll is located.
  • Copy GameAssembly.pdb into the same folder as GameAssembly.dll.
  • Launch the game, attach to the ArmRobot.exe process, and capture the data.
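The two copy steps above can be scripted so they are easy to repeat after each build. This is a minimal sketch using only the Python standard library. The function name is hypothetical, and the paths in the usage comment are placeholders that need to be replaced with the real cache and build locations.

```python
import shutil
from pathlib import Path


def copy_pdb_next_to_dll(pdb_file, dll_file):
    """Copy a PDB into the folder that contains the matching DLL."""
    pdb_file, dll_file = Path(pdb_file), Path(dll_file)
    if not pdb_file.is_file():
        raise FileNotFoundError(pdb_file)
    if not dll_file.is_file():
        raise FileNotFoundError(dll_file)
    target = dll_file.parent / pdb_file.name
    # Skip the copy when the PDB already sits next to the DLL.
    if target.resolve() != pdb_file.resolve():
        shutil.copy2(pdb_file, target)
    return target


# Usage with placeholder paths (adjust to the actual symbol cache and build):
# copy_pdb_next_to_dll(r"<symbol cache>\UnityPlayer_Win64_il2cpp_x64.pdb",
#                      r"<build folder>\UnityPlayer.dll")
# copy_pdb_next_to_dll(r"<pdb folder>\GameAssembly.pdb",
#                      r"<build folder>\GameAssembly.dll")
```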

AMD μProf now shows the correct function names and related metrics, whether the function is in UnityPlayer or GameAssembly.

[Figure: uprof]

Double-click the RobotManualInput::Update() method to open the source view, then find the matching IL2CPP-generated C++ source code. It matches the assembly code well. One hot part of this function appears to be caused by “Array bounds checks”, which can be disabled with Il2CppSetOption.

[Figure: source]

Summary

  • Profiling release builds introduces the least performance overhead, so always profile and optimize release builds.
  • GPU queue works with both Mono and IL2CPP.
  • Only IL2CPP generates CPU call stacks that Event Tracing can recognize.
  • AMD μProf works well with IL2CPP builds, and the metrics can be matched to the temporary C++ code that IL2CPP generates.
  • Profiling an IL2CPP build is better than a Mono build for recognizable CPU call stacks and improved performance.
  • Optimizing the C++ code that IL2CPP generates from Mono IL is a complex topic of its own. The IL2CPP-related articles on the Unity blog may help a lot, but do read them carefully.

Future

The keywords of this topic are Mono and IL2CPP, but this is not the end. Unity intends to migrate from the Mono .NET Runtime to CoreCLR. When Unity moves to CoreCLR in the future, details of the profiling process may need to be adjusted.

No matter how the runtime changes, any game can be correctly profiled if the four key factors mentioned above are satisfied: low performance impact, GPU queue, CPU call stack, and hardware metrics. These are the basis of CPU game profiling.
