Introduction to profiling tools for AMD hardware

Originally posted: April 12, 2023

Last updated: May 10, 2023

Thomas H. Gibson

Corresponding Author

Noah Wolfe

Corresponding Author

Gina Sitaraman

Author

Suyash Tandon

Author

Nicholas Curtis

Reviewer

Jakub Kurzak

Reviewer

Jonathan Madsen

Reviewer

George Markomanolis

Reviewer

Maria Ruiz Varela

Reviewer

Getting a code to be functionally correct is not always enough. In many industries, applications and their complex software stacks must also run as efficiently as possible to meet operational demands. This is particularly challenging because hardware continues to evolve, and codes may require further tuning as a result. In practice, many application developers construct benchmarks, which are carefully designed to measure the performance, such as execution time, of a particular code within an operational-like setting. In other words: a good benchmark should be representative of the real work that needs to be done. These benchmarks are useful in that they provide insight into the characteristics of the application and enable one to discover potential bottlenecks that could degrade performance in operational settings.

At face value, benchmarking sounds simple enough and is often interpreted as a comparison of execution time on a variety of different machines. However, extracting the most performance from emerging hardware usually takes several rounds of tuning and requires more than measuring raw execution time: one needs to know where the program is spending most of its time and whether further improvements can be made. Heterogeneous systems, where programs run on both CPUs and GPUs, introduce additional complexities, which makes understanding the critical path and kernel execution all the more important. Thus, performance tuning is a necessary component in the benchmarking process.

With AMD’s profiling tools, developers can gain important insight into how efficiently their application is utilizing hardware and effectively diagnose bottlenecks contributing to poor performance. Developers targeting AMD GPUs have multiple tools available depending on their specific profiling needs. This post serves as an introduction to the various profiling tools offered by AMD, from low-level profiling tools to extensive profiling suites, and explains why a developer might choose one over another.

In this introductory blog, we briefly describe the following tools that can aid in application analysis:

ROC-profiler
Omniperf
Omnitrace
Radeon™ GPU Profiler
AMD uProf
Other Third Party Tools

Terminology

The following terms are used in this blog post:

Term	Description
AMD “Zen” Core	AMD’s x86-64 processor core architecture design. Used by the AMD EPYC™, AMD Ryzen™, AMD Ryzen™ PRO, and AMD Threadripper™ PRO processor series.
RDNA™	AMD’s Traditional GPU architecture optimized for graphically demanding workloads like gaming and visualization. Includes the RX 5000, 6000 and 7000 GPUs.
CDNA™	AMD’s Compute dedicated GPU architecture optimized for accelerating HPC, ML/AI, and data center type workloads. Includes the AMD Instinct™ MI50/60, MI100, and MI200 series accelerators.
HIP	A C++ Runtime API and kernel language that allows developers to create portable compute kernels/applications for AMD and NVIDIA GPUs from a single source code
Timeline Trace	A profiling approach where durations of compute kernels and data transfers between devices are collected and visualized
Roofline Analysis	Hardware agnostic methodology for quantifying a workload’s ability to saturate the given compute architecture in terms of floating-point compute and memory bandwidth
Hardware Counters	Individual metrics which track how many times a certain event occurs in the hardware, such as bytes moved from L2 cache or 32-bit floating-point adds performed

What tools to use?

The first step in profiling is determining the right tool for the job. Whether one wants to collect traces on the CPU, GPU, or both, understand kernel behavior, or assess memory access patterns, performing such an analysis might appear daunting for new users of AMD hardware. We begin by identifying the architectures and operating systems supported by each of the profiling tools provided by AMD. Almost all the tools in Table 1 support Linux® distributions, and with the growing popularity of Instinct™ GPUs, every tool except Omnitrace on RDNA™-only workflows aside has some capability to profile codes running on the CDNA™ architecture. However, those who prefer Windows are limited to two tools: AMD uProf, which profiles CPU and GPU codes targeting AMD “Zen”-based processors and AMD Instinct™ GPUs, and Radeon™ GPU Profiler, which provides insight into an application’s use of the graphics pipeline (rasterization, shaders, etc.) on RDNA™-based GPUs.

AMD Profiling Tools	AMD “Zen” Core	RDNA™	CDNA™	Windows	Linux®
ROC-profiler	Not supported	☆	★	Not supported	★
Omniperf	Not supported	Not supported	★	Not supported	★
Omnitrace	★	☆	★	Not supported	★
Radeon™ GPU Profiler	Not supported	★	☆	★	☆
AMD uProf	★	Not supported	☆	★	☆

★ Full support | ☆ Partial support

Table 1: Profiler/architecture support and operating system needs.
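As a rough illustration of how Table 1 can be used as a first filter, the sketch below transcribes the table into a Python dictionary and lists the tools that have at least partial support in both a chosen architecture column and a chosen operating system column. The function name and the short column keys are our own, introduced only for this example. Note that the table lists architecture and operating system support independently, so a match here is a starting point to check against each tool's documentation, not a guarantee that the specific combination is supported.

```python
# Support levels transcribed from Table 1:
# "full" (★), "partial" (☆), or None (not supported).
SUPPORT = {
    "ROC-profiler": {
        "Zen": None, "RDNA": "partial", "CDNA": "full",
        "Windows": None, "Linux": "full",
    },
    "Omniperf": {
        "Zen": None, "RDNA": None, "CDNA": "full",
        "Windows": None, "Linux": "full",
    },
    "Omnitrace": {
        "Zen": "full", "RDNA": "partial", "CDNA": "full",
        "Windows": None, "Linux": "full",
    },
    "Radeon GPU Profiler": {
        "Zen": None, "RDNA": "full", "CDNA": "partial",
        "Windows": "full", "Linux": "partial",
    },
    "AMD uProf": {
        "Zen": "full", "RDNA": None, "CDNA": "partial",
        "Windows": "full", "Linux": "partial",
    },
}


def candidate_tools(architecture, operating_system):
    """Return (tool, arch_support, os_support) tuples for every tool that has
    at least partial support in both requested columns of Table 1."""
    matches = []
    for tool, row in SUPPORT.items():
        arch_support = row[architecture]
        os_support = row[operating_system]
        if arch_support and os_support:
            matches.append((tool, arch_support, os_support))
    return matches


if __name__ == "__main__":
    for tool, arch, os_ in candidate_tools("CDNA", "Windows"):
        print(f"{tool}: architecture={arch}, OS={os_}")
```

For CDNA™ on Windows this filter leaves only Radeon™ GPU Profiler and AMD uProf, which matches the Windows discussion above.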

The final choice of the tool on any platform depends on the profiling objective and the kind of analysis required. To make it simpler, we encourage the users to think of their objectives in terms of three questions as depicted in the flow diagram in Figure 1:

Where should I focus my time?: Whether benchmarking a new application, or getting started with a new software package that has not yet been profiled, it is recommended to first identify hotspots in the application that may benefit from quick optimization. In such a scenario, it is best if users start by collecting timelines and traces of their application. On Linux® platforms, Omnitrace enables the collection of CPU and GPU traces, and call stack samples to help identify major hotspots. However, on Windows, one may have to choose between AMD uProf and Radeon™ GPU Profiler depending on the targeted architecture.
How well am I using the hardware?: The first step is to obtain a characterization of workloads that can provide a glimpse into how well the hardware is being utilized, for example by identifying which parts of your application are memory bound or compute bound. This can be accomplished through roofline profiling. At this stage, hotspots are typically well understood, and the interest is usually in the performance of a few key kernels or subroutines. At present, roofline profiling is only available through Omniperf on AMD Instinct™ GPUs and AMD uProf on AMD “Zen”-based processors.
Why am I seeing this performance?: Once hotspots are identified and the initial assessment of performance on a particular hardware is completed, the next phase likely involves profiling and collecting hardware metrics to understand where the observed performance is coming from. On AMD GPUs, tools such as Omnitrace, Omniperf and AMD uProf interface with the low-level ROC-profiler API and/or use rocprof under the hood to gather GPU metrics. Unless there is a specific need, we do not recommend using rocprof directly because of the extra overhead of dealing with text/CSV files and hardware-specific metrics. On Windows systems, one will have to rely on either AMD uProf or Radeon™ GPU Profiler.
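To make the "memory bound or compute bound" distinction from the second question concrete, here is a minimal sketch of the basic roofline model: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity (FLOPs per byte moved) and peak memory bandwidth. The peak numbers below are made up for illustration and do not describe any particular AMD device; the function names are our own and are not part of Omniperf or AMD uProf.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline ceiling in GFLOP/s for a kernel with the given arithmetic
    intensity (FLOP/byte): min(compute peak, intensity * memory bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


def classify(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Label a kernel by comparing its arithmetic intensity with the ridge
    point, i.e. the intensity at which the two ceilings meet."""
    ridge_point = peak_gflops / peak_bandwidth_gbs
    if arithmetic_intensity < ridge_point:
        return "memory bound"
    return "compute bound"


if __name__ == "__main__":
    # Hypothetical device: 10,000 GFLOP/s peak compute, 1,000 GB/s peak bandwidth.
    PEAK_GFLOPS = 10_000.0
    PEAK_BANDWIDTH_GBS = 1_000.0

    # Hypothetical kernels with low and high arithmetic intensity.
    for name, intensity in [("streaming_update", 0.5), ("dense_matmul", 20.0)]:
        ceiling = attainable_gflops(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
        label = classify(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
        print(f"{name}: ceiling {ceiling:.0f} GFLOP/s, {label}")
```

A roofline profiler measures the arithmetic intensity and achieved FLOP rate of each kernel and plots them against ceilings like these, so a kernel sitting far below its ceiling is a candidate for further tuning.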

Quick Tip: The Omni* suite of tools (Omniperf and Omnitrace), available on Linux® platforms, provides an easy-to-use interface for studying the performance of code across AMD hardware and should be treated as the “go-to” profiling tools for performance tuning and benchmarking.

Figure 1: Use cases for a variety of AMD profiling tools.

Overview of profiling tools

In this section, we provide a brief overview of the above-mentioned AMD tools and some third-party toolkits.

Omnitrace

Omnitrace is a comprehensive profiling and tracing tool for parallel applications, including HPC and ML packages, written in C, C++, Fortran, HIP, OpenCL™, and Python™ which execute on the CPU or CPU+GPU. It is capable of gathering the performance information of functions through any combination of binary instrumentation, call-stack sampling, user-defined regions, and Python™ interpreter hooks. Omnitrace supports interactive visualization of comprehensive traces in the web browser in addition to high-level summary profiles with mean/min/max/stddev statistics. Beyond runtime information, Omnitrace supports the collection of system-level metrics such as CPU frequency, GPU temperature, and GPU utilization. Process and thread level metrics such as memory usage, page faults, context switches, and numerous other hardware counters are also included.

When analyzing the performance of an application, it is always best to NOT assume you know where the performance bottlenecks are and why they are happening. Omnitrace is the ideal tool for characterizing where optimization would have the greatest impact on the end-to-end execution of the application and/or viewing what else is happening on the system during a performance bottleneck.

Figure 2: Omnitrace timeline trace example.

Please see the official Omnitrace documentation for the latest information. Users are encouraged to submit issues, feature requests, and provide any additional feedback.

Omniperf

Omniperf is a system performance profiler for High-Performance Computing (HPC) and Machine-Learning (ML) workloads using AMD Instinct™ GPUs. Omniperf utilizes AMD ROC-profiler to collect hardware performance counters. The Omniperf tool performs system profiling based on all approved hardware counters for AMD Instinct™ MI200 and MI100 architectures. It provides high level performance analysis features including System Speed-of-Light, IP block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more.

Omniperf takes the guesswork out of profiling by removing the need to provide text input files with lists of counters to collect, and to analyze raw CSV output files, as is the case with ROC-profiler. Instead, Omniperf automates the collection of all available hardware counters in one command and provides a graphical interface to help users understand and analyze bottlenecks and stressors for their computational workloads on AMD Instinct™ GPUs. Note that Omniperf collects hardware counters in multiple passes, and will therefore re-run the application during each pass to collect different sets of metrics.
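To give some intuition for why collecting many counters can require several passes, the sketch below assumes that each hardware block can only record a limited number of counters at once, and greedily packs a list of requested counters into as few passes as that limit allows. This is an illustration of the general idea only, not Omniperf's actual scheduling logic; the per-block limit is hypothetical and the block and counter names are used purely as examples.

```python
from collections import defaultdict


def plan_passes(counters, max_per_block):
    """Greedily assign (block, counter) pairs to passes so that no pass
    holds more than max_per_block counters from the same hardware block.
    Returns one sorted list of counter names per pass."""
    passes = []  # each pass maps a block name to the counters assigned to it
    for block, name in counters:
        for assigned in passes:
            if len(assigned[block]) < max_per_block:
                assigned[block].append(name)
                break
        else:
            # No existing pass has room for this block, so start a new pass.
            new_pass = defaultdict(list)
            new_pass[block].append(name)
            passes.append(new_pass)
    return [
        sorted(name for names in assigned.values() for name in names)
        for assigned in passes
    ]


if __name__ == "__main__":
    # Illustrative request: three counters from one block, two from another.
    requested = [
        ("SQ", "SQ_WAVES"),
        ("SQ", "SQ_INSTS_VALU"),
        ("SQ", "SQ_INSTS_SALU"),
        ("TCC", "TCC_HIT"),
        ("TCC", "TCC_MISS"),
    ]
    # Hypothetical limit of two simultaneous counters per block.
    for index, names in enumerate(plan_passes(requested, max_per_block=2), 1):
        print(f"pass {index}: {', '.join(names)}")
```

Under these assumptions the five counters need two passes, which means two runs of the application. This is also why a profiled workload should behave reproducibly from run to run.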

Figure 3: Omniperf memory chart analysis panel.

In a nutshell, Omniperf provides details about hardware activity for a particular GPU kernel. It supports both a web-based GUI and a command-line analyzer, depending on the user’s preference. For up-to-date information on available Omniperf features, we highly encourage readers to view the official Omniperf documentation. Users are encouraged to submit issues and feature requests, and we welcome contributions and feedback from the community.

ROC-profiler

The ROC-profiler primarily serves as the low-level API for accessing and extracting GPU hardware performance metrics, also typically called performance counters. These counters quantify the performance of the underlying architecture, showing which pieces of the computational pipeline and memory hierarchy are being utilized. A command-line script called rocprof is packaged with the ROCm™ installation. It can list all available hardware counters for your specific GPU, and it can run applications and collect counters during their execution.

The rocprof utility also depends on the ROC-tracer and ROC-TX libraries, giving it the ability to collect timeline traces of the GPU software stack as well as user-annotated code regions. Note that rocprof is a command-line-only utility, so its input and output take the form of text and CSV files. These formats provide a raw view of the data and put the onus on the user to parse and analyze it. In short, rocprof gives the user full access to and control of the raw performance profiling data, but requires extra effort to analyze what is collected.
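As an example of the kind of post-processing that raw CSV output leaves to the user, the sketch below aggregates per-dispatch kernel durations into a per-kernel summary sorted by total time. The column names KernelName and DurationNs are assumptions made for this example, as are the sample rows; the actual columns depend on the rocprof version and the options used, so check the header of your own output file and adjust the arguments accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for a profiler CSV file; real output
# contains more columns and the names may differ.
SAMPLE_CSV = """KernelName,DurationNs
vector_add,1200
matmul,5000
vector_add,800
matmul,3000
"""


def summarize_kernels(csv_text, name_col="KernelName", duration_col="DurationNs"):
    """Return (kernel, calls, total_ns, percent_of_total) tuples, sorted by
    total duration in descending order."""
    totals = defaultdict(lambda: [0, 0])  # kernel name -> [calls, total_ns]
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row[name_col]]
        entry[0] += 1
        entry[1] += int(row[duration_col])

    grand_total = sum(total for _, total in totals.values())
    if grand_total == 0:
        return []

    summary = [
        (name, calls, total, 100.0 * total / grand_total)
        for name, (calls, total) in totals.items()
    ]
    summary.sort(key=lambda item: item[2], reverse=True)
    return summary


if __name__ == "__main__":
    for name, calls, total_ns, percent in summarize_kernels(SAMPLE_CSV):
        print(f"{name}: {calls} calls, {total_ns} ns total, {percent:.1f}%")
```

To use this on a real file, read its contents with open(path).read() and pass the column names found in that file's header.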

Radeon™ GPU Profiler

The Radeon™ GPU Profiler is a performance tool that can be used by traditional gaming and visualization developers to optimize DirectX 12 (DX12) and Vulkan™ for AMD RDNA™ hardware. The Radeon™ GPU Profiler (RGP) is a ground-breaking low-level optimization tool from AMD. It provides detailed timing information on Radeon™ Graphics using custom, built-in, hardware thread-tracing, allowing the developer deep inspection of GPU workloads. This unique tool generates easy to understand visualizations of how your DX12 and Vulkan™ games interact with the GPU at the hardware level. Profiling a game is both a quick and simple process using the Radeon™ Developer Panel together with the public display driver.

Note: Radeon™ GPU Profiler does support OpenCL™ and HIP applications, but it requires running on AMD RDNA™ GPUs in the Windows environment. Running HIP and OpenCL™ in the Windows environment is a whole blog series in itself and is outside the current recommendation for HPC applications. For HPC workloads, we recommend programming with HIP in a Linux® environment on AMD Instinct™ GPUs and profiling with Omniperf, Omnitrace or ROC-profiler.

AMD uProf

AMD uProf (AMD MICRO-prof) is a software profiling analysis tool for x86 applications running on Windows, Linux® and FreeBSD operating systems and provides event information unique to the AMD “Zen”-based processors and AMD Instinct™ MI Series accelerators. AMD uProf enables the developer to better understand the limiters of application performance and evaluate improvements.

AMD uProf offers:

Performance Analysis to identify runtime performance bottlenecks of the application
System Analysis to monitor system performance metrics
Roofline analysis
Power Profiling to monitor thermal and power characteristics of the system
Energy Analysis to identify energy hotspots in the application (Windows only)
Remote Profiling to connect to remote Linux® systems (from a Windows host system), trigger collection/translation of data on the remote system and report it in local GUI
Initial support for AMD CDNA™ accelerators is provided in AMD uProf 3.6 for AMD Instinct™ MI100 and MI200 Series devices and new features are under development

Other third-party tools

In the High Performance Computing space, a number of third-party profiling tools have enabled support for ROCm™ and AMD Instinct™ GPUs. This lets users maintain a vendor-independent approach to profiling, with an easy-to-use, high-level suite of functionality that can potentially provide a unified profiling experience across numerous architectures. For users already familiar with these tools, it makes for another easy entry point into understanding the performance of their workloads on AMD hardware.

Currently available third-party profiling tools include:

HPCToolkit
TAU
Vampir
CrayPat (only for CrayOS platforms)

Next time

Stay tuned as we release further posts in this series diving into the details of setting up and utilizing these available tools. Complete with examples!

If you have any questions or comments, please reach out to us on GitHub Discussions.

Thomas H. Gibson

Corresponding Author

Thomas Gibson is a Member of Technical Staff (MTS) Software System Design Engineer in the Data Center GPU Software Solutions group. He obtained his PhD in computational mathematics from Imperial College London, where he specialized in mixed finite element discretizations for numerical weather modeling codes. After completing his PhD in 2020, Thomas continued to work on structure-preserving ("compatible") finite element methods and multigrid preconditioners for weather applications. Additionally, he began shifting his research towards accelerating fluid dynamics codes using GPUs and developed high-fidelity/low-dissipation discontinuous Galerkin methods for turbulence/combustion models on GPUs. His current research interests include optimizing C/C++/Fortran GPU applications, iterative solvers and preconditioning, finite element discretizations, and numerical weather prediction applications.

Noah Wolfe

Corresponding Author

Noah Wolfe is a Member of Technical Staff (MTS) in the Data Center GPU Software Solutions group. He is also an AMD member of the Frontier and El Capitan Centers of Excellence (COE) collaborating with application teams to port, analyze, and optimize their HPC workloads for AMD hardware. He has been working at AMD on a wide range of HPC workloads from Cosmology to particle transport, and running at a wide range of scales from small synthetic single-process tests to large-scale runs on thousands of processes. If he is not sitting at his computer, you will find him out and about trying to keep up with his two daughters.

Gina Sitaraman

Author

Gina Sitaraman is a Senior Member of Technical Staff (SMTS) Software System Design Engineer in the Data Center GPU Software Solutions group. She obtained her PhD in Computer Science from the University of Texas at Dallas. She has over a decade of experience in the seismic data processing field developing and optimizing pre-processing, migration and post processing applications using hybrid MPI + OpenMP on CPU clusters and using CUDA or OpenCL on GPUs. She spends her time at AMD solving optimization challenges in scientific applications running on large-scale HPC clusters.

Suyash Tandon

Author

Suyash Tandon is a Senior Member of Technical Staff (SMTS) at AMD with a focus on performance engineering and optimization of scientific applications on the modern AMD GPUs that drive the world's most sophisticated supercomputers. He has 5+ years of experience in research and development of GPU-accelerated numerical algorithms and applications. Suyash's technical interests lie at the intersection of computational physics, with emphasis on fluid dynamics of complex flows, numerical modeling, and high performance computing. He holds a Ph.D. in Mechanical Engineering and in Scientific Computing from the University of Michigan and is a proud Wolverine.

Nicholas Curtis

Reviewer

Nicholas Curtis is a Senior Member of Technical Staff (SMTS) in the Data Center GPU Software Solutions Group at AMD. Nick has led AMD's efforts working on porting and optimizing Kokkos' HIP backend, and was responsible for LAMMPS porting and optimization for the Frontier Center of Excellence. Nick obtained his PhD in Energy and Thermal Sciences at the University of Connecticut, where he studied GPU-accelerated reacting flow simulation. Nick's research interests range from high-level languages and their implementation on GPUs, to compiler & runtime analysis and optimization, low-level GPU microbenchmarking / profiling, and the interaction of GPU hardware / runtimes with the Linux kernel.

Jakub Kurzak

Reviewer

Jakub Kurzak is a Senior Member of Technical Staff (SMTS) Software System Design Engineer in the Data Center GPU Software Solutions group. He obtained his PhD in Computer Science from the University of Houston, Texas. He has two decades of experience in parallel computing and a decade of experience in GPU computing. Jakub spent 14 years at the Innovative Computing Laboratory (ICL) at the University of Tennessee Knoxville (UTK) working on dense linear algebra software. He spends his time at AMD porting and optimizing large-scale scientific applications for AMD hardware.

Jonathan Madsen

Reviewer

Jonathan Madsen is a Senior Member of Technical Staff (SMTS) Software System Design Engineer in the Audacious Software group of AMD Research. He is the lead developer of the Omnitrace performance analysis tool and focuses on improving observability and performance analysis for software running on AMD heterogeneous systems. Prior to his time at AMD, he earned his PhD in Nuclear Engineering from Texas A&M University and spent the ensuing four years as an Application Performance Specialist at the National Energy Research Scientific Computing Center (NERSC) of Lawrence Berkeley National Laboratory (LBNL). During his time at NERSC, Jonathan served as one of LBNL's representatives on the C++ standardization committee, served as a mentor in U.S. Department of Energy (DOE) sponsored workshops for porting software to execute on the GPU, authored a performance analysis toolkit named timemory, authored the multithreading tasking framework for the CERN simulation toolkit named Geant4, and collaborated with researchers from several DOE research labs on the development of a performance portable execution framework targeting all major HPC platforms named Kokkos.

George Markomanolis

Reviewer

George Markomanolis is a Principal Member of Technical Staff Software Development Engineer at AMD. He helps with AMD training, in addition to supporting European HPC sites. He works on understanding and porting codes for AMD GPUs. In his current and previous roles, he has prepared and given many training sessions on HIP porting, HIP programming, benchmarking GPUs, and evaluating various programming models that can be used on AMD GPUs. His research interests are in porting applications to GPUs, benchmarking, performance evaluation/optimization of HPC applications on various technologies, and parallel I/O analysis on filesystems. He is co-developer and member of the IO500 committee. Before joining AMD, he worked in various supercomputing centers. He obtained his MSc in Computational Science from the National and Kapodistrian University of Athens, Greece in 2008 and his Ph.D. in Computer Science from the Ecole Normale Superieure de Lyon, France in 2014.

Maria Ruiz Varela

Reviewer

Maria Ruiz Varela is a Senior Member of Technical Staff at AMD focusing on validation, debugging and quality of HPC applications running on AMD GPUs. Prior to joining AMD, Maria was responsible for RAS system validation for the US DOE Aurora Exascale Supercomputer (A21) at Intel. She has experience in HPC cluster validation, integration, and execution, as well as extensive SW engineering experience supporting mission and safety critical applications for the Automotive industry in the US and Mexico. She has published research in the areas of fault-tolerance for massively-parallel-processing, large-scale systems and emerging non-volatile memories for embedded systems. She is a member of the SC21, SC22 and SC23 Inclusivity committees. Maria holds an M.Sc. in Computer Science from the University of Delaware.