Optimization of ONNX Models for TensorRT Execution

Introduction

TensorRT is NVIDIA's library for running deep learning models efficiently on NVIDIA GPUs, and it can parse ONNX and Caffe models. Alongside the library itself, trtexec is a command-line tool that lets you use TensorRT without setting up a development environment: it builds serialized engines from ONNX, Caffe, or UFF models and benchmarks networks. Further information can be found in the TensorRT documentation.
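For readers who prefer the API over the command line, the sketch below shows roughly what trtexec does when it builds an engine. It is a minimal sketch assuming the TensorRT 8.x Python API; model.onnx and model.plan are placeholder file names:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model into a TensorRT network definition
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

# Build the serialized engine ("plan") and save it to disk
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)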

In this post, we will use the trtexec tool to optimize ONNX models for TensorRT execution, aiming for faster inference and more efficient memory usage.

trtexec (and TensorRT itself) offers separate configuration options for building (serializing) models and for running inference. First, we will examine the build options.

Build Options

A. TacticSource

Tactic sources let TensorRT draw additional kernel implementations (tactics) from external libraries for efficient inference. However, enabling them increases GPU memory usage, so if you need a smaller memory footprint, the tactic sources you do not need should be disabled. The available sources include CUBLAS, CUBLAS_LT, CUDNN, and EDGE_MASK_CONVOLUTIONS.

Usage:

trtexec --onnx=model.onnx --saveEngine=model.plan \
    --tacticSources=+CUBLAS,-CUBLAS_LT,-CUDNN,+EDGE_MASK_CONVOLUTIONS

+ : enable the selected tactic source
- : disable the selected tactic source
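The same selection can be made programmatically through the builder config. Below is a rough Python sketch, assuming a TensorRT 8.x release that exposes these TacticSource enum values:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Bitmask equivalent of --tacticSources=+CUBLAS,-CUBLAS_LT,-CUDNN,+EDGE_MASK_CONVOLUTIONS:
# enable only cuBLAS and edge-mask convolutions, leave the rest disabled.
sources = (1 << int(trt.TacticSource.CUBLAS)) | \
          (1 << int(trt.TacticSource.EDGE_MASK_CONVOLUTIONS))
config.set_tactic_sources(sources)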

B. fp16

It enables float16 precision in addition to float32. When the fp16 flag is set, weights are converted to fp16 for the layers that support it. This reduces the size of the serialized engine file and accelerates inference.

Usage:

trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
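In the Python API, the same behaviour corresponds to a builder flag. A minimal sketch, reusing the build code from the introduction:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow fp16 kernels for layers that support them
# ... then parse the network and call builder.build_serialized_network(network, config) as before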

There are many other build options, e.g. int8, sparsity, maxBatch, etc. All available options can be listed via

trtexec --help

Inference Options

A. noDataTransfers

It disables DMA transfers to and from the device during benchmarking, which reduces fluctuations in measured GPU speed.

Usage:

trtexec --onnx=model.onnx --saveEngine=model.plan --noDataTransfers

B. useSpinWait

It actively spins while waiting for GPU work to complete, which decreases synchronization time and GPU speed fluctuations but increases CPU usage and power consumption.

Usage:

trtexec --onnx=model.onnx --saveEngine=model.plan --useSpinWait
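Under the hood this is the usual spin-versus-blocking synchronization trade-off. The sketch below uses pycuda (not part of trtexec, and trtexec's actual implementation may differ) purely to illustrate the idea:

import pycuda.autoinit            # creates a CUDA context
import pycuda.driver as cuda

stream = cuda.Stream()

# Blocking synchronization: the host thread yields/sleeps until the GPU is done
# (low CPU usage, but wake-up latency adds jitter to measured times).
blocking_event = cuda.Event(cuda.event_flags.BLOCKING_SYNC)
blocking_event.record(stream)
blocking_event.synchronize()

# Spin wait: with default event flags the host typically busy-polls the event
# (lower and more stable latency, but higher CPU usage and power).
spin_event = cuda.Event()
spin_event.record(stream)
spin_event.synchronize()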

C. streams

It instantiates N engines to run concurrently, which increases throughput (tps). However, it slightly increases GPU memory usage.

Note: It should be left at 1, the default value, if the model uses a dynamic batch size.

Usage:

trtexec --onnx=model.onnx --saveEngine=model.plan --streams=2
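With the API, a rough equivalent is creating several execution contexts on one deserialized engine and enqueueing work on a separate CUDA stream for each. The sketch below uses pycuda for streams and device buffers, and assumes a TensorRT 8.x engine with static input shapes:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

NUM_STREAMS = 2  # rough equivalent of --streams=2
contexts, streams, bindings = [], [], []
for _ in range(NUM_STREAMS):
    # Each concurrent context needs its own set of device buffers
    buffers = []
    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))
        size = trt.volume(engine.get_binding_shape(i)) * np.dtype(dtype).itemsize
        buffers.append(int(cuda.mem_alloc(size)))
    contexts.append(engine.create_execution_context())
    streams.append(cuda.Stream())
    bindings.append(buffers)

# Enqueue inference on every stream, then wait for all of them to finish
for ctx, stream, bufs in zip(contexts, streams, bindings):
    ctx.execute_async_v2(bindings=bufs, stream_handle=stream.handle)
for stream in streams:
    stream.synchronize()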

D. useCudaGraph

It captures the engine execution in a CUDA graph and then launches inference by replaying that graph. This removes most kernel launch overhead and can greatly increase GPU runtime speed.

Usage:

trtexec --onnx=model.onnx --saveEngine=model.plan --useCudaGraph

There are many other inference options, e.g. threads, exposeDMA, etc. All available options can be listed via

trtexec --help