NVIDIA DCGM User Guide

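1. What is DCGM

DCGM (Data Center GPU Manager) is NVIDIA's toolset for managing and monitoring GPUs in data centers. It provides:

  • GPU behavior monitoring
  • GPU configuration management
  • GPU policy oversight
  • GPU health and diagnostics
  • GPU accounting and process statistics
  • NVSwitch configuration and monitoring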

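2. Installing DCGM

2.1 Install libnvidia-nscq

GPUs are usually connected over NVLink. To decide whether libnvidia-nscq is needed, run nvidia-smi topo -m and check whether the output mentions NVSwitch.

  • On a machine with NVSwitches, DCGM may be unable to query NVSwitch information when this library is missing

The error looks like this: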

cat /var/log/nv-hostengine.log

09:52:13.204 ERROR [2539914:2539930] [[NvSwitch]] Not attached to NvSwitches. Aborting [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:975] [DcgmNs::DcgmNvSwitchManager::ReadNvSwitchStatusAllSwitches]
  • Check the NVIDIA driver version
nvidia-smi | grep "Driver Version"
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
  • Install the libnvidia-nscq package that matches the driver's major version
apt install libnvidia-nscq-570
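
The package suffix follows the driver's major version. A minimal sketch of deriving it, assuming the libnvidia-nscq-<major> naming used above also holds for your driver branch; nscq_package_for is a hypothetical helper name, not part of any NVIDIA tool:

```shell
# Hypothetical helper: map a driver version such as 570.158.01
# to the package name used above (libnvidia-nscq-570).
nscq_package_for() {
  local version="$1"
  # Keep only the part before the first dot (the major version).
  printf 'libnvidia-nscq-%s\n' "${version%%.*}"
}

# Usage on a real host (reads the driver version of the first GPU):
#   ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   apt install "$(nscq_package_for "$ver")"
```

Querying the version through --query-gpu avoids scraping the banner line with grep.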

2.2 Install DCGM

  • Add the repository

Ubuntu 20.04

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

Ubuntu 22.04

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
  • Install DCGM
apt-get install -y datacenter-gpu-manager
  • Start the DCGM service
systemctl start nvidia-dcgm

You can also have it start at boot:

systemctl enable nvidia-dcgm
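
The repository URL is keyed by the Ubuntu release, so it can be built from the release number instead of being copied by hand. A small sketch under that assumption; cuda_repo_url is a hypothetical helper, and it only covers the x86_64 layout shown above:

```shell
# Hypothetical helper: build the CUDA repository URL for an Ubuntu release.
# "20.04" becomes ubuntu2004, "22.04" becomes ubuntu2204 (the dot is removed).
cuda_repo_url() {
  local release="$1"   # e.g. VERSION_ID from /etc/os-release
  local tag
  tag="ubuntu$(printf '%s' "$release" | tr -d '.')"
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64/\n' "$tag"
}

# Usage on a real host:
#   . /etc/os-release
#   add-apt-repository "deb $(cuda_repo_url "$VERSION_ID") /"
```

Deriving the tag from /etc/os-release keeps the repository line consistent with the release the machine actually runs.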

3. Command-line help

dcgmi --help

Usage: dcgmi
   dcgmi subsystem
   dcgmi -v

Flags:
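  -v    vv          Get DCGMI version information
        subsystem   The subsystem to access
 Subsystems Available:
        topo        GPU topology (dcgmi topo -h for more info)
        stats       Process statistics (dcgmi stats -h for more info)
        diag        System validation/diagnostics (dcgmi diag -h for more info)
        policy      Policy management (dcgmi policy -h for more info)
        health      Health monitoring (dcgmi health -h for more info)
        config      Configuration management (dcgmi config -h for more info)
        group       GPU group management (dcgmi group -h for more info)
        fieldgroup  Field group management (dcgmi fieldgroup -h for more info)
        discovery   Discover GPUs on the system (dcgmi discovery -h for more info)
        introspect  Gather information about DCGM itself (dcgmi introspect -h for more info)
        nvlink      Display NvLink link status and error counts (dcgmi nvlink -h for more info)
        dmon        GPU stats monitoring (dcgmi dmon -h for more info)
        modules     Control and list DCGM modules
        profile     Control and list DCGM profiling metrics
        set         Configure hostengine settings
  --    ignore_rest Ignore the rest of the labeled arguments following this flag
      --version     Display version information and exit
  -h  --help        Display usage information and exit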

4. Device management

4.1 discovery: list GPUs

dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA H20                                                     |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-e3425ddb-41fc-f3ca-4d3f-dd8cbcb6896b                |
+--------+----------------------------------------------------------------------+
...
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
0 CPUs found.
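
A common use of the discovery output is a sanity check that every expected GPU is visible. A minimal sketch that extracts the count from the summary line, assuming the "<N> GPUs found." wording shown above; discovered_gpu_count is a hypothetical helper name:

```shell
# Read `dcgmi discovery -l` output on stdin and print the GPU count
# taken from the "<N> GPUs found." summary line.
discovered_gpu_count() {
  awk '/GPUs? found\./ { print $1; exit }'
}

# Usage on a real host (8 is the count expected on the machine above):
#   [ "$(dcgmi discovery -l | discovered_gpu_count)" -eq 8 ] || echo "GPU missing"
```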

4.2 topo: view GPU topology

dcgmi topo --gpuid 0
+-------------------+------------------------------------------------------------------------------+
| Topology Information                                                                             |
| GPU ID: 0                                                                                        |
+===================+==============================================================================+
| CPU Core Affinity | 0 - 55, 112 - 167                                                            |
| To GPU 1          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
| To GPU 2          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
| To GPU 3          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
| To GPU 4          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
| To GPU 5          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
| To GPU 6          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
| To GPU 7          | Connected via a CPU-level link                                               |
|                   | Connected via eighteen NVLINKs (Links: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) |
+-------------------+------------------------------------------------------------------------------+

GPU 0 is connected to every other GPU over NVLink.

dcgmi nvlink -s
+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        U U U U U U U U U U U U U U U U U U
    gpuId 1:
        U U U U U U U U U U U U U U U U U U
    gpuId 2:
        U U U U U U U U U U U U U U U U U U
    gpuId 3:
        U U U U U U U U U U U U U U U U U U
    gpuId 4:
        U U U U U U U U U U U U U U U U U U
    gpuId 5:
        U U U U U U U U U U U U U U U U U U
    gpuId 6:
        U U U U U U U U U U U U U U U U U U
    gpuId 7:
        U U U U U U U U U U U U U U U U U U
NvSwitches:
    No NvSwitches found.

Key: Up=U, Down=D, Disabled=X, Not Supported=_

Each GPU has 18 NVLink links, and every link is Up.
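
Reading the status matrix by eye does not scale to many hosts. The sketch below counts links reported as Down, assuming the layout above where each status row holds only the letters from the key (U, D, X, _); nvlink_down_count is a hypothetical helper name:

```shell
# Read `dcgmi nvlink -s` output on stdin and print how many links are Down.
nvlink_down_count() {
  awk '
    NF > 0 {
      # A status row consists solely of single-letter states from the key.
      status_line = 1
      for (i = 1; i <= NF; i++)
        if ($i !~ /^[UDX_]$/) { status_line = 0; break }
      if (status_line)
        for (i = 1; i <= NF; i++)
          if ($i == "D") down++
    }
    END { print down + 0 }
  '
}

# Usage on a real host:
#   [ "$(dcgmi nvlink -s | nvlink_down_count)" -eq 0 ] || echo "NVLink link down"
```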

5. Organization and structure

5.1 group: manage GPU groups

group collects GPUs into named groups so they can be monitored and managed together.

  • Create a group
dcgmi group -c production

Groups are not persisted: restarting the DCGM service discards them.

  • List groups
dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 3 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3, GPU 4, GPU 5, GPU 6, GPU 7   |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | production                                               |
|    -> Entities    | None                                                     |
+-------------------+----------------------------------------------------------+
  • Add GPUs to a group

Here GPU 0 and GPU 1 are added to the production group.

dcgmi group -g 2 -a 0,1
  • View the group details
dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 3 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3, GPU 4, GPU 5, GPU 6, GPU 7   |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | production                                               |
|    -> Entities    | GPU 0, GPU 1                                             |
+-------------------+----------------------------------------------------------+
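
Because groups are recreated after a service restart, the numeric group ID is not stable, while the name is. A sketch that looks the ID up by name from the listing above, assuming the "-> Group ID" and "-> Group Name" row layout shown; group_id_by_name is a hypothetical helper name:

```shell
# Read `dcgmi group -l` output on stdin and print the ID of the group
# whose name equals the first argument.
group_id_by_name() {
  awk -F'|' -v want="$1" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # The ID row precedes the name row of the same group, so remember it.
    $2 ~ /-> Group ID/   { id = trim($3) }
    $2 ~ /-> Group Name/ && trim($3) == want { print id; exit }
  '
}

# Usage on a real host:
#   gid="$(dcgmi group -l | group_id_by_name production)"
#   dcgmi group -g "$gid" -a 0,1
```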

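5.2 fieldgroup: manage field groups

fieldgroup creates and maintains field groups. A field group is a set of metrics used for monitoring and statistics, and each field is one metric.

  • List field groups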
dcgmi fieldgroup -l
3 field groups found.
+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 1                                                        |
| Name              | DCGM_INTERNAL_30SEC                                      |
| Field IDs         | 300                                                      |
+-------------------+----------------------------------------------------------+
+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 2                                                        |
| Name              | DCGM_INTERNAL_HOURLY                                     |
| Field IDs         | 501, 509, 510, 511, 512, 513                             |
+-------------------+----------------------------------------------------------+
+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 3                                                        |
| Name              | DCGM_INTERNAL_JOB                                        |
| Field IDs         | 205, 155, 156, 200, 201, 202, 203, 204, 311, 100, 101,   |
|                   | 230, 221, 220, 240, 241, 242, 210, 211, 390, 391,        |
|                   | 392, 84, 241, 240, 409, 419, 429, 439                    |
+-------------------+----------------------------------------------------------+
  • Create a field group
dcgmi fieldgroup -c testgroup -f 1002,1003

Listing the field groups again shows the newly created one.

+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 18                                                       |
| Name              | testgroup                                                |
| Field IDs         | 1002, 1003                                               |
+-------------------+----------------------------------------------------------+

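6. Monitoring and statistics

When monitoring and collecting statistics, you can target individual GPUs or a group, and select metrics either by field ID or by field group.

6.1 profile: profiling metrics

profile controls and lists the profiling metrics. The values themselves are read with the dmon and stats commands.

  • List the profiling metrics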
dcgmi profile -l
+----------------+----------+------------------------------------------------------+
| Group.Subgroup | Field ID | Field Tag                                            |
+----------------+----------+------------------------------------------------------+
| A.0            | 1001     | gr_engine_active                                     |
| A.0            | 1002     | sm_active                                            |
| A.0            | 1003     | sm_occupancy                                         |
| A.0            | 1004     | tensor_active                                        |
| A.0            | 1005     | dram_active                                          |
| A.0            | 1006     | fp64_active                                          |
| A.0            | 1007     | fp32_active                                          |
| A.0            | 1008     | fp16_active                                          |
| A.0            | 1009     | pcie_tx_bytes                                        |
| A.0            | 1010     | pcie_rx_bytes                                        |
| A.0            | 1011     | nvlink_tx_bytes                                      |
| A.0            | 1012     | nvlink_rx_bytes                                      |
| A.0            | 1013     | tensor_imma_active                                   |
| A.0            | 1014     | tensor_hmma_active                                   |
...
  • List the metrics available for a group or for specific GPUs
dcgmi profile -g 2 -l
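  • Pause and resume collection

dcgmi profile --pause and dcgmi profile --resume are mainly for developers who use nvprof, Nsight Compute, or Nsight Systems, since those tools need the same performance counters.

6.2 dmon: real-time metric monitoring

  • List the field groups and the field IDs they contain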
dcgmi fieldgroup -l
  • Select a GPU group and field IDs
dcgmi dmon -g 2 -e 100,101
#Entity   SMCLK        MMCLK
ID
GPU 1     1980         2619
GPU 0     1980         2619
  • Select a GPU group and a field group
dcgmi dmon -g 2 -f 2
#Entity   SPINF                       VTID                        VTPINF                      VTPNM                       VTPCLS                      VTPLC
ID
GPU 1     N/A                         N/A                         N/A                         N/A                         N/A                         N/A
GPU 0     N/A                         N/A                         N/A                         N/A                         N/A                         N/A
  • Select a GPU directly and view a metric
dcgmi dmon -i 0 -e 1004
#Entity   TENSO
ID
GPU 0     N/A
GPU 0     0.000
GPU 0     0.000
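
The first sample of a freshly watched field is often N/A, as in the output above, so any aggregation has to skip it. A sketch that averages the first metric column of dmon output, assuming the "GPU <id> <value>" row layout shown; dmon_average is a hypothetical helper name:

```shell
# Read `dcgmi dmon` output on stdin and print the mean of the first
# metric column, ignoring header rows and N/A samples.
dmon_average() {
  awk '
    $1 == "GPU" && $3 ~ /^[0-9]+(\.[0-9]+)?$/ { sum += $3; n++ }
    END {
      if (n > 0) printf "%.3f\n", sum / n
      else print "N/A"
    }
  '
}

# Usage on a real host (-c limits the number of samples):
#   dcgmi dmon -i 0 -e 1004 -c 10 | dmon_average
```

The helper mixes samples from all GPUs in the input, so it is most meaningful when a single GPU is selected with -i.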

6.3 stats: process statistics

  • Enable collection
dcgmi stats -g 2 -e
  • View statistics for one process
dcgmi stats -g 2 -p 2038635 -v
  • View statistics for all processes
dcgmi stats -g 2 -j -v
  • Stop collection
dcgmi stats -g 2 -d
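
The four steps above form a fixed sequence: enable collection, run the workload, query by PID, stop collection. A sketch that wraps them around one command, using only the flags shown above; run_with_stats is a hypothetical wrapper, and stopping collection at the end is a choice made for this example:

```shell
# Hypothetical wrapper: run a command while process stats are collected
# for a DCGM group, then print the stats recorded for that process.
# Usage: run_with_stats <group-id> <command> [args...]
run_with_stats() {
  local group="$1"
  shift
  dcgmi stats -g "$group" -e || return 1   # start collecting
  "$@" &                                   # launch the workload
  local pid=$!
  wait "$pid"
  local rc=$?
  dcgmi stats -g "$group" -p "$pid" -v     # per-process report
  dcgmi stats -g "$group" -d               # stop collecting
  return "$rc"                             # propagate the workload's status
}

# Usage on a real host (train.py is a placeholder workload):
#   run_with_stats 2 python3 train.py
```

Collection has to be enabled before the workload starts, which is why the wrapper launches the command itself rather than attaching to an existing PID.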

7. Health and diagnostics

7.1 health: health monitoring

  • View the health watches
dcgmi health -g 2 -f
Health monitor systems report
+-----------------+--------------------------------------------------------------------+
| PCIe            | Off                                                                |
| NVLINK          | Off                                                                |
| Memory          | Off                                                                |
| SM              | Off                                                                |
| InfoROM         | Off                                                                |
| Thermal         | Off                                                                |
| Power           | Off                                                                |
| Driver          | Off                                                                |
| NvSwitch NF     | Off                                                                |
| NvSwitch F      | Off                                                                |
+-----------------+--------------------------------------------------------------------+
  • Enable all watches
dcgmi health -g 2 -s a

To turn all watches off again, run dcgmi health -g 2 --clear.

  • Check the health of the group
dcgmi health -g 2 -c
+---------------------------+----------------------------------------------------------+
| Health Monitor Report                                                                |
+===========================+==========================================================+
| Overall Health            | Healthy                                                  |
+---------------------------+----------------------------------------------------------+

When no incidents are listed below Overall Health, the group is healthy.
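
For periodic checks it helps to reduce the report to a single word. A sketch that extracts the Overall Health value, assuming the two-column table layout shown above; overall_health is a hypothetical helper name:

```shell
# Read `dcgmi health -c` output on stdin and print the Overall Health value.
overall_health() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    $2 ~ /Overall Health/ { print trim($3); exit }
  '
}

# Usage on a real host, for example from a cron job:
#   [ "$(dcgmi health -g 2 -c | overall_health)" = "Healthy" ] || echo "group 2 needs attention"
```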

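7.2 diag: diagnostics

diag covers the following checks:

  • Integrity of the environment, such as the NVML and CUDA libraries
  • User access permissions to the GPU devices
  • Driver or process conflicts
  • State of hardware components such as GPU memory and InfoROM
  • Configuration such as persistence mode and environment variables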
dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.9                                          |
| Driver Version Detected   | 570.158.01                                     |
| GPU Device IDs Detected   | 20f3,20f3,20f3,20f3,20f3,20f3,20f3,20f3        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+

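The -r flag selects one of four run levels, 1 to 4, each covering a wider set of tests:

Plugin            | Test name        | r1 (short, seconds) | r2 (medium, < 2 min) | r3 (long, < 30 min) | r4 (extra long, 1-2 h)
------------------+------------------+---------------------+----------------------+---------------------+-----------------------
Software          | software         | yes                 | yes                  | yes                 | yes
PCIe + NVLink     | pcie             |                     | yes                  | yes                 | yes
GPU memory        | memory           |                     | yes                  | yes                 | yes
Memory bandwidth  | memory_bandwidth |                     | yes                  | yes                 | yes
Diagnostic        | diagnostic       |                     |                      | yes                 | yes
Targeted stress   | targeted_stress  |                     |                      | yes                 | yes
Targeted power    | targeted_power   |                     |                      | yes                 | yes
NVBandwidth       | nvbandwidth      |                     |                      | yes                 | yes
Memory stress     | memtest          |                     |                      |                     | yes
Input EDPp        | pulse            |                     |                      |                     | yes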
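
The run levels are cumulative: each level runs everything from the level below plus its own tests. The sketch below encodes the table above as a lookup, which can help when estimating what a given -r value will exercise; diag_tests_for_level is a hypothetical helper and the lists simply mirror the table rather than querying DCGM:

```shell
# Print the test names covered by a diag run level, as listed in the table.
diag_tests_for_level() {
  case "$1" in
    1) echo "software" ;;
    2) echo "software pcie memory memory_bandwidth" ;;
    3) echo "software pcie memory memory_bandwidth diagnostic targeted_stress targeted_power nvbandwidth" ;;
    4) echo "software pcie memory memory_bandwidth diagnostic targeted_stress targeted_power nvbandwidth memtest pulse" ;;
    *) echo "unknown run level: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   diag_tests_for_level 2
```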

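If the GPUs are under load while the diagnostics run, some Warning messages may appear in the output.

8. System management

8.1 config: configuration management

  • View the configuration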
dcgmi config -g 2 --get -v
+------------------------------+------------------------------+------------------------------+
| GPU ID: 0                                                                                  |
| NVIDIA H20                                                                                 |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | Not Specified                | Unrestricted                 |
| ECC Mode                     | Not Specified                | Enabled                      |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 2619                         |
| SM Application Clock         | Not Specified                | 1980                         |
| Power Limit                  | Not Specified                | 500                          |
+------------------------------+------------------------------+------------------------------+
+------------------------------+------------------------------+------------------------------+
| GPU ID: 1                                                                                  |
| NVIDIA H20                                                                                 |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | Not Specified                | Unrestricted                 |
| ECC Mode                     | Not Specified                | Enabled                      |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 2619                         |
| SM Application Clock         | Not Specified                | 1980                         |
| Power Limit                  | Not Specified                | 500                          |
+------------------------------+------------------------------+------------------------------+
  • Change the configuration

Limit power to 499 W:

dcgmi config -g 2 --set -P 499

Enforce the configuration:

dcgmi config -g 2 --enforce
  • View the configuration again
dcgmi config -g 2 --get -v | grep "Power Limit"
| Power Limit                  | 499                          | 499                          |
| Power Limit                  | 499                          | 499                          |
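
After --enforce, the Target and Current columns should agree. A sketch that verifies this for the Power Limit row, assuming the three-column table layout shown above and treating "Not Specified" as "no target set"; power_limit_applied is a hypothetical helper name:

```shell
# Read `dcgmi config --get -v` output on stdin.
# Succeeds when every Power Limit row with a target has Current == Target.
power_limit_applied() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    $2 ~ /Power Limit/ {
      target = trim($3); current = trim($4)
      if (target != "Not Specified" && target != current) bad++
    }
    END { exit (bad > 0) }
  '
}

# Usage on a real host:
#   dcgmi config -g 2 --get -v | power_limit_applied || echo "power limit not applied"
```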

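8.2 policy: policy management

A policy defines the action to take when a specific event occurs, which allows anomalies to be handled automatically.

  • View the current policy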
dcgmi policy --get
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| DCGM_ALL_SUPPORTED_GPUS                                                      |
+=============================+================================================+
| Violation conditions        | None                                           |
| Isolation mode              | Automatic                                      |
| Action on violation         | None                                           |
| Validation after action     | None                                           |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
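Field                     | Meaning
--------------------------+---------------------------------------------------------------
Violation conditions      | None: no violation trigger is configured (temperature, power, ECC errors and so on are all unset)
Isolation mode            | Automatic: when a GPU misbehaves, DCGM decides on its own whether to isolate it (what actually happens still depends on the configured policy)
Action on violation       | None: no action is taken on a violation (for example, the GPU is not reset)
Validation after action   | None: no system validation runs after the action
Validation failure action | None: nothing happens if the validation fails

With the default configuration, DCGM takes no action on any GPU.

  • Add a policy

Action values: 0 takes no action; 1 resets the GPU.

Validation runs after the action: 0 skips validation; 1 checks basic health of the GPU cores and memory; 2 runs a thorough memory and core check; 3 runs a full hardware check.

Action and validation are given as a pair. For example, 0,0 means no action and no validation, while 1,2 means reset the GPU and then run the thorough memory and core check.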

dcgmi policy -g 2 --set 1,2 -T 100

When the GPU temperature exceeds 100 °C, DCGM applies the 1,2 policy.

Available trigger conditions: -e ECC double-bit errors, -x XID errors, -p PCIe replay errors, -n NVLink errors, -T temperature, -P power.

  • View the policy
dcgmi policy --get -v

Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| GPU ID: 0                                                                    |
+=============================+================================================+
| Violation conditions        | Max temperature threshold - 100                |
| Isolation mode              | Manual                                         |
| Action on violation         | Reset GPU                                      |
| Validation after action     | System Validation (Medium)                     |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
...

The policy for GPU 0 in group 2 now reads: when the temperature exceeds 100 °C, reset the GPU and run a medium validation.

8.3 modules: list loaded modules

dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Loaded                                           |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
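Module        | Purpose
--------------+---------------------------------------------------------------
Core          | GPU discovery and enumeration, base API, foundation for metric collection, dependency of the other modules
NvSwitch      | NVSwitch status and performance monitoring, management of the high-speed multi-GPU interconnect, DGX/HGX system support
VGPU          | vGPU instance resource monitoring, virtual GPU configuration management, performance data collection in virtualized environments
Introspection | DCGM internal state inspection, debugging and troubleshooting, monitoring of module interaction
Health        | GPU temperature, power, and fan monitoring, hardware anomaly detection, health reports and alerts, predictive maintenance
Policy        | Setting and enforcing GPU usage policies, power and performance limit management, resource quotas and access control
Config        | GPU configuration parameter management, applying and verifying configuration changes, configuration backup and restore, batch configuration
Diag          | Running GPU hardware tests, memory and compute unit tests, diagnostic report generation, fault localization
Profiling     | GPU performance metric collection, application performance analysis, PCIe bandwidth monitoring, NVIDIA event metrics
SysMon        | CPU, memory, and network monitoring, correlated system and GPU monitoring, full-stack performance analysis, system metric collection

Some modules are loaded and unloaded on demand. For example, running dcgmi diag loads the Diag module automatically.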
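
Since module state changes as commands run, it can be useful to capture which modules are loaded at a given moment. A sketch that filters the listing above, assuming the "Module ID | Name | State" column order shown; loaded_modules is a hypothetical helper name:

```shell
# Read `dcgmi modules -l` output on stdin and print the names of
# modules whose State column is exactly "Loaded", one per line.
loaded_modules() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($4) == "Loaded" { print trim($3) }
  '
}

# Usage on a real host:
#   dcgmi modules -l | loaded_modules
```

Comparing the State column for equality, rather than searching for the substring "loaded", keeps "Not loaded" rows out of the result.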

8.4 introspect: resource usage of the DCGM process

dcgmi introspect -s -H
+----------------------------------------------------------------------------+
| Introspection Information                                                  |
+============================================================================+

+----------------------------------------------------------------------------+
| Hostengine Process                                                         |
+============================================================================+
| Memory            | 75536.0 KB                                             |
| CPU Utilization   | 0.00 %                                                 |
+-------------------+--------------------------------------------------------+

The output shows the memory footprint and CPU utilization of the hostengine process.
