
    This guide uses CentOS 7.7 with a Tesla P100 GPU as an example.

    1. Preparing the base environment

    • Install the lspci command
    yum install -y pciutils
    
    • Check whether the GPU supports CUDA
    lspci | grep -i nvidia
    
    00:09.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB] (rev a1)
    

    List of CUDA-capable GPUs: https://developer.nvidia.com/cuda-gpus
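The check above can also be scripted so that a provisioning script can branch on the result. The following is a minimal sketch; the function name has_nvidia_gpu is hypothetical, and it only confirms that an NVIDIA device is listed, not that the model appears in the CUDA GPU list.

```shell
# Sketch: decide from `lspci` output (read on stdin) whether an NVIDIA
# device is present. The function name has_nvidia_gpu is hypothetical.
has_nvidia_gpu() {
    # -i: case-insensitive match, -q: no output, the exit status is the answer
    grep -iq 'nvidia'
}

# Intended use on the host:
#   lspci | has_nvidia_gpu && echo "NVIDIA GPU found" || echo "no NVIDIA GPU"
```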

    • Check whether the operating system supports CUDA
    uname -m && cat /etc/redhat-release
    
    x86_64
    CentOS Linux release 7.7.1908 (Core)
    

    List of operating systems supported by CUDA: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements
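For an automated pre-check, the release string can be reduced to a version number. This is a rough sketch with hypothetical helper names (release_version, release_major); it only tests whether a host matches the platform used in this guide and does not replace the support matrix linked above.

```shell
# Sketch: extract the version from a redhat-release style line (stdin).
# release_version and release_major are hypothetical helper names.
release_version() {
    # Prints e.g. 7.7.1908 for "CentOS Linux release 7.7.1908 (Core)"
    sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p'
}

release_major() {
    # Strip everything from the first dot onward: 7.7.1908 -> 7
    printf '%s\n' "${1%%.*}"
}

# Intended use on the host:
#   v=$(release_version < /etc/redhat-release)
#   [ "$(uname -m)" = x86_64 ] && [ "$(release_major "$v")" = 7 ] \
#       && echo "same platform as this guide"
```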

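    • Install system packages
    yum update -y
    yum install -y wget vim gcc
    # If yum update installed a new kernel, reboot first so that $(uname -r) matches the installed kernel packages
    yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)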
    
    • Install Docker

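    Docker 19.03 or later is required, because the --gpus flag used in section 3 was introduced in that release.

    For installation steps, see: Installing a Specific Docker Version on CentOS 7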
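The version requirement can be checked before continuing. The sketch below assumes GNU sort with the -V option is available; version_ge is a hypothetical helper name.

```shell
# Sketch: succeed when version $1 is at least version $2.
# version_ge is a hypothetical helper; it relies on `sort -V` (GNU coreutils).
version_ge() {
    # In version order the smaller string sorts first, so $1 >= $2
    # exactly when $2 is the first line of the sorted pair.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Intended use on the host:
#   version_ge "$(docker version --format '{{.Server.Version}}')" 19.03 \
#       && echo "Docker is new enough for --gpus"
```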

    2. Installing the GPU driver and CUDA

    2.1 Disable the default nouveau driver

    Before blacklisting:

    lsmod | grep nouveau
    
    nouveau              1898794  0
    mxm_wmi                13021  1 nouveau
    wmi                    21636  2 mxm_wmi,nouveau
    video                  24538  1 nouveau
    i2c_algo_bit           13413  1 nouveau
    ttm                    96673  2 bochs_drm,nouveau
    drm_kms_helper        186531  2 bochs_drm,nouveau
    drm                   456166  5 ttm,bochs_drm,drm_kms_helper,nouveau
    

    Disable nouveau:

    bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
    bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
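
The same two-line file can be written in one step with a heredoc. This is only a sketch: write_nouveau_blacklist is a hypothetical helper, and the target path is a parameter so the function can be tried on a scratch file before touching /etc/modprobe.d.

```shell
# Sketch: write the nouveau blacklist with a heredoc instead of two echo calls.
# write_nouveau_blacklist is a hypothetical helper name.
write_nouveau_blacklist() {
    # Overwrites the target, so repeated runs leave the same two lines.
    cat > "$1" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}

# Intended use on the host (as root):
#   write_nouveau_blacklist /etc/modprobe.d/blacklist-nvidia-nouveau.conf
```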
    

    Rebuild the initramfs image:

    mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
    dracut /boot/initramfs-$(uname -r).img $(uname -r)
    

    Reboot the system. After blacklisting:

    lsmod | grep nouveau
    
    (no output)
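
In a script, the "no output" condition is easier to test through an exit status. The sketch below uses a hypothetical helper, nouveau_loaded, which matches only the module-name column of lsmod output.

```shell
# Sketch: read `lsmod` output on stdin and report whether nouveau is loaded.
# nouveau_loaded is a hypothetical helper name.
nouveau_loaded() {
    # The module name is the first column; exit 0 only if it equals "nouveau".
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Intended use on the host:
#   lsmod | nouveau_loaded && echo "nouveau is still loaded" || echo "nouveau is disabled"
```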
    

    2.2 Install the GPU driver

    There are two ways to install the driver:

    • Option 1: install the kmod-nvidia driver

    Add the ELRepo repository:

    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
    

    Install nvidia-detect:

    yum install -y nvidia-detect
    

    Check whether a matching kmod-nvidia version is available:

    nvidia-detect -v
    

    Install the kmod-nvidia driver:

    yum install -y kmod-nvidia
    
    • Option 2: download the driver from the NVIDIA website

    On the driver download page of the NVIDIA website, select the GPU model reported by lspci | grep -i nvidia.

    wget http://cn.download.nvidia.com/tesla/440.64.00/nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
    rpm -Uvh nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
    

    Alternatively, download and run the .run installer script:

    wget http://us.download.nvidia.com/tesla/440.64.00/NVIDIA-Linux-x86_64-440.64.00.run
    chmod +x NVIDIA-Linux-x86_64-440.64.00.run
    bash ./NVIDIA-Linux-x86_64-440.64.00.run
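
The directory and the file name in the download URL should carry the same driver version, so deriving both from one variable avoids a mismatch. A small sketch follows; driver_run_url is a hypothetical helper, and the URL layout is assumed from the link used in this section and may differ for other driver branches.

```shell
# Sketch: derive the .run installer URL from a single version string.
# driver_run_url is a hypothetical helper; the URL layout is an assumption
# based on the download link shown in this section.
driver_run_url() {
    printf 'http://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

# Intended use on the host:
#   url=$(driver_run_url 440.64.00)
#   wget "$url" && bash "./${url##*/}"
```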
    

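    2.3 Install CUDA

    On the NVIDIA developer cuda-toolkit-archive page, find the latest toolkit release and select your operating system as prompted. The commands below are the ones the page produced for CentOS 7.7: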

    wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
    sudo rpm -i cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
    sudo yum clean all
    sudo yum -y install nvidia-driver-latest-dkms cuda
    sudo yum -y install cuda-drivers
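
After the packages are installed, the toolkit binaries such as nvcc are typically not on PATH yet. The sketch below shows one way to add them; add_cuda_to_path is a hypothetical helper, and /usr/local/cuda-10.2 is assumed to be the install prefix used by the 10.2 packages.

```shell
# Sketch: prepend a CUDA toolkit's bin directory to PATH once.
# add_cuda_to_path is a hypothetical helper; the prefix passed in
# (/usr/local/cuda-10.2 below) is an assumed default location.
add_cuda_to_path() {
    cuda_bin="$1/bin"
    case ":$PATH:" in
        *":$cuda_bin:"*) ;;                       # already present, nothing to do
        *) PATH="$cuda_bin:$PATH"; export PATH ;;
    esac
}

# Intended use on the host:
#   add_cuda_to_path /usr/local/cuda-10.2 && nvcc --version
```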
    

    2.4 Verify the installation

    After rebooting the machine, check that the NVIDIA driver and CUDA were installed successfully.

    nvidia-smi
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
    | N/A   35C    P0    27W / 250W |      0MiB / 12198MiB |      6%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
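
For scripted checks, the versions can be read from the banner line of the output above. This is a sketch with hypothetical helper names (smi_driver_version, smi_cuda_version). Note that the CUDA version in the banner is the one reported by the driver, so it does not by itself confirm that the toolkit packages are installed.

```shell
# Sketch: pull the driver and CUDA versions out of nvidia-smi output (stdin).
# smi_driver_version and smi_cuda_version are hypothetical helper names.
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p'
}

smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p'
}

# Intended use on the host:
#   nvidia-smi | smi_driver_version
#   nvidia-smi | smi_cuda_version
```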
    

    3. Installing nvidia-docker

    nvidia-docker adds support for GPU acceleration inside Docker containers.

    • Install nvidia-docker
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    yum install -y nvidia-container-runtime nvidia-container-toolkit nvidia-docker2
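
The distribution variable above is built by sourcing /etc/os-release and concatenating ID and VERSION_ID. The sketch below isolates that step so the resulting string can be inspected before it is placed in the repository URL; distribution_from is a hypothetical helper name.

```shell
# Sketch: compute the repo "distribution" string from an os-release file.
# distribution_from is a hypothetical helper; the file path is a parameter
# so the logic can be tried on a sample file.
distribution_from() {
    # Source the file in a subshell so ID and VERSION_ID do not leak out.
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Intended use on the host:
#   distribution_from /etc/os-release
```

With ID="centos" and VERSION_ID="7", as on CentOS 7, the result is centos7.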
    
    • Add the new runtime

    Edit /etc/docker/daemon.json and add the following:

    {
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
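
If the file is created from a script, it is worth guarding against overwriting settings that are already there. A minimal sketch, assuming a hypothetical helper named write_nvidia_runtime that only handles the case where no daemon.json exists yet:

```shell
# Sketch: create daemon.json with the nvidia runtime, refusing to clobber an
# existing file. write_nvidia_runtime is a hypothetical helper name.
write_nvidia_runtime() {
    if [ -e "$1" ]; then
        echo "$1 already exists; merge the runtimes entry by hand" >&2
        return 1
    fi
    cat > "$1" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Intended use on the host (as root):
#   write_nvidia_runtime /etc/docker/daemon.json
```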
    
    • Restart Docker to apply the change
    systemctl restart docker
    
    • Verify the installation
    docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
    | N/A   36C    P0    26W / 250W |      0MiB / 12198MiB |      6%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
    docker run --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
    | N/A   36C    P0    26W / 250W |      0MiB / 12198MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
    nvidia-docker run nvidia/cuda:10.0-base nvidia-smi
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
    | N/A   35C    P0    26W / 250W |      0MiB / 12198MiB |      6%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
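
The three invocations above can be checked in one pass by testing their output for the nvidia-smi banner. The following is a sketch only; verify_smi is a hypothetical helper, and the --rm flag is added here so that the test containers are removed afterwards.

```shell
# Sketch: run a command and succeed only if its output contains the
# nvidia-smi banner. verify_smi is a hypothetical helper name.
verify_smi() {
    # The pipeline's exit status is grep's: 0 when the banner was seen.
    "$@" 2>/dev/null | grep -q 'NVIDIA-SMI'
}

# Intended use on the host:
#   verify_smi docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi && echo "--gpus ok"
#   verify_smi docker run --rm --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi && echo "--runtime ok"
#   verify_smi nvidia-docker run --rm nvidia/cuda:10.0-base nvidia-smi && echo "nvidia-docker ok"
```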
    

    4. References