Using a Lustre Runtime with Fluid, with Performance Tests

1. How Fluid mounts NFS storage

Inspect the Fuse Pod

kubectl get pod nfs-demo-fuse-f9wg8 -oyaml
apiVersion: v1
kind: Pod
metadata:
  generateName: nfs-demo-fuse-
spec:
  containers:
  - command:
    - /usr/local/bin/entrypoint.sh
    env:
    - name: FLUID_RUNTIME_TYPE
      value: thin
    - name: FLUID_RUNTIME_NS
      value: default
    - name: FLUID_RUNTIME_NAME
      value: nfs-demo
    - name: MOUNT_POINT
      value: /runtime-mnt/thin/default/nfs-demo/thin-fuse
    - name: MOUNT_OPTIONS
      value: ro
    image: fluidcloudnative/nfs:v0.1
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - sh
          - -c
          - umount /runtime-mnt/thin/default/nfs-demo/thin-fuse
    name: thin-fuse
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/fluid/config.json
      name: thin-conf
      readOnly: true
      subPath: config.json
    - mountPath: /etc/fluid/runtime.json
      name: runtime
      readOnly: true
      subPath: runtime.json
  hostNetwork: true

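After the container starts, the Fuse Pod mounts the storage directory on the node; before it stops, the preStop hook unmounts that directory.

Inspect the configuration files injected by Fluid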

cat  /etc/fluid/config.json
{"mounts":[{"mountPoint":"x.x.x.x:/x-x","name":"nfs-demo"}],"targetPath":"/runtime-mnt/thin/default/nfs-demo/thin-fuse","accessModes":["ReadOnlyMany"]}

cat  /etc/fluid/runtime.json
{"workers":[],"fuses":[]}

Fluid injects the mountPoint settings from the Dataset into the Fuse container by mounting them as JSON files.
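As a rough illustration of what a Fuse image can read from that injected file, the sketch below parses the sample config.json content shown above. The function name `read_mount_settings` and the keys of the returned dictionary are made up for this example; only the JSON field names come from the file itself.

```python
import json

# Sample content copied from the config.json shown above (the address is a placeholder).
RAW_CONFIG = (
    '{"mounts":[{"mountPoint":"x.x.x.x:/x-x","name":"nfs-demo"}],'
    '"targetPath":"/runtime-mnt/thin/default/nfs-demo/thin-fuse",'
    '"accessModes":["ReadOnlyMany"]}'
)


def read_mount_settings(raw: str) -> dict:
    """Extract the fields a mount script needs (hypothetical helper)."""
    cfg = json.loads(raw)
    first_mount = cfg["mounts"][0]  # only the first mount entry is used here
    return {
        "source": first_mount["mountPoint"],         # what to mount
        "target": cfg["targetPath"],                 # where to mount it on the node
        "access_modes": cfg.get("accessModes", []),  # e.g. ["ReadOnlyMany"]
    }


settings = read_mount_settings(RAW_CONFIG)
print(settings["source"], "->", settings["target"])
```

In a real container the string would be read from /etc/fluid/config.json instead of a constant.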

Inspect the Fuse startup script

cat /usr/local/bin/entrypoint.sh
#!/usr/bin/env bash
set +x

python3 /fluid_config_init.py

chmod u+x /mount-nfs.sh

bash /mount-nfs.sh

cat fluid_config_init.py
#!/usr/bin/env python

import json

rawStr = ""
with open("/etc/fluid/config.json", "r") as f:
    rawStr = f.readlines()

rawStr = rawStr[0]

script = """
#!/bin/sh
set -ex
MNT_FROM=$mountPoint
MNT_TO=$targetPath


trap "umount ${MNT_TO}" SIGTERM
mkdir -p ${MNT_TO}
mount -t nfs ${MNT_FROM} ${MNT_TO}
sleep inf
"""

obj = json.loads(rawStr)

with open("mount-nfs.sh", "w") as f:
    f.write("mountPoint=\"%s\"\n" % obj['mounts'][0]['mountPoint'])
    f.write("targetPath=\"%s\"\n" % obj['targetPath'])
    f.write(script)

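When the Fuse Pod starts, it parses config.json, generates the mount-nfs.sh script, and runs it.

2. Building the Fluid Lustre Runtime image

The analysis above shows that a file storage service attached with a plain mount command only needs a matching Fuse image to be managed by Fluid.

Create the fluid_config_init.py script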

#!/usr/bin/env python

import json

rawStr = ""
with open("/etc/fluid/config.json", "r") as f:
    rawStr = f.readlines()

rawStr = rawStr[0]

script = """
#!/bin/sh
set -ex
MNT_FROM=$mountPoint
MNT_TO=$targetPath

trap "umount ${MNT_TO}" SIGTERM
mkdir -p ${MNT_TO}
mount -t lustre -o relatime,flock ${MNT_FROM} ${MNT_TO}
sleep inf
"""

obj = json.loads(rawStr)

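# Write to an absolute path so entrypoint.sh finds /mount-lustre.sh regardless of the working directory.
with open("/mount-lustre.sh", "w") as f: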
    f.write('mountPoint="%s"\n' % obj["mounts"][0]["mountPoint"])
    f.write('targetPath="%s"\n' % obj["targetPath"])
    f.write(script)

Only the mount command needs to change compared with the NFS version.
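Since the NFS and Lustre scripts differ only in the mount line, the generation step can be sketched as one function parameterized by file system type and options. This is an illustrative sketch, not part of either image: `render_mount_script` and `SCRIPT_TEMPLATE` are hypothetical names, and the template simply mirrors the shell script shown above.

```python
import shlex

# Mirrors the generated script above; doubled braces survive str.format as ${...}.
SCRIPT_TEMPLATE = """#!/bin/sh
set -ex
MNT_FROM={source}
MNT_TO={target}

trap "umount ${{MNT_TO}}" SIGTERM
mkdir -p ${{MNT_TO}}
mount -t {fs_type}{options} ${{MNT_FROM}} ${{MNT_TO}}
sleep inf
"""


def render_mount_script(fs_type, source, target, options=()):
    """Return the text of a mount script for the given file system type."""
    opts = " -o " + ",".join(options) if options else ""
    return SCRIPT_TEMPLATE.format(
        fs_type=fs_type,
        options=opts,
        source=shlex.quote(source),  # quote values before embedding them in shell
        target=shlex.quote(target),
    )


nfs_script = render_mount_script("nfs", "x.x.x.x:/x-x", "/mnt/nfs-demo")
lustre_script = render_mount_script(
    "lustre",
    "fs-x.fsx.us-west-2.amazonaws.com@tcp:/x",
    "/mnt/lustre-demo",
    options=("relatime", "flock"),
)
print(lustre_script)
```

The target paths here are placeholders; in the Fuse Pod they would come from targetPath in config.json.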

Create the startup script entrypoint.sh

#!/usr/bin/env bash
set +x

python /fluid_config_init.py
chmod u+x /mount-lustre.sh
bash /mount-lustre.sh

Create a Dockerfile to build the image

FROM amazon/aws-fsx-csi-driver:8e204f0ab565dd116bc39699391b6d642a3ae900
COPY ./fluid_config_init.py /
COPY ./entrypoint.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/entrypoint.sh
ENTRYPOINT []

Build and push the image

docker build -t shaowenchen/demo:fluid-lustre . --push

3. Connecting Lustre to Fluid

Create the Dataset

kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: lustre-demo
spec:
  mounts:
  - mountPoint: fs-x.fsx.us-west-2.amazonaws.com@tcp:/x
    name: lustre-demo
EOF

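Note the mountPoint here: if you need to mount a subdirectory such as subdir, create it in advance. In production, multiple PVCs may share one Lustre backend.

The format for mounting a subdirectory is: fs-x.fsx.us-west-2.amazonaws.com@tcp:/x/subdir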
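To make the structure of that mountPoint string explicit, here is a small sketch that splits it into its parts. The function `split_lustre_mount_point` and the labels `server`, `mount_name`, and `subdir` are names chosen for this example only.

```python
def split_lustre_mount_point(mount_point: str) -> dict:
    """Split '<server>@<network>:/<mount name>[/<subdir>]' into its parts."""
    server, sep, path = mount_point.partition(":/")
    if not sep or not path:
        raise ValueError(f"not a Lustre mount point: {mount_point!r}")
    mount_name, _, subdir = path.partition("/")
    return {
        "server": server,          # e.g. fs-x.fsx.us-west-2.amazonaws.com@tcp
        "mount_name": mount_name,  # first path component after ':/'
        "subdir": subdir,          # empty string when the root is mounted
    }


root = split_lustre_mount_point("fs-x.fsx.us-west-2.amazonaws.com@tcp:/x")
sub = split_lustre_mount_point("fs-x.fsx.us-west-2.amazonaws.com@tcp:/x/subdir")
print(root)
print(sub)
```

A check like this could be used to validate Dataset manifests before applying them, for example to confirm that several Datasets point at different subdirectories of the same backend.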

Create the Runtime

kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: ThinRuntimeProfile
metadata:
  name: lustre
spec:
  fileSystemType: lustre
  fuse:
    image: shaowenchen/demo
    imageTag: fluid-lustre
    imagePullPolicy: Always
    command:
      - "/usr/local/bin/entrypoint.sh"
EOF

kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: ThinRuntime
metadata:
  name: lustre-demo
spec:
  profileName: lustre
EOF

Create the Pod

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: lustre-demo
spec:
  containers:
    - name: lustre-demo
      image: shaowenchen/demo:ubuntu
      volumeMounts:
        - mountPath: /data
          name: lustre-demo
  volumes:
    - name: lustre-demo
      persistentVolumeClaim:
        claimName: lustre-demo
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
EOF

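4. Performance tests

The figure below shows the FSx for Lustre specification provisioned on AWS.

4.1 Sequential read test with Lustre mounted directly on the host

Install lustre-client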

wget -O - https://fsx-lustre-client-repo-public-keys.s3.amazonaws.com/fsx-ubuntu-public-key.asc | gpg --dearmor | sudo tee /usr/share/keyrings/fsx-ubuntu-public-key.gpg >/dev/null
bash -c 'echo "deb [signed-by=/usr/share/keyrings/fsx-ubuntu-public-key.gpg] https://fsx-lustre-client-repo.s3.amazonaws.com/ubuntu focal main" > /etc/apt/sources.list.d/fsxlustreclientrepo.list && apt-get update'
apt install -y linux-aws lustre-client-modules-$(uname -r)

Note that the machine must be restarted with sudo reboot afterwards.

Reference: https://docs.aws.amazon.com/zh_cn/fsx/latest/LustreGuide/install-lustre-client.html

Run the test

fio -direct=1 -iodepth=32 -rw=read -ioengine=libaio -bs=4m -size=10g -numjobs=1 -runtime=1000 -group_reporting -filename=testfile --allow_mounted_write=1 -name=Sequ_Read_Testing

Sequ_Read_Testing: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][96.7%][r=368MiB/s][r=92 IOPS][eta 00m:01s]
Sequ_Read_Testing: (groupid=0, jobs=1): err= 0: pid=74694: Thu May 16 18:40:31 2024
  read: IOPS=86, BW=347MiB/s (364MB/s)(10.0GiB/29479msec)
    slat (msec): min=7, max=260, avg=11.51, stdev=14.19
    clat (usec): min=17, max=2646.5k, avg=344495.25, stdev=192586.04
     lat (msec): min=11, max=2655, avg=356.01, stdev=197.77
    clat percentiles (msec):
     |  1.00th=[  284],  5.00th=[  305], 10.00th=[  309], 20.00th=[  313],
     | 30.00th=[  326], 40.00th=[  330], 50.00th=[  334], 60.00th=[  334],
     | 70.00th=[  338], 80.00th=[  338], 90.00th=[  342], 95.00th=[  347],
     | 99.00th=[  435], 99.50th=[ 2601], 99.90th=[ 2635], 99.95th=[ 2635],
     | 99.99th=[ 2635]
   bw (  KiB/s): min=303104, max=417792, per=100.00%, avg=384223.08, stdev=21823.92, samples=53
   iops        : min=   74, max=  102, avg=93.77, stdev= 5.35, samples=53
  lat (usec)   : 20=0.04%
  lat (msec)   : 20=0.04%, 50=0.12%, 100=0.16%, 250=0.55%, 500=98.16%
  lat (msec)   : 750=0.08%, 1000=0.04%, 2000=0.16%, >=2000=0.66%
  cpu          : usr=0.04%, sys=6.67%, ctx=2614, majf=0, minf=32779
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=98.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2560,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=347MiB/s (364MB/s), 347MiB/s-347MiB/s (364MB/s-364MB/s), io=10.0GiB (10.7GB), run=29479-29479msec

4.2 Sequential read test inside the Pod

Enter the Pod

kubectl exec -it lustre-demo -- bash

cd /data

Run the test

fio -direct=1 -iodepth=32 -rw=read -ioengine=libaio -bs=4m -size=10g -numjobs=1 -runtime=1000 -group_reporting -filename=testfile --allow_mounted_write=1 -name=Sequ_Read_Testing

Sequ_Read_TestingA: (g=0): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=364MiB/s][r=91 IOPS][eta 00m:00s]
Sequ_Read_TestingA: (groupid=0, jobs=1): err= 0: pid=563: Thu May 16 18:11:28 2024
  read: IOPS=78, BW=315MiB/s (330MB/s)(10.0GiB/32490msec)
    slat (msec): min=7, max=261, avg=12.69, stdev=21.49
    clat (usec): min=3, max=5356.2k, avg=357490.74, stdev=319869.00
     lat (msec): min=10, max=5386, avg=370.18, stdev=334.81
    clat percentiles (msec):
     |  1.00th=[  288],  5.00th=[  309], 10.00th=[  313], 20.00th=[  317],
     | 30.00th=[  326], 40.00th=[  334], 50.00th=[  334], 60.00th=[  338],
     | 70.00th=[  342], 80.00th=[  347], 90.00th=[  347], 95.00th=[  351],
     | 99.00th=[  355], 99.50th=[ 3171], 99.90th=[ 5336], 99.95th=[ 5336],
     | 99.99th=[ 5336]
   bw (  KiB/s): min=98304, max=417792, per=100.00%, avg=376955.83, stdev=41482.14, samples=54
   iops        : min=   24, max=  102, avg=92.00, stdev=10.13, samples=54
  lat (usec)   : 4=0.04%
  lat (msec)   : 20=0.04%, 50=0.12%, 100=0.16%, 250=0.51%, 500=98.24%
  lat (msec)   : 750=0.04%, 1000=0.04%, 2000=0.16%, >=2000=0.66%
  cpu          : usr=0.05%, sys=6.02%, ctx=2757, majf=0, minf=32780
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=0.6%, 32=98.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=2560,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=315MiB/s (330MB/s), 315MiB/s-315MiB/s (330MB/s-330MB/s), io=10.0GiB (10.7GB), run=32490-32490msec

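With Fluid, a ThinRuntime PVC loses little performance compared with mounting directly on the host: 315 MiB/s versus 347 MiB/s in the runs above, roughly 9% lower. Note that the block size and the amount of data read strongly affect the result. Reading only 1g of data raises sequential read throughput to 500+ MB/s, while a block size of 128k drops it to only 100+ MB/s. The test parameters therefore need to match the actual workload for the evaluation to be accurate.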
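The comparison can be reproduced from the two fio summaries with simple arithmetic. The sketch below uses only the amount of data read (10 GiB) and the run times fio reported (29479 ms on the host, 32490 ms in the Pod); the helper name `bandwidth_mib_s` is an arbitrary choice for this example.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def bandwidth_mib_s(bytes_read: int, runtime_ms: int) -> float:
    """Average throughput in MiB/s for a run that read bytes_read in runtime_ms."""
    return bytes_read / MIB / (runtime_ms / 1000)


host = bandwidth_mib_s(10 * GIB, 29479)  # direct mount on the host
pod = bandwidth_mib_s(10 * GIB, 32490)   # ThinRuntime PVC inside the Pod
drop_pct = (host - pod) / host * 100

print(f"host {host:.0f} MiB/s, pod {pod:.0f} MiB/s, drop {drop_pct:.1f}%")
# prints: host 347 MiB/s, pod 315 MiB/s, drop 9.3%
```

Both figures come from a single run each, so the difference should be read as an indication rather than a precise overhead.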

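5. Summary

Our model inference services, originally deployed in China, recently needed to be deployed overseas. We chose AWS FSx for Lustre as the storage backend, but to keep the storage logic consistent at the business layer, Lustre had to be connected to Fluid.

After a model is uploaded from China to S3, it is synchronized to Lustre automatically.

Fluid has supported Lustre since its early versions, but the Fluid community provides no detailed documentation or demo for it, so this article records how we connected Lustre through Fluid's ThinRuntime.

Because we only store inference models, and model data usually consists of large files, the performance tests covered sequential read speed only. With the specification we chose, reads from the PVC reach 300+ MB/s.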