Practical inferencing of open source models on mainstream GPU-accelerated OCI servers (2024/04/30)

https://blogs.oracle.com/cloud-infrastructure/post/practical-inferencing-open-source-models

By: Krishna Shanmughom | Master Principal Cloud Architect, Oracle

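With the huge worldwide demand for generative AI opportunities, planning the required compute capacity is critical. NVIDIA A100 and H100 Tensor Core GPUs deliver excellent performance for large-scale LLM deployments, and mainstream GPUs such as the T4, P100, and A10 can complement them for smaller-scale deployments.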

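Alongside the well-engineered Oracle Generative AI services, Oracle Cloud Infrastructure (OCI) also lets customers bring their own models (open source or custom) for inferencing on highly efficient OCI servers. When you run bring-your-own models purely on OCI, you might need to benchmark and optimize by running the LLMs on mainstream NVIDIA-accelerated OCI servers. This blog post describes how mainstream GPU-accelerated OCI servers (both bare metal and virtual machine) can run a wide range of inferencing scenarios with open source LLMs.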

Benchmarking parameters

The following parameters influence the inferencing test scenarios and results:

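Generative AI model specifications: model type and size
GPU specifications: GPU model and number of GPUs
CPU specifications: CPU type and number of CPUs
Maximum context window
Performance optimizations
- Quantized and unquantized models
- Different transformer implementations and optimizations, such as KV cache optimization, paged attention, and flash attention
Performance measured in tokens per second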
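
The tokens-per-second metric can be sketched as follows. This is a minimal illustration, not the harness behind the published numbers: `measure_throughput`, the `generate` callable, and `fake_generate` are hypothetical names, and the fake generator only stands in for a real engine call.

```python
import time
from typing import Callable, List


def measure_throughput(generate: Callable[[str], List[int]], prompt: str) -> float:
    """Return generated tokens per second for a single call to `generate`.

    `generate` is a hypothetical callable that takes a prompt and returns
    the list of generated token IDs; wrap the engine of your choice in it.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(tokens) / elapsed


def fake_generate(prompt: str) -> List[int]:
    """Stand-in for a real model: sleeps briefly, then returns 50 token IDs."""
    time.sleep(0.05)
    return list(range(50))


if __name__ == "__main__":
    rate = measure_throughput(fake_generate, "Hello")
    print(f"{rate:.1f} tokens/second")
```

Timing only the generation call keeps model loading out of the measurement; whether prompt processing time should count is a choice the post does not specify.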

Testing environment

The following server configurations are used for the benchmarking:

OCI server types and specifications

GPU accelerated bare metal
- Intel Xeon Platinum 8358 CPU @ 2.60GHz (128 cores)
- Four NVIDIA A10 Tensor Core GPUs, each with 24GB GDDR6 memory
- 1TB RAM
GPU accelerated VM
- Intel Xeon Platinum 8358 CPU @ 2.60GHz (60 cores)
- Two NVIDIA A10 GPUs, each with 24GB GDDR6 memory
- 480GB RAM
GPU accelerated Roving Edge Device (RED)
- Intel Xeon Gold 6230T CPU @ 2.10GHz (32 cores)
- One NVIDIA T4 GPU with 16GB GDDR6 memory
- 512GB RAM

The following LLM models (quantized and unquantized versions) are used for this benchmarking exercise:

Llama 2 models (7B, 13B, and 70B)
Llama 2 HF models (7B, 13B, and 70B)
Llama 3 models (8B and 70B)
Fin-llama-33B

Single-server, single-user inferencing tests

The following table shows the results of tests run on the Fin-llama model using llama.cpp on a single OCI bare metal server.

Table 1: Single-server, single-user inferencing tests on Fin-llama


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q2_K.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	29.2
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q3_K_L.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	28.2
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q3_K_M.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	29
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q3_K_S.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	28.4
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	30.9
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	29.2
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q4_K_M.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	28.5
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q4_K_S.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	28.6
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	29.2
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q5_0.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	27.7
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q5_K_M.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	27.6
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q5_K_S.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	28
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q6_K.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	25.1
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q8_0.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	23.5

The following table shows the results of tests run on Llama 2 models using llama.cpp on a single Oracle Roving Edge Device (RED) server.

Table 2: Single-server, single-user inferencing tests on Llama 2 on RED


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Llama-2-7b	llama.cpp, ggml-model-q4_0.gguf	llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co)	RED	T4	1	51.9
Llama-2-13b	llama.cpp, ggml-model-q4_0.gguf	https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf	RED	T4	1	28.6
Llama-2-70b	llama.cpp, ggml-model-q4_0.gguf	https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf	RED	T4	1	1.6
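
One way to read the sharp drop for the 70B model is a rough weights-only fit check against the 16 GB of the T4. The sketch below is an illustration under stated assumptions, not an explanation given in the post: it assumes nominal parameter counts, roughly 4.5 bits per weight for Q4_0 (4-bit values plus one 16-bit scale per block of 32 weights), and it ignores the KV cache and runtime buffers. The function names are hypothetical.

```python
T4_MEMORY_GB = 16.0          # from the RED configuration in this post
Q4_0_BITS_PER_WEIGHT = 4.5   # assumption: 4-bit values + one 16-bit scale per 32 weights


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def fits_on_gpu(params_billion: float, gpu_gb: float = T4_MEMORY_GB) -> bool:
    """True if the Q4_0 weights alone are smaller than the GPU memory."""
    return weights_gb(params_billion, Q4_0_BITS_PER_WEIGHT) < gpu_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        gb = weights_gb(size, Q4_0_BITS_PER_WEIGHT)
        print(f"Llama-2-{size}b Q4_0: ~{gb:.1f} GB, fits on one T4: {fits_on_gpu(size)}")
```

Under these assumptions the 7B and 13B weights fit on a single T4 while the 70B weights do not, which would be consistent with part of the 70B model running outside GPU memory; the post itself does not state the cause.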

The following table shows the results of tests run on quantized Llama 2 70B models using llama.cpp on a single OCI bare metal server.

Table 3: Single-server, single-user inferencing of quantized Llama 2 70B models on an OCI bare metal server


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Llama-2-70B-Chat-GPTQ	llama.cpp, gptq	TheBloke/Llama-2-70B-Chat-GPTQ at main (huggingface.co)	BM with 4 A10s	A10	4	11.2
Llama-2-70B-Chat-GPTQ	llama.cpp, gptq	TheBloke/Llama-2-70B-Chat-GPTQ at gptq-3bit--1g-actorder_True (huggingface.co)	BM with 4 A10s	A10	4	10.5
Llama-2-70B-Chat-AWQ	llama.cpp,AWQ	TheBloke/Llama-2-70B-Chat-AWQ · Hugging Face	BM with 4 A10s	A10	4	13.6
Llama-2-70B-Chat-GGUF	llama.cpp, GGUF, llama-2-70b-chat.Q3_K_L.gguf	TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	17.5
Llama-2-70B-Chat-GGUF	llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf	TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	19.2
Llama-2-70B-Chat-GGUF	llama.cpp, GGUF, llama-2-70b-chat.Q4_K_M.gguf	TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	17.9
Llama-2-70B-Chat-GGUF	llama.cpp, GGUF, llama-2-70b-chat.Q5_K_M.gguf	TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	16.8

Single-server, multiuser concurrency inferencing tests

The following table shows the results of tests run on Llama 2 models using llama.cpp for concurrent users on a single OCI bare metal server.

Table 4: Single-server, multiuser inferencing of Llama on an OCI bare metal server


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator Type	Number of GPUs	Concurrent users	Throughput across all GPUs (tokens/second)
Llama-2-70B-Chat-GGUF	llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf	TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	5	10.3
Llama-2-70B-Chat-GGUF	llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf	TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	10	8.7
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	5	22
fin-llama-33B-GGUF	llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf	TheBloke/fin-llama-33B-GGUF at main (huggingface.co)	BM with 4 A10s	A10	4	10	10.2

Distributed inferencing results on multiple servers

The following table shows the results of tests run on quantized Llama 2 models using llama.cpp with Message Passing Interface (MPI) across four OCI RED servers.

Table 5: Distributed inferencing of quantized Llama 2 on multiple OCI RED servers


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Llama-2-7b MPI Run	llama.cpp, ggml-model-q4_0.gguf	llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co)	4 REDs	T4	4	52.2
Llama-2-13b MPI Run	llama.cpp, ggml-model-q4_0.gguf	https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf	4 REDs	T4	4	28.7
Llama-2-70b MPI Run	llama.cpp, ggml-model-q4_0.gguf	https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf	4 REDs	T4	4	1.6
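
To compare the four-device MPI runs with the single-device runs in Table 2, speedup and parallel efficiency can be computed directly from the published numbers. This is a small sketch; it assumes both tables measure the same single-user workload, which the post does not state explicitly, and the function names are illustrative.

```python
# Throughput in tokens/second, copied from Table 2 (one RED) and Table 5 (four REDs).
SINGLE_RED = {"7b": 51.9, "13b": 28.6, "70b": 1.6}
FOUR_REDS = {"7b": 52.2, "13b": 28.7, "70b": 1.6}
NUM_DEVICES = 4


def speedup(single: float, multi: float) -> float:
    """Throughput of the multi-device run relative to the single-device run."""
    return multi / single


def parallel_efficiency(single: float, multi: float, devices: int) -> float:
    """Speedup divided by device count; 1.0 would be perfect linear scaling."""
    return speedup(single, multi) / devices


if __name__ == "__main__":
    for size in SINGLE_RED:
        s = speedup(SINGLE_RED[size], FOUR_REDS[size])
        e = parallel_efficiency(SINGLE_RED[size], FOUR_REDS[size], NUM_DEVICES)
        print(f"Llama-2-{size}: speedup {s:.3f}x, efficiency {e:.2f}")
```

On these figures the single-stream throughput is essentially unchanged across four devices, so for this setup the extra devices add capacity rather than per-request speed.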

Memory calculation for the unquantized Llama 70B model

The following memory calculation applies when running the unquantized Llama transformer model on A10s:

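Model type: Llama
Model size: 70B
Total memory requirement: 70B x 2 bytes (16-bit) = 140 GB
Memory of one A10 GPU = 24 GB
Memory of eight A10 GPUs = 8 x 24 GB = 192 GB in total, of which 160 GB is counted as usable (excluding the GPU memory overhead on each A10 GPU)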

Based on this calculation, the unquantized Llama 70B model can run on two OCI bare metal servers with eight A10 GPUs in total, using a distributed inferencing framework such as torchrun, Ray, or MPI.
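
The sizing arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the usable memory of 20 GB per A10 is derived from the 160 GB figure for eight GPUs, the parameter count is the nominal 70B, and only the weights are counted (no KV cache or activations). The helper names are hypothetical.

```python
import math

PARAMS = 70e9               # nominal parameter count of Llama 70B
BYTES_PER_PARAM = 2         # 16-bit weights
USABLE_GB_PER_A10 = 20.0    # assumption: 160 GB / 8 GPUs, i.e. 24 GB minus overhead
A10S_PER_BM_SERVER = 4      # from the bare metal configuration in this post


def model_memory_gb(params: float, bytes_per_param: int) -> float:
    """Weights-only memory requirement in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def gpus_needed(memory_gb: float, usable_gb_per_gpu: float) -> int:
    """Smallest number of GPUs whose usable memory covers the weights."""
    return math.ceil(memory_gb / usable_gb_per_gpu)


def servers_needed(gpus: int, gpus_per_server: int) -> int:
    """Smallest number of servers that provides at least `gpus` GPUs."""
    return math.ceil(gpus / gpus_per_server)


if __name__ == "__main__":
    mem = model_memory_gb(PARAMS, BYTES_PER_PARAM)
    gpus = gpus_needed(mem, USABLE_GB_PER_A10)
    servers = servers_needed(gpus, A10S_PER_BM_SERVER)
    print(f"{mem:.0f} GB of weights -> at least {gpus} A10s -> {servers} bare metal servers")
```

With these inputs the weights need at least seven A10s, which rounds up to two four-GPU bare metal servers, matching the eight-GPU configuration used in the tests.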

Table 6: Distributed inferencing of the unquantized Llama 2 70B model on two OCI bare metal servers


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Llama-2-70b	Llama 2, 70B model, torchrun	GitHub - meta-llama/llama: Inference code for Llama models	2 BM Servers	A10s	8	8.8

Figure 1: Inference run of the unquantized Llama 70B model on two bare metal servers with eight A10s

The following table shows the test results for the unquantized Llama 70B model using four VM servers with two A10s each.

Table 7: Distributed inferencing of the unquantized Llama 70B model on four OCI VM servers


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Llama-2-70b	Llama 2, 70B model, torchrun	GitHub - meta-llama/llama: Inference code for Llama models	4 VM Servers	A10s	8	4

The following table shows the test results for Llama 2 models using vLLM (PagedAttention) on two bare metal servers with four A10s each.

Table 8: Distributed inferencing of Llama using vLLM on two OCI bare metal servers


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Llama-2-7b	vLLM/PagedAttention/Ray	GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs	2 BM Servers	A10s	8	30.1
Llama-2-13b	vLLM/PagedAttention/Ray	GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs	2 BM Servers	A10s	8	27.3
Llama-2-70b	vLLM/PagedAttention/Ray	GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs	2 BM Servers	A10s	8	12.9

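The following table shows the results of Llama 3 runs using a distributed inferencing framework such as torchrun, Ray, or MPI. The 70B models run on two BM servers with eight A10 GPUs in total, and the 8B models run on a single A10 GPU.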

Table 9: Distributed inferencing of Llama 3 using the Transformer model on multiple OCI bare metal servers


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Meta-Llama-3-70B	llama 3, 70B model, torchrun	https://github.com/meta-llama/llama3/tree/main	2 BM Servers	A10s	8	12.44
Meta-Llama-3-70B-Instruct	llama 3, 70B model, torchrun	https://github.com/meta-llama/llama3/tree/main	2 BM Servers	A10s	8	12.24
Meta-Llama-3-8B	llama 3, 8B model, torchrun	https://github.com/meta-llama/llama3/tree/main	1 BM Server	A10	1	27.10
Meta-Llama-3-8B-Instruct	llama 3, 8B model, torchrun	https://github.com/meta-llama/llama3/tree/main	1 BM Server	A10	1	27.04

The following table shows the test results for Llama 3 models using vLLM (PagedAttention) on two bare metal servers with four A10s each.

Table 10: Distributed inferencing of Llama 3 using vLLM on two OCI bare metal servers


Model type	Transformer model, quantization	Deployment config	OCI instance type	Accelerator type	Number of GPUs	Throughput across all GPUs (tokens/second)
Meta-Llama-3-8B	vLLM/PagedAttention/Ray	GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs	2 BM Servers	A10s	8	24.61
Meta-Llama-3-70B	vLLM/PagedAttention/Ray	GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs	2 BM Servers	A10s	8	11.23

The following figure summarizes the inferencing performance of unquantized Llama 2 and Llama 3 models with Transformer and vLLM on A10-accelerated OCI VM and BM servers.

Figure 2: Inferencing of unquantized Llama 70B models on OCI BM and VM servers
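
The 70B results from Tables 6 to 10 can also be put side by side numerically. The snippet below is a small sketch that expresses each configuration relative to a chosen baseline; the dictionary keys and the `relative_to` helper are illustrative names, and the values are copied from the tables in this post.

```python
# Throughput (tokens/second) of unquantized 70B models, copied from Tables 6 to 10.
RESULTS_70B = {
    ("Llama 2", "torchrun", "2 BM"): 8.8,
    ("Llama 2", "torchrun", "4 VM"): 4.0,
    ("Llama 2", "vLLM", "2 BM"): 12.9,
    ("Llama 3", "torchrun", "2 BM"): 12.44,
    ("Llama 3", "vLLM", "2 BM"): 11.23,
}


def relative_to(baseline_key, results=RESULTS_70B):
    """Return each configuration's throughput as a multiple of the baseline."""
    base = results[baseline_key]
    return {key: value / base for key, value in results.items()}


if __name__ == "__main__":
    baseline = ("Llama 2", "torchrun", "2 BM")
    for (model, engine, servers), ratio in relative_to(baseline).items():
        print(f"{model:8s} {engine:9s} {servers}: {ratio:.2f}x")
```

Taking the Llama 2 torchrun run on two bare metal servers as the baseline makes the gap to the four-VM layout and the gain from vLLM easy to read off, though the ratios inherit whatever run-to-run variation the original measurements had.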

Conclusion

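The benchmarking exercises above show that mainstream GPU-accelerated OCI servers (such as those with A10s) can be used for inferencing with open source large language models (LLMs) of different sizes. When larger-scale deployments need the best performance, OCI offers advanced NVIDIA GPUs deployed with NVIDIA TensorRT-LLM, which deliver great results as shown in the recent MLPerf Inference v4.0 benchmarks. Depending on the requirements and scale of the solution, you can start with smaller LLMs, such as 7B and 13B models, on mainstream GPU-accelerated servers and then migrate to larger clusters with advanced GPUs (such as A100s and H100s) as demand and model size increase. This scaling helps customers adopt generative AI solutions quickly.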

Acknowledgments

The author wants to thank Mohan Srinivasan, Sreedhara Narayanaswamy, Ram Sivaram, and Hiten Goradia for their guidance, leadership, and support in this endeavor. The author also wants to thank James George for his expertise in setting up the MPI cluster on Oracle Roving Edge Devices (RED).

Disclaimer

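The benchmarking exercises presented in this post are for general guidance only. Individual test results can vary depending on the model size, testing parameters, performance techniques, and the hardware and software stack used.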

References

For more information, see the following links:

Oracle Generative AI Solutions: https://www.oracle.com/artificial-intelligence/generative-ai/

Oracle GPU accelerated BareMetal Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu

Oracle GPU accelerated VM Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu

Oracle Roving Edge Servers: https://www.oracle.com/a/ocom/docs/data-sheet-roving-edge-device.pdf

NVIDIA A10 GPUs: https://www.nvidia.com/en-au/data-center/products/a10-gpu/

LLaMA CPP Source Code: https://github.com/ggerganov/llama.cpp

Meta LLaMA2: meta-llama/llama: Inference code for Llama models (github.com)

Meta LLaMA3: meta-llama/llama3: The official Meta Llama 3 GitHub site