NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer

Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.

Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has delivered remarkable inference throughput for Llama 3.1 405B since the model’s release, achieved through optimizations including in-flight batching, KV caching, and optimized attention kernels. These techniques accelerate inference while keeping computation in lower precision.
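
As context for how these features are exposed, the sketch below uses TensorRT-LLM’s high-level Python LLM API, in which the runtime manages in-flight batching and the paged KV cache. The checkpoint name and tensor-parallel degree are illustrative assumptions, not details from the article.

```python
# Illustrative sketch: serving Llama 3.1 405B through TensorRT-LLM's
# high-level LLM API. In-flight batching and paged KV caching are handled
# by the runtime; the checkpoint and parallelism settings are assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed HF checkpoint id
    tensor_parallel_size=8,                      # shard across 8 H200 GPUs
)

prompts = ["Summarize what in-flight batching does."]
params = SamplingParams(max_tokens=64, temperature=0.0)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```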

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
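
To make the distinction concrete: FP8 (E4M3) can represent magnitudes up to 448, so a scaling factor maps a tensor’s value range onto that budget. A static factor is fixed ahead of time from calibration data, while a dynamic factor is recomputed from the live tensor. The toy sketch below illustrates the idea only; it is not the recipe’s actual implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def static_fp8_scale(calibration_batches):
    # Static: fix the scale from the largest |x| seen during calibration,
    # so no range statistics are needed at inference time.
    amax = max(float(np.abs(x).max()) for x in calibration_batches)
    return amax / FP8_E4M3_MAX

def dynamic_fp8_scale(x):
    # Dynamic: recompute the scale from the live tensor, tracking its
    # actual range at the cost of an extra reduction per call.
    return float(np.abs(x).max()) / FP8_E4M3_MAX

# Quantize as x_fp8 = cast_to_fp8(x / scale); dequantize as x ≈ x_fp8 * scale.
```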

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
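
A minimal sketch of how such an FP8 PTQ recipe is typically driven through the Model Optimizer library (nvidia-modelopt) follows. The mtq.quantize pattern matches the library’s documented usage; the model and calibration dataloader are caller-supplied placeholders, and the exact configuration behind the published numbers is not spelled out in the article.

```python
import modelopt.torch.quantization as mtq

def quantize_llama_fp8(model, calib_dataloader):
    """Apply Model Optimizer's default FP8 PTQ config to a torch module.

    `model` and `calib_dataloader` are placeholders supplied by the caller;
    the dataloader should yield a few hundred representative batches.
    """
    def forward_loop(m):
        # Calibration pass: run sample data through the model so activation
        # (and KV cache) ranges can be collected for static scaling factors.
        for batch in calib_dataloader:
            m(batch)

    return mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```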

Table 1 demonstrates the maximum throughput performance, showing significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.

Maximum Throughput Performance – Output Tokens/Second (8 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths    2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8       463.1          320.1             71.5
Official Llama FP8 Recipe          399.9          230.8             49.6
Speedup                            1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements
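
As a quick sanity check, the Speedup row is simply the ratio of the two throughput rows:

```python
# Reproduce the Speedup row of Table 1 from the two throughput rows.
optimizer_fp8 = [463.1, 320.1, 71.5]   # TensorRT Model Optimizer FP8
official_fp8 = [399.9, 230.8, 49.6]    # Official Llama FP8 Recipe

for opt, ref in zip(optimizer_fp8, official_fp8):
    print(f"{opt / ref:.2f}x")  # prints 1.16x, 1.39x, 1.44x
```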

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.

Batch Size = 1 Performance – Output Tokens/Second (8 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths    2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8       49.6           44.2              27.2
Official Llama FP8 Recipe          37.4           33.1              22.8
Speedup                            1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

These results indicate that H200 GPUs running TensorRT-LLM with the TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method reduces the required memory footprint significantly by compressing the weights down to 4-bit integers while encoding activations using FP16.
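
The memory arithmetic checks out: 405 billion parameters at 4 bits each is roughly 203 GB of weights, comfortably within the 282 GB (2 × 141 GB) of HBM3e on two H200 GPUs, with headroom left for activations and the KV cache, whereas 8-bit weights alone would already need about 405 GB. A sketch of the typical Model Optimizer INT4 AWQ flow follows, with the same placeholder model and dataloader caveats as the FP8 example above.

```python
import modelopt.torch.quantization as mtq

def quantize_llama_int4_awq(model, calib_dataloader):
    """Weight-only INT4 AWQ via Model Optimizer (placeholder inputs).

    INT4_AWQ_CFG compresses weights to 4-bit integers using activation-aware
    scaling, while activations stay in higher precision (e.g. FP16).
    """
    def forward_loop(m):
        # AWQ uses calibration activations to choose per-channel scales
        # that protect the most activation-salient weights.
        for batch in calib_dataloader:
            m(batch)

    return mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```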

Tables 4 and 5 present the maximum throughput and minimum latency measurements; the INT4 AWQ method also delivers accuracy scores comparable to Meta’s official Llama 3.1 FP8 recipe.

Maximum Throughput Performance – Output Tokens/Second (2 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

Batch Size = 1 Performance – Output Tokens/Second (2 NVIDIA H200 Tensor Core GPUs)

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for enhanced performance and efficiency in running large language models like Llama 3.1 405B. These improvements offer developers more flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.

Image source: Shutterstock


