Invest In Crypto News

NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer

by CryptoExpert
August 29, 2024
in Blockchain News

Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.





Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has delivered high inference throughput for Llama 3.1 405B since the model’s release, through optimizations that include in-flight batching, KV caching, and optimized attention kernels. Together, these techniques accelerate inference while computing in lower precision.

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
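
To make the idea of a scaling factor concrete, here is a minimal sketch of per-tensor FP8 (E4M3) quantization in plain Python. It is an illustration only, not the official Llama recipe or any TensorRT-LLM code: the function names are hypothetical, and the assumption is that a scale maps the largest observed magnitude onto 448, the largest finite E4M3 value. A static scale is computed once from calibration data, while a dynamic scale is recomputed from each input at run time.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest FP8 E4M3 value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    _, exp = math.frexp(abs(v))   # abs(v) = m * 2**exp with m in [0.5, 1)
    e = max(exp - 1, -6)          # unbiased exponent, clamped to the smallest normal exponent
    step = 2.0 ** (e - 3)         # 3 mantissa bits give 8 steps per power of two
    return math.copysign(round(abs(v) / step) * step, v)


def amax_scale(values):
    """Scale that maps the largest magnitude in `values` onto FP8_E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round onto the FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    """Multiply back by the scale to approximate the original values."""
    return [q * scale for q in quantized]


# Static scaling: the scale comes from calibration data gathered ahead of time.
calibration = [0.5, -2.0, 1.5]
static = amax_scale(calibration)

# Dynamic scaling: the scale comes from the tensor being quantized right now.
activations = [0.1, -0.7, 0.3]
dynamic = amax_scale(activations)

print(dequantize(quantize(activations, static), static))
print(dequantize(quantize(activations, dynamic), dynamic))
```

In this sketch a static scale avoids any run-time reduction over the tensor, while a dynamic scale adapts to each input; values outside the calibrated range saturate when the static scale is used.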

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
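
A rough calculation shows why the precision of the KV cache matters at long sequence lengths. The sketch below assumes the publicly released Llama 3.1 405B configuration (126 layers, 8 key-value heads, head dimension 128); these values are not stated in this article, and the function name is hypothetical. It counts only the cached keys and values and ignores scaling factors and allocator overhead.

```python
# Assumed model configuration (not given in the article).
NUM_LAYERS = 126
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(num_tokens: int, bytes_per_value: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens of one sequence."""
    values_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM  # 2 = one key plus one value
    return num_tokens * values_per_token * bytes_per_value


# Longest case in the article: 120,000 input tokens plus 2,048 output tokens.
tokens = 120_000 + 2_048
fp16_gb = kv_cache_bytes(tokens, 2) / 1e9
fp8_gb = kv_cache_bytes(tokens, 1) / 1e9
print(f"FP16 KV cache: {fp16_gb:.1f} GB, FP8 KV cache: {fp8_gb:.1f} GB per sequence")
```

Under these assumptions, storing the cache in FP8 halves its size relative to FP16, which leaves more memory for batching additional sequences.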

Table 1 demonstrates the maximum throughput performance, showing significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.




Maximum Throughput Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths    2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8       463.1          320.1             71.5
Official Llama FP8 Recipe          399.9          230.8             49.6
Speedup                            1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements
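
The speedup row in Table 1 is the ratio of the two throughput rows. As a quick check, the short sketch below recomputes it from the published figures; the dictionary layout and function name are illustrative choices, not part of any NVIDIA tooling.

```python
# Output tokens/second from Table 1: (TensorRT Model Optimizer FP8, official Llama FP8 recipe)
table_1 = {
    "2,048 | 128": (463.1, 399.9),
    "32,768 | 2,048": (320.1, 230.8),
    "120,000 | 2,048": (71.5, 49.6),
}


def speedup(optimized: float, baseline: float) -> float:
    """Throughput ratio of the optimized recipe over the baseline recipe."""
    return optimized / baseline


for lengths, (optimized, baseline) in table_1.items():
    print(f"{lengths}: {speedup(optimized, baseline):.2f}x")
```

The gain grows with input length in this table, from 1.16x at 2,048 input tokens to 1.44x at 120,000.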

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.




Batch Size = 1 Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths    2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8       49.6           44.2              27.2
Official Llama FP8 Recipe          37.4           33.1              22.8
Speedup                            1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer outperform the official Llama FP8 recipe in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method reduces the required memory footprint significantly by compressing the weights down to 4-bit integers while encoding activations using FP16.
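
A back-of-the-envelope calculation shows why 4-bit weights make the two-GPU configuration possible. The sketch below uses the 405 billion parameter count implied by the model name and the 141 GB of memory per H200 stated above. It counts weights only and ignores quantization scaling factors, the KV cache, and activation memory, so it gives a lower bound on the real footprint; the function names are hypothetical.

```python
NUM_PARAMS = 405e9       # parameter count implied by the name "405B"
H200_MEMORY_GB = 141     # HBM3e memory per H200 GPU, as stated in the article


def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def weights_fit(num_params: float, bits_per_weight: int, num_gpus: int) -> bool:
    """True if the weights alone fit in the combined memory of `num_gpus` H200 GPUs."""
    return weight_footprint_gb(num_params, bits_per_weight) <= num_gpus * H200_MEMORY_GB


for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    size = weight_footprint_gb(NUM_PARAMS, bits)
    print(f"{name}: {size:.1f} GB of weights, fits on 2 GPUs: {weights_fit(NUM_PARAMS, bits, 2)}")
```

With these figures, only the 4-bit weights come in under the 282 GB available across two GPUs, and the remaining memory must still hold the KV cache and activations.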

Tables 4 and 5 show the maximum throughput and minimum latency performance measurements. According to NVIDIA, the INT4 AWQ method also provides accuracy scores comparable to the official Llama 3.1 FP8 recipe from Meta.




Maximum Throughput Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements




Batch Size = 1 Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for enhanced performance and efficiency in running large language models like Llama 3.1 405B. These improvements offer developers more flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.

Image source: Shutterstock



© Copyright 2024 InvestInCryptoNews.com
