
ScaleLLM: Unlocking Llama2-13B LLM Inference on Consumer GPU RTX 4090, powered by FEDML Nexus AI
* Motivation
* System Design
* Performance Evaluation
* Introducing a metric for LLM inference on low-end GPUs
* Performance on RTX 4090 vs. vLLM on A100
* Performance of multiple services on a single A100
* Performance on L4 and T4
* FEDML Nexus AI Serverless Model Endpoint: Serving LLMs on decentralized spot GPU instances
* Unlock AI x