Runtime Notes – Qwen 3 Coder 480B A35B Llama.cpp and ik_llama.cpp


Qwen 3 Coder 480B is a massive domain-specific coding LLM, an MoE with 480B total and 35B active parameters, and so far it has been hard for me to run optimally. On the quad-3090 GPU rig I have hit a maximum of 3.8 TPS, and at much less than the 1M context size. These notes are very much WIP and will be updated until I reach what I think is realistically achievable, which should be in the 6.5 TPS range. No system tuning (AMD Epyc 7702 / 512 GB DDR4 / quad RTX 3090) has been done so far.

Running cmd for ik_llama.cpp (functional, slow)

./ik_llama.cpp/build/bin/llama-server --model /root/hf/unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-IQ1_M/Qwen3-Coder-480B-A35B-Instruct-1M-UD-IQ1_M-00001-of-00004.gguf --threads 60 --ctx-size 1048576 --n-gpu-layers 60 --seed 42069 --prio 3 --temp 0.7 --min-p 0.0 --top-p 0.8 --top-k 20 --host 0.0.0.0 --batch-size 64 --flash-attn --cache-type-k q4_0 --cache-type-v q4_0 -ot ".ffn_(up|down)_exps.=CPU"

--flash-attn does not appear to improve throughput with this split, and it is verified as compiled in.
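The -ot (override-tensor) regex above is what keeps the huge MoE expert up/down projection weights in system RAM while everything else stays on the GPUs. A minimal sketch of which tensors it routes where, using hypothetical tensor names that follow the usual GGUF naming for MoE layers (exact names may differ for this model):

```python
import re

# Same pattern passed to -ot in the launch command; note the unescaped dots
# act as regex wildcards here but still match the literal dots in the names.
pattern = re.compile(r".ffn_(up|down)_exps.")

# Hypothetical layer-0 tensor names (GGUF-style, for illustration only).
tensors = [
    "blk.0.ffn_up_exps.weight",    # expert up projection
    "blk.0.ffn_down_exps.weight",  # expert down projection
    "blk.0.ffn_gate_exps.weight",  # expert gate projection
    "blk.0.attn_q.weight",         # attention query weights
]

for name in tensors:
    device = "CPU" if pattern.search(name) else "GPU"
    print(f"{name}: {device}")
```

Only the up/down expert tensors match and get pinned to CPU; the gate experts and attention tensors are left on GPU, which is why VRAM use stays bounded while the bulk of the 480B weights live in the 512 GB of DDR4.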

TODO: Organize setup notes and create a benchmark set for performance, isolating four variables.
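The four-variable sweep above could be sketched as a small client-side harness. This is an assumption-heavy sketch: the /completion endpoint and payload shape follow llama-server's HTTP API, the grid values are illustrative, and TPS is measured by wall clock rather than server-reported timings.

```python
import itertools
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, seconds: float) -> float:
    # Guard against a zero/negative elapsed time from clock granularity.
    return n_tokens / seconds if seconds > 0 else 0.0

# One axis per variable to isolate (values are illustrative, not a recipe):
# threads, batch size, KV cache quantization, flash attention on/off.
grid = list(itertools.product(
    [32, 60],          # --threads
    [64, 512],         # --batch-size
    ["q4_0", "f16"],   # --cache-type-k / --cache-type-v
    [True, False],     # --flash-attn
))

def bench_once(host: str = "http://localhost:8080", n_predict: int = 128) -> float:
    """Send one completion request to a running llama-server and time it."""
    payload = json.dumps({"prompt": "def fib(n):", "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        host + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return tokens_per_second(n_predict, time.time() - t0)
```

Each of the 16 grid points would need a server restart with the matching flags before calling bench_once a few times and averaging, since all four variables are launch-time options.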