Intel Arc Pro B70 vs Nvidia RTX Pro 4000 vs AMD R9700: Local AI GPU Benchmarks with vLLM
TL;DR
Alex Ziskind benchmarks Intel's new Arc Pro B70 (32GB VRAM, under $1,000) against the Nvidia RTX Pro 4000 ($1,699, 24GB GDDR7) and AMD Radeon AI R9700 ($1,300, 32GB GDDR6) for local AI inference, image generation, and video generation. He then scales to four B70s (128GB total VRAM) and tests multi-GPU performance. The B70 delivers competitive or better performance than cards costing significantly more, though Intel's software stack remains a limiting factor.
The Hardware Lineup and Pricing
The video compares three professional-tier GPUs for local AI workloads:
**Intel Arc Pro B70**: 32GB VRAM, 68 GB/s memory bandwidth, under $1,000. Requires an external power cable (unlike the smaller B50 which draws power from PCIe alone).
**Nvidia RTX Pro 4000 Blackwell**: 24GB GDDR7, 672 GB/s memory bandwidth, $1,699 at MicroCenter (retail $1,999). Single-slot design with four DisplayPort outputs.
**AMD Radeon AI PRO R9700**: 32GB GDDR6, 640 GB/s memory bandwidth, approximately $1,300. Until the B70's release, this was the least expensive GPU offering 32GB of VRAM.
For context, a single Nvidia RTX 5090 also has 32GB VRAM but costs just under $4,000. The B70 comes in at the lowest price but also has the lowest memory bandwidth of the three cards tested.
B60 Baseline Results (Recap from Previous Video)
Before testing the B70, Alex recapped his B60 benchmarks running the Qwen 3 4B model at Q4_K_M quantization to establish baseline expectations. Performance varied dramatically depending on the software stack:
**SYCL (Intel-specific stack via Llama CPP):**
- Concurrency 1: 1,000 tokens/sec prompt processing, 66 tokens/sec generation
- Concurrency 4: 898 tokens/sec prompt processing, 83 tokens/sec generation
**Vulkan (cross-platform via Llama CPP):**
- Concurrency 1: 1,162 tokens/sec prompt processing, 44 tokens/sec generation
- Vulkan beat SYCL on prompt processing, but SYCL won on token generation
**vLLM (4-bit AWQ quantization):**
- Concurrency 1: 8,118 tokens/sec prompt processing, 67 tokens/sec generation
- Concurrency 4: 215 tokens/sec token generation
vLLM showed massively better prompt processing performance and significantly better concurrency scaling than Llama CPP, making it the clear winner for professional GPU workloads.
Benchmarking Setup and Tools
Alex used vLLM as the inference server for all head-to-head GPU comparisons (not Llama CPP). The primary model tested was Qwen 3 4B in both full BF16 and AWQ (Activation-aware Weight Quantization, 4-bit) versions.
For benchmarking, he used **Llama Beni** (by Yugger), an open-source benchmarking tool. He emphasized this is different from 'Llama Bench.' Key advantages of Llama Beni:
- Works with any server, not just Llama CPP (including vLLM)
- Allows prefilling the context, so you can benchmark against a populated context rather than an empty one (a minimal client sketch follows below)
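For readers who want to reproduce a similar measurement without Llama Beni, here is a minimal sketch of what such a tool does: send a request with a prefilled context to any OpenAI-compatible endpoint (vLLM exposes one) and derive a throughput figure from the response. The URL, model name, and synthetic prefill below are placeholder assumptions, and this lumps prompt processing and generation into one number rather than separating them the way a real benchmark does.

```python
# Minimal throughput probe against any OpenAI-compatible server (e.g. vLLM).
# Placeholder assumptions: server URL, model name, and the synthetic prefill text.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM endpoint
MODEL = "Qwen/Qwen3-4B"                             # assumed model name

prefill = "Background notes: " + ("lorem ipsum " * 2000)  # synthetic filled context

payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": prefill + "\n\nSummarize the notes above."},
    ],
    "max_tokens": 256,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

completion_tokens = resp["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"~ {completion_tokens / elapsed:.1f} tok/s (prompt processing + generation combined)")
```

A proper benchmarking tool additionally uses streaming to time the first token separately, which is how prompt-processing and generation rates get reported as distinct numbers.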
vLLM grabs as much VRAM as it can: whatever is left after loading the model weights is pre-allocated for KV cache. On the RTX Pro 4000 it consumed all 24GB; on the B70, all 32GB.
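That pre-allocation is controllable. A minimal sketch using vLLM's Python API, with the model name as a placeholder assumption; `gpu_memory_utilization` is the fraction of VRAM vLLM is allowed to claim (the default is 0.9):

```python
# Sketch: capping how much VRAM vLLM pre-allocates for weights plus KV cache.
# The model name is a placeholder assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-4B",        # placeholder model
    gpu_memory_utilization=0.80,   # claim at most 80% of VRAM (vLLM's default is 0.9)
    max_model_len=8192,            # a shorter max context also shrinks the KV-cache reservation
)
print(llm.generate(["Say hello."], SamplingParams(max_tokens=16))[0].outputs[0].text)
```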
He verified GPU power with nvidia-smi, which showed 145 watts available on the RTX Pro 4000.
B70 vs RTX Pro 4000: Full BF16 Qwen 3 4B
Running the full BF16 (unquantized) version of Qwen 3 4B on vLLM:
**Concurrency 1:**
- B70: 12,910 tokens/sec prompt processing, 56 tokens/sec generation
- RTX 4000: 11,745 tokens/sec prompt processing, 51 tokens/sec generation
- B70 wins on both metrics, which was unexpected since the RTX 4000 has significantly higher memory bandwidth (672 vs 68 GB/s)
Results were consistent on re-runs: 56 on the B70, 51 on the RTX 4000 for token generation. Prompt processing showed more variance on the B70 (second run hit 14,624). Time to first response was also faster on the B70.
**Concurrency 4:**
- B70: ~12,000 tokens/sec prompt processing, 194 tokens/sec generation
- RTX 4000: ~10,000 tokens/sec prompt processing, 173 tokens/sec generation
- B70 maintains its lead under load
The RTX 4000 has less VRAM (24 vs 32GB), costs almost twice as much, and delivered slightly lower performance on this model.
B70 vs RTX Pro 4000: AWQ 4-bit Quantization
Switching to AWQ (Activation-aware Weight Quantization), a 4-bit format with first-class support in vLLM, produced different results. AWQ analyzes activations to identify which weights matter most to model quality and protects those weights from aggressive quantization, unlike the uniform quantization used in Llama CPP or MLX.
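As a point of reference, loading an AWQ checkpoint in vLLM is a one-line change; a minimal sketch, with the repo name as a placeholder assumption:

```python
# Sketch: serving an AWQ (4-bit) checkpoint with vLLM's Python API.
# The repo name is a placeholder; use whichever AWQ export of the model you actually have.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-4B-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",          # select vLLM's AWQ kernels
    dtype="float16",             # AWQ kernels are typically paired with fp16 activations
)
out = llm.generate(["Why does quantization speed up decoding?"],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```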
**Concurrency 1:**
- B70: 9,825 tokens/sec prompt processing, 72 tokens/sec generation
- RTX 4000: 11,490 tokens/sec prompt processing, 89 tokens/sec generation
- Nvidia wins significantly with AWQ quantization
**Concurrency 4:**
- B70: 9,981 tokens/sec prompt processing, 236 tokens/sec generation
- RTX 4000: 10,470 tokens/sec prompt processing, 275 tokens/sec generation
- The gap widens further at higher concurrency
This demonstrates that performance rankings can flip depending on model and quantization format. The RTX 4000's higher memory bandwidth appears to give it an advantage specifically with AWQ quantized models.
B70 vs AMD R9700: Full BF16 and AWQ
The AMD R9700 uses the ROCm software stack. Alex noted that ROCm has had a 'rocky history' but is catching up, though neither Intel nor AMD matches Nvidia's software maturity.
**Qwen 3 4B BF16, Concurrency 1:**
- B70: 16,742 tokens/sec prompt processing, 56 tokens/sec generation
- R9700: 10,800 tokens/sec prompt processing, 43 tokens/sec generation
- B70 dominates despite the R9700 having higher memory bandwidth and costing $350 more
**Qwen 3 4B BF16, Concurrency 4:**
- B70: 12,564 tokens/sec prompt processing, 197 tokens/sec generation
- R9700: 9,879 tokens/sec prompt processing, 149 tokens/sec generation
**Qwen 3 4B AWQ, Concurrency 1:**
- B70: 9,281 tokens/sec prompt processing, 72 tokens/sec generation
- R9700: 13,733 tokens/sec prompt processing, 25 tokens/sec generation
- AMD had better prompt processing but catastrophically low token generation at 25 tokens/sec
**Qwen 3 4B AWQ, Concurrency 4:**
- B70: 10,300 tokens/sec prompt processing, 234 tokens/sec generation
- R9700: 8,900 tokens/sec prompt processing, 94 tokens/sec generation
Alex attributed the R9700's poor showing primarily to ROCm's software maturity, noting that Llama CPP might actually produce better results on this card than vLLM with ROCm.
Image and Video Generation: B70 vs R9700
Since both the B70 and R9700 have 32GB VRAM, Alex tested image and video generation workloads that require that much memory.
**Image Generation** using Qwen Image 2512 FP8 quantization at 1328x1328 resolution via ComfyUI:
- Both cards finished at approximately the same time
- At a smaller resolution: R9700 finished in 133 seconds, B70 in 147 seconds
However, there's an important caveat: the B70 was running ComfyUI 0.8.2 while the R9700 ran the much newer ComfyUI 0.18.1. The B70 is limited to the version bundled in Intel's **LLM Scaler vLLM Omni package**, which ships Intel-specific patches, custom nodes for XPU (the device label Intel's GPUs use in PyTorch), IPEX (Intel Extension for PyTorch), and a ComfyUI GGUF build with SYCL support, all preconfigured. AMD's ROCm, by contrast, plugs into PyTorch through its HIP/CUDA compatibility layer, so the R9700 can run the latest stock ComfyUI.
**Video Generation** using LTX2, 5-second video at 1280x720:
- B70: 144 seconds
- R9700: 169 seconds
- B70 wins by about 25 seconds
Four B70s: Multi-GPU Scaling (128GB Total VRAM)
Alex installed four B70s for 128GB total VRAM. All four cards ran at PCIe Gen 5. However, while per-GPU bandwidth is 68 GB/s, GPU-to-GPU PCIe communication is limited to 63 GB/s, which creates a bottleneck.
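For context, sharding a model across the four cards in vLLM is a one-argument change; a minimal sketch, with the model name again a placeholder assumption:

```python
# Sketch: tensor-parallel inference across four GPUs with vLLM.
# Every decode step now involves cross-GPU all-reduces, which travel over PCIe on this build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-4B",   # placeholder; for a model this small, TP=4 mostly adds communication cost
    tensor_parallel_size=4,   # shard the weights and KV cache across 4 GPUs
)
out = llm.generate(["Explain tensor parallelism in two sentences."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```

That added per-token communication cost is consistent with the generation numbers below.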
**Qwen 3 4B AWQ (same model as the single-GPU tests):**
- Concurrency 1: 18,170 tokens/sec prompt processing (nearly doubled from 9,281), but token generation dropped from 72 to 52 tokens/sec
- Concurrency 4: ~18,000 prompt processing, 183 tokens/sec generation (down from 234 with single GPU)
**Qwen 3 4B BF16:**
- Concurrency 1: 31,700 tokens/sec prompt processing (impressive for large context workloads)
- Concurrency 4: 172 tokens/sec generation (down from 197 with single GPU)
The pattern is clear: prompt processing scales well with more GPUs, but token generation actually decreases for models that fit on a single GPU. This is due to the PCIe communication overhead between GPUs. Multi-GPU only makes sense when you need the extra VRAM for larger models or larger context windows.
Four B70s with Larger Models: Qwen 3 Coder 30B A3B
To properly utilize 128GB of VRAM, Alex tested Qwen 3 Coder 30B A3B Instruct — a 30 billion parameter mixture-of-experts model with 3 billion active parameters. vLLM showed 100% GPU utilization across all four cards and consumed nearly all available VRAM (the extra space is used for KV cache and context).
**Concurrency 1:**
- 19,296 tokens/sec prompt processing
- 28 tokens/sec generation
Alex noted that 30-32 billion parameter models are a practical sweet spot for 128GB of VRAM. While you could technically run larger models, you'd lose headroom for KV cache and context. You need that extra VRAM beyond what the model weights consume.
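A rough budgeting calculation illustrates the point (assumed sizes: BF16 weights at 2 bytes per parameter, activations and framework overhead ignored):

```python
# Back-of-envelope VRAM budget for a 30B-parameter model on 4x 32GB cards.
# Assumptions: BF16 weights at 2 bytes per parameter; activations and overhead ignored.
total_vram_gb = 4 * 32                     # 128 GB across four B70s
params_b = 30                              # billions of parameters
weights_gb = params_b * 2                  # ~60 GB of BF16 weights
headroom_gb = total_vram_gb - weights_gb   # ~68 GB left for KV cache and context
print(f"weights ~= {weights_gb} GB, headroom for KV cache/context ~= {headroom_gb} GB")
```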
He demonstrated the model running in the Zed code editor, using it for both chat and agentic coding tasks. The model responded, but Alex noted it is 'an older model' and 'not the best model out there', a limitation of Intel's software support lagging behind the latest model releases. He tested it by generating a Python Plotly script and a web-application architecture design, showing decent GPU saturation on both memory and compute.
The Software Stack Problem
A recurring theme throughout all benchmarks: hardware capability is not the bottleneck — software support is. The key software stacks and their status as discussed:
**Nvidia (CUDA)**: The most mature ecosystem. Nvidia charges premium prices partly because their software layer is far ahead and has the broadest adoption.
**Intel (SYCL, IPEX, vLLM Omni)**: Capable hardware at great prices, but the software stack limits which models can run, which versions of tools like ComfyUI are available, and how quickly new models are supported. Intel needs to update model support more quickly.
**AMD (ROCm)**: Has had a historically rough software experience but is catching up. ROCm's advantage is that it works through PyTorch's HIP/CUDA compatibility layer, which means broader tool compatibility (like running the latest ComfyUI). However, inference performance on vLLM was poor compared to both Intel and Nvidia in these tests.
Alex emphasized checking model compatibility before purchasing — by the time you watch the video, newer models may or may not be supported on Intel's stack.
Key Takeaways
- The Intel Arc Pro B70 at under $1,000 with 32GB VRAM delivers competitive or better performance than the Nvidia RTX Pro 4000 ($1,699, 24GB) and AMD R9700 ($1,300, 32GB) in most vLLM benchmarks with Qwen 3 4B BF16, though Nvidia wins with AWQ quantization
- vLLM is the clear inference server choice for professional GPUs — it dramatically outperforms Llama CPP (SYCL and Vulkan backends) on prompt processing and concurrency scaling
- Multi-GPU scaling with B70s doubles prompt processing throughput but actually decreases token generation speed for models that fit on a single card, due to PCIe GPU-to-GPU communication bottleneck (63 GB/s limit)
- Performance rankings flip depending on quantization format — always benchmark your specific model and quantization combination rather than assuming one GPU universally wins
- Intel's biggest weakness is software, not hardware — limited model support, older ComfyUI versions, and slower updates for new models; check compatibility before buying
- For 128GB total VRAM (4x B70), 30-32B parameter models are the practical sweet spot — you need headroom beyond model weights for KV cache and context
- Use Llama Beni (not Llama Bench) for benchmarking — it works with any server including vLLM and supports prefilled context testing