I think fine-tuning still matters for production problems where you need deterministic, auditable behavior, or where you have to reliably reduce hallucinations that clever prompting alone cannot eliminate. In my experience the best pragmatic approach is parameter-efficient tuning, e.g. LoRA or QLoRA with bitsandbytes 4-bit training to keep costs down, paired with a RAG layer over a FAISS vector index so you don't stuff the model context and blow your token budget. I've found that managing a few tuned adapters and a small ops pipeline is a simpler, cheaper long-term tradeoff than endless prompt gymnastics, and it saves you from praying to the prompt gods every time requirements creep.
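Since I'm pitching LoRA above: the core low-rank trick is small enough to sketch in plain numpy. This is a toy with made-up shapes, not a real model, just the update rule that PEFT-style libraries implement under the hood:

```python
import numpy as np

# Toy LoRA sketch (hypothetical shapes): the frozen base weight W stays
# fixed; only the low-rank factors A and B are trained.
d_out, d_in, r = 8, 8, 2      # rank r << min(d_out, d_in) keeps it cheap
alpha = 16                    # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init 0

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; since B is zero at
    # init, the adapted model starts out identical to the base model.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # no drift before training

# Trainable params: r * (d_in + d_out) = 32, vs d_in * d_out = 64 for
# full fine-tuning of this one matrix; the gap grows with layer size.
```

In a real QLoRA setup W would be the 4-bit quantized base weight and only A and B get gradients, which is why a handful of adapters over one base model is so cheap to serve.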
This time even Unsloth could not provide bitsandbytes 4-bit models: bitsandbytes does not support newer models with MoE and linear attention, and it's much less flexible than GGUF. These days I think it's better to train a LoRA over a GGUF base model; see the discussion at https://github.com/huggingface/transformers/issues/40070
I'll find some time to do this, and I hope someone beats me to it.