> Venture capitalists & private investors are sucking all of the possible growth and future upside from these companies and then dumping them on retail investors when there's nothing left.
A lot of the money that is deployed by VCs comes from pension funds and asset managers that ultimately manage money for the average Joe.
I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.
So let's say a draft model generates 5 tokens. All 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever), but as long as the 5 forward passes of the draft model + 1 prefill of the target model are faster than 4 forward passes of the target, you get a speedup while keeping exactly the same output distribution as the target.
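As a rough sketch of the loop described above (not any particular library's implementation), here is a toy greedy variant in Python. `target_next` and `draft_next` are made-up stand-ins for the two models, and the verification loop stands in for what would be a single batched forward pass in a real transformer. With sampling instead of greedy decoding, the accept step would be a rejection-sampling rule, which this sketch does not implement.

```python
from typing import List

VOCAB = 50  # hypothetical vocabulary size for the toy models


def target_next(prefix: List[int]) -> int:
    # Hypothetical "large" model: a fixed rule standing in for the target's argmax.
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_next(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when
    # len(prefix) is a multiple of 4, where it guesses wrong on purpose.
    tok = target_next(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % VOCAB


def speculative_step(prefix: List[int], k: int) -> List[int]:
    # 1. Draft k tokens autoregressively (k cheap forward passes).
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. Verify. A real transformer would produce these k + 1 predictions in
    #    one parallel forward pass; this toy version just loops.
    target_preds = [target_next(prefix + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix of the draft that matches the target, then
    #    append the target's own token at the first mismatch (or a bonus token
    #    if every drafted token was accepted).
    n = 0
    while n < k and draft[n] == target_preds[n]:
        n += 1
    return draft[:n] + [target_preds[n]]


def generate_speculative(prompt: List[int], n_tokens: int, k: int = 5) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out += speculative_step(out, k)
    return out[: len(prompt) + n_tokens]


def generate_target_only(prompt: List[int], n_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out
```

Because a drafted token is only accepted when it equals the target's own greedy choice, `generate_speculative` returns the same tokens as `generate_target_only`. The draft model only changes how many target passes are needed, not what comes out.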
From what I understand, in practice it often is true[1]:
Matrix multiplication should be “independent” along every element in the batch — neither the other elements in the batch nor how large the batch is should affect the computation results of a specific element in the batch. However, as we can observe empirically, this isn’t true.
In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
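To make the batch-size point a bit more concrete, here is a small Python sketch of the underlying floating-point effect: the same numbers added in a different grouping can round differently. The `chunked_sum` helper and the chunk sizes are illustrative assumptions only; real GPU kernels choose their reduction strategy in far more involved ways.

```python
import random

# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def naive_sum(values):
    # Plain left-to-right accumulation, with no compensation tricks.
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk):
    # Reduce fixed-size chunks first, then add up the partial results.
    # A different chunk size means a different order of additions.
    partials = [naive_sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return naive_sum(partials)


random.seed(0)
values = [random.uniform(-1.0, 1.0) * 10 ** random.randint(-6, 6) for _ in range(10_000)]

# Same numbers, different reduction splits; the trailing digits may disagree.
for chunk in (1, 8, 64, 512):
    print(chunk, repr(chunked_sum(values, chunk)))
```

If a kernel's reduction split depends on how many requests happen to be batched together, this kind of low-order difference can show up in the logits for an otherwise identical request.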
"But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first."
> With prompt caching, verbose context that gets reused is basically free.
But it's not. It might be discounted cost-wise, but a long reused prefix still degrades attention quality and makes generation slower and more computationally expensive, even though you can skip recomputing it during prefill.
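A back-of-the-envelope way to see this is the toy cost model below, in Python. It only counts how many key/value positions attention has to read, and the function name and the numbers are made up for illustration. Caching removes the prefill term, but every generated token still attends over the whole prefix.

```python
def attention_reads(prefix_len: int, new_tokens: int, prefix_cached: bool):
    """Return (prefill_reads, decode_reads) under a simple causal-attention model."""
    # Prefill: token j of the prefix attends to j positions (itself and earlier).
    # This part is skipped entirely when the prefix KV cache is reused.
    prefill = 0 if prefix_cached else prefix_len * (prefix_len + 1) // 2

    # Decode: the i-th generated token (0-indexed) attends to the full prefix,
    # the i tokens generated before it, and itself.
    decode = sum(prefix_len + i + 1 for i in range(new_tokens))
    return prefill, decode


# A verbose 10,000-token cached prefix versus a terse 100-token one,
# each followed by 100 generated tokens.
print(attention_reads(10_000, 100, prefix_cached=True))  # (0, 1005050)
print(attention_reads(100, 100, prefix_cached=True))     # (0, 15050)
```

Under this model the cached verbose prefix still makes decoding read roughly 67 times more positions than the terse one, which is the part caching does not make free.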
> Tradition warrants a negotiation phase when one party wishes to change the terms of an agreement, or becomes cognizant that the counterparty may wish to do the same.
They didn't change the agreement. One party violated it, and the other party withdrew as a result.
This is so vanilla. But people will moan because they want subsidized tokens.
I don't have a pony in this race, my good poster. I just calls it how I see it, and I have a long history of calling out the fundamentally abusive character of non-negotiable one-way contracting and the ill effects it has on society.
The only people moaning here seem to be a bunch of wannabe Google POs upset that people are handing machines a data construct they are designed to accept, and the machines are accepting it and using the token the way it was designed to be used. For some reason Google appears to resent that its lack of automated checks to deny those OAuth tokens is being taken advantage of, and seems to think terminating customers who could probably be corrected with a simple message is the most reasonable response.
With instincts like that, it makes me happy every day that, for my needs, I can make do with doing things on my own hardware that I've collected over the years. The Cloud has too much drama potential tied up in it.