I felt like I finally understood Shannon entropy when I realized that it's a subjective quantity -- a property of the observer, not the observed.

The entropy of a variable X is the amount of information required to drive the observer's uncertainty about the value of X to zero. As a corollary, your uncertainty and mine about the value of the same variable X could be different. This is trivially true, as we could each have received different information about X. H(X) should be H_{observer}(X), or even better, H_{observer, time}(X).

As clear as Shannon's work is in other respects, he glosses over this.



What's often lost in the discussions about whether entropy is subjective or objective is that, if you dig a little deeper, information theory gives you powerful tools for relating the objective and the subjective.

Consider the cross entropy of two distributions, H[p, q] = -Σ p_i log q_i. For example, maybe p is the real frequency distribution over outcomes from rolling some dice, and q is your belief distribution. You can see the p_i as representing the objective probabilities (which you could estimate by actually rolling the dice) and the q_i as your subjective probabilities. The cross entropy measures something like how surprised you are, on average, when you observe an outcome.

The interesting thing is that H[p, p] <= H[p, q], which means that if your belief distribution is wrong, your cross entropy will be higher than it would be if you had the right beliefs, q=p. This is guaranteed by the concavity of the logarithm. This gives you a way to compare beliefs: whichever q gets the lowest H[p,q] is closer to the truth.

You can even break cross entropy into two parts, corresponding to two kinds of uncertainty: H[p, q] = H[p] + D[p||q]. The first term is the entropy of p: the aleatoric uncertainty, the inherent randomness in the phenomenon you are trying to model. The second term is the KL divergence, and it tells you how much additional uncertainty you have as a result of having wrong beliefs, which you could call epistemic uncertainty.
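
A minimal numerical sketch of that decomposition (Python/NumPy; the belief numbers are made up purely for illustration):

    import numpy as np

    p = np.array([1/6] * 6)                              # "objective" die probabilities
    q = np.array([0.25, 0.15, 0.15, 0.15, 0.15, 0.15])   # hypothetical wrong beliefs

    H_p  = -np.sum(p * np.log2(p))       # entropy of p: aleatoric uncertainty
    H_pq = -np.sum(p * np.log2(q))       # cross entropy H[p, q]
    D_pq = np.sum(p * np.log2(p / q))    # KL divergence D[p || q]: epistemic part

    print(np.isclose(H_pq, H_p + D_pq))  # True: H[p, q] = H[p] + D[p || q]
    print(H_pq >= H_p)                   # True: wrong beliefs only add surprise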


Thanks, that's an interesting perspective. It also highlights one of the weak points in the concept, I think, which is that this is only a tool for updating beliefs to the extent that the underlying probability space ("ontology" in this analogy) can actually "model" the phenomenon correctly!

It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).


This kind of thinking will lead you to ideas like algorithmic probability, where distributions are defined using universal Turing machines that could model anything.


Amazing! I had actually heard about Solomonoff induction before but my brain didn't make the connection. Thanks for the shortcut =)


You can sort of do this over a suitably large (or infinite) family of models all mixed, but from an epistemological POV that’s pretty unsatisfying.

From a practical POV it’s pretty useful and common (if you allow it to describe non- and semi-parametric models too).


Couldn't you just add a control (PID/Kalman filter/etc.) to converge on a stability of some local "most" truth?


Could you elaborate? To be honest I have no idea what that means.


I think what you're getting at is the construction of the sample space - the space of outcomes over which we define the probability measure (e.g. {H,T} for a coin, or {1,2,3,4,5,6} for a die).

Let's consider two possibilities:

1. Our sample space is "incomplete"

2. Our sample space is too "coarse"

Let's discuss 1 first. Imagine I have a special die with a hidden binary state which I can control, and which forces the die to come up either even or odd. If your sample space is only which side faces up, and I randomize the hidden state appropriately, it appears like a normal die. If your sample space is enlarged to include the hidden state, the entropy of each roll is reduced by one bit. You will not be able to distinguish between a truly random die and a die with a hidden state if your sample space is incomplete. Is this the point you were making?

On 2: Now let's imagine I can only observe whether the die comes up even or odd. This is a coarse-graining of the sample space (we get strictly less information - or, we only get some "macro" information). Of course, a coarse-grained sample space is necessarily an incomplete one! We can imagine comparing the outcomes from a normal die to a die which rolls even or odd with equal probability, but cycles through the microstates within each parity deterministically: an equal chance of {odd, even}, but given that outcome, it always goes to the next face in the sequence (1->3->5 or 2->4->6).
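
Here's a quick simulation of that comparison (Python; the cyclic die is my own toy construction):

    import random
    from collections import Counter

    random.seed(0)  # just to make the run repeatable

    def fair_die(n):
        return [random.randint(1, 6) for _ in range(n)]

    def cyclic_die(n):
        # parity is random on each roll, but the face within each parity
        # advances deterministically: 1 -> 3 -> 5 -> 1 and 2 -> 4 -> 6 -> 2
        odd, even = [1, 3, 5], [2, 4, 6]
        i, j, out = 0, 0, []
        for _ in range(n):
            if random.random() < 0.5:
                out.append(odd[i]); i = (i + 1) % 3
            else:
                out.append(even[j]); j = (j + 1) % 3
        return out

    n = 100_000
    for name, rolls in [("fair", fair_die(n)), ("cyclic", cyclic_die(n))]:
        parity = Counter("odd" if r % 2 else "even" for r in rolls)
        print(name, {k: round(v / n, 3) for k, v in parity.items()})
    # Both come out roughly 50/50 odd/even: on the coarse-grained sample
    # space the two processes are indistinguishable, even though one of
    # them is fully predictable given the hidden cycle position.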

Incomplete or coarse sample spaces can indeed prevent us from inferring the underlying dynamics. Radically different underlying processes can produce the same apparent entropy on our sample space.


Right, this is exactly what I'm getting at - learning a distribution over a fixed sample space can be done with Bayesian methods, or entropy-based methods like the OP suggested, but I'm wondering if there are methods that can automatically adjust the sample space as well.

For well-defined mathematical problems like dice rolling and fixed classical mechanics scenarios and such, you don't need this I guess, but for any real-world problem I imagine half the problem is figuring out a good sample space to begin with. This kind of thing must have been studied already, I just don't know what to look for!

There are some analogies to algorithms like NEAT, which automatically evolves a neural network architecture while training. But that's obviously a very different context.


We could discuss completeness of the sample space, and we can also discuss completeness of the hypothesis space.

In Solomonoff Induction, which purports to be a theory of universal inductive inference, the "complete hypothesis space" consists of all computable programs (note that all current physical theories are computable, so this hypothesis space is very general). Induction is then performed by keeping all programs consistent with the observations, weighted by two terms: the program's prior likelihood, and the probability that program assigns to the observations (a program can be deterministic and assign probability 1).

The "prior likelihood" in Solomonoff Induction is the program's complexity (well, 2^(-Complexity), where the complexity is the length of the shortest representation of that program.

Altogether, the procedure looks like: maintain a belief which is a mixture of all programs consistent with the observations, weighted by their complexity and the likelihood they assign to the data. Of course, this procedure is still limited by the sample/observation space!
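
Here's a toy sketch of that procedure in Python. It uses a tiny hand-picked set of deterministic "programs" with made-up description lengths, so it illustrates the weighting scheme only, not actual Solomonoff induction (which is uncomputable):

    # Each "program" predicts the next bit of a sequence; its prior weight
    # is 2^(-length), with "length" standing in for description complexity.
    programs = {
        "all zeros":     {"length": 2, "predict": lambda hist: 0},
        "all ones":      {"length": 2, "predict": lambda hist: 1},
        "alternate 0,1": {"length": 3, "predict": lambda hist: len(hist) % 2},
        "repeat 0,0,1":  {"length": 4, "predict": lambda hist: [0, 0, 1][len(hist) % 3]},
    }

    def posterior(observations):
        # keep programs consistent with the data, weighted by 2^(-length)
        weights = {}
        for name, prog in programs.items():
            consistent = all(prog["predict"](observations[:i]) == bit
                             for i, bit in enumerate(observations))
            weights[name] = 2.0 ** -prog["length"] if consistent else 0.0
        total = sum(weights.values())
        return {name: w / total for name, w in weights.items() if w > 0}

    print(posterior([0, 0]))     # "all zeros" dominates (it is the shortest fit)
    print(posterior([0, 0, 1]))  # only "repeat 0,0,1" is left standing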

That's our best formal theory of induction in a nutshell.


Someone else pointed me to Solomonoff induction too, which looks really cool as an "idealised" theory of induction and it definitely solves my question in abstract. But there are obvious difficulties with that in practice, like the fact that it's probably uncomputable, right?

I mean I think even the "Complexity" coefficient should be uncomputable in general, since you could probably use a program which computes it to upper bound "Complexity", and if there was such an upper bound you could use it to solve the halting problem etc. Haven't worked out the details though!

Would be interesting if there are practical algorithms for this. Either direct approximations to SI or maybe something else entirely that approaches SI in the limit, like a recursive neural-net training scheme? I'll do some digging, thanks!


Correct anything that's wrong here. Cross entropy is the comparison of two distributions, right? Is the objectivity sussed out in relation to the overlap cross section? And is the subjectivity sussed out not on average but in deviations from the average? Just trying to understand it in my framework, which might be wholly off the mark.


Cross entropy lets you compare two probability distributions. One way you can apply it is to let the distribution p represent "reality" (from which you can draw many samples, but whose numerical value you might not know) and to let q represent "beliefs" (whose numerical value is given by a model). Then by finding q to minimize cross-entropy H[p, q] you can move q closer to reality.
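
Here's a rough sketch of that in Python/NumPy (the "reality" distribution is made up for illustration): given only samples from p, the belief q that minimizes the empirical cross entropy moves toward p.

    import numpy as np
    rng = np.random.default_rng(0)

    p = np.array([0.5, 0.25, 0.25])            # "reality", unknown to the modeler
    samples = rng.choice(3, size=10_000, p=p)  # all we get to see are samples

    def cross_entropy(q, samples):
        # average surprise (in bits) of beliefs q over the observed outcomes
        return -np.mean(np.log2(q[samples]))

    q_uniform = np.array([1/3, 1/3, 1/3])
    q_fitted  = np.bincount(samples, minlength=3) / len(samples)

    print(cross_entropy(q_uniform, samples))   # higher
    print(cross_entropy(q_fitted, samples))    # lower: beliefs closer to reality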

You can apply it other ways. There are lots of interpretations and uses for these concepts. Here's a cool blog post if you want to find out more: https://blog.alexalemi.com/kl-is-all-you-need.html


I'm not sure what you mean by objectivity and subjectivity in this case.

With the example of beliefs, you can think of cross entropy as the negative expected value of the log probability you assigned to an outcome, weighted by the true probability of each outcome. If you assign larger log probabilities to more likely outcomes, the cross entropy will be lower.


This doesn't really make entropy itself observer dependent. (Shannon) entropy is a property of a distribution. It's just that when you're measuring different observers' beliefs, you're looking at different distributions (which can have different entropies the same way they can have different means, variances, etc).


Entropy is a property of a distribution, but since math does sometimes get applied, we also attach distributions to things (eg. the entropy of a random number generator, the entropy of a gas...). Then when we talk about the entropy of those things, those entropies are indeed subjective, because different subjects will attach different probability distributions to that system depending on their information about that system.


Some probability distributions are objective. The probability that my random number generator gives me a certain number is given by a certain formula. Describing it with another distribution would be wrong.

Another example, if you have an electron in a superposition of half spin-up and half spin-down, then the probability to measure up is objectively 50%.

Another example, GPT-2 is a probability distribution on sequences of integers. You can download this probability distribution. It doesn't represent anyone's beliefs. The distribution has a certain entropy. That entropy is an objective property of the distribution.


Of those, the quantum superposition is the only one that has a chance at being considered objective, and it's still only "objective" in the sense that (as far as we know) your description provided as much information as anyone can possibly have about it, so nobody can have a more-informed opinion and all subjects agree.

The others are both partial-information problems which are very sensitive to knowing certain hidden-state information. Your random number generator gives you a number that you didn't expect, and for which a formula describes your best guess based on available incomplete information, but the computer program that generated it knew which one to choose and would not have picked any other. Anyone who knew the hidden state of the RNG would also have assigned a different probability to that number being chosen.


You might have some probability distribution in your head for what will come out of GPT-2 on your machine at a certain time, based on your knowledge of the random seed. But that is not the GPT-2 probability distribution, which is objectively defined by model weights that you can download, and which does not correspond to anyone’s beliefs.


I'm of the view that strictly speaking, even a fair die doesn't have a probability distribution until you throw it. It just so happens that, unless you know almost every detail about the throw, the best you can usually do is uniform.

So I would say the same of GPT-2. It's not a random variable unless you query it. But unless you know unreasonably many details, the best you can do to predict the query is the distribution that you would call "objective."


I think this gets into unanswerable metaphysical questions about when we can say mathematical objects, propositions, etc. really exist.

But I think if we take the view that it's not a random variable until we query it, that makes it awkward to talk about how GPT-2 (and similar models) is trained. No one ever draws samples from the model during training, but the whole justification for the cross-entropy-minimizing training procedure is based on thinking about the model as a random variable.


A more plausible way to argue for objectiveness is to say that some probability distributions are objectively more rational than others given the same information. E.g. when seeing a symmetrical die it would be irrational to give 5 a higher probability than the others. Or it seems irrational to believe that the sun will explode tomorrow.


The probability distribution is subjective for both parts -- because it, once again, depends on the observer observing the events in order to build a probability distribution.

E.g. your random number generator generates 1, 5, 7, 8, 3 when you run it. It generates 4, 8, 8, 2, 5 when I run it. I.e. we have received different information about the random number generator to build our subjective probability distributions. The level of entropy of our probability distributions is high because we have so little information to be certain about the representativeness of our distribution sample.

If we continue running our random number generator for a while, we will gather more information, thus reducing entropy, and our probability distributions will both start converging towards an objective "truth." If we ran our random number generators for a theoretically infinite amount of time, we would have reduced the entropy to 0 and would have a perfect and objective probability distribution.

But this is impossible.


Would you say that all claims about the world are subjective, because they have to be based on someone’s observations?

For example my cat weighs 13 pounds. That seems objective, in the sense that if two people disagree, only one can be right. But the claim is based on my observations. I think your logic leads us to deny that anything is objective.


I do believe in objective reality, but probabilities are subjective. Your cat weighs 13 pounds, and now that you've told me, I know it too. If you asked me to draw a probability distribution for the weight of your cat, I'd draw a tight gaussian distribution around that, representing the accuracy of your scale. My cat weighs a different amount, but I won't tell you how much, so if we both draw a probability distribution, they'll be different. And the key thing is that neither of us has an objectively correct probability distribution, not even me. My cat's weight has an objectively correct value which even I don't know, because my scale isn't good enough.


All right now, here's the big question: how do you know that the evidence your sensory apparatus reveals to you is correct? What I'm getting at is this: the only experience that is directly available to you is your sensory data. And this sensory data is merely a stream of electrical impulses which stimulate your computing center. In other words, all that I really know about the outside universe is relayed to me through my electrical connections.

          Why, that would mean that...I really don't
          know what the outside universe is like at
          all, for certain.


Sorry, this is a major misinterpretation, or at least a completely different one. I don't know how to put it in a more productive way; I think your comment is very confused. You don't need to run a random number generator "for a while" in order to build up a probability distribution.


A representative sample then? Please tell me where I went wrong -- I mean this sincerely.


This might be a frequentist vs bayesian thing, and I am bayesian. So maybe other people would have a different view.

I don't think you need to have any information to have a probability distribution; your distribution already represents your degree of ignorance about an outcome. So without even sampling it once, you already should have a uniform probability distribution for a random number generator or a coin flip. If you do personally have additional information to help you predict the outcome -- you're skilled at coin-flipping, or you wrote the RNG and know an exploit -- then you can compress that distribution to a lower-entropy one.

But you don't need to sample the distribution to do this. You can have that information before the first coin toss. Sampling can be one way to get information but it won't necessarily even help. If samples are independent, then each sample really teaches you barely anything about the next. RNGs eventually do repeat so if you sample it enough you might be able to find the pattern and reduce the entropy to zero, but in that case you're not learning the statistical distribution, you're deducing the exact internal state of the RNG and predicting the exact next outcome, because the samples are not actually independent. If you do enough coin flips you might eventually find that there's a slight bias to the coin, but that really takes an extreme number of tosses and only reduces the entropy a tiny tiny bit; not at all if the coin-tossing procedure had no bias to begin with.
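
As a concrete toy illustration of having information before the first roll (Python; the hidden-parity scenario is assumed, not anything you said):

    import math

    # With no side information, the honest distribution for a fair die is
    # uniform over {1..6}:
    H_prior = math.log2(6)            # about 2.585 bits of uncertainty

    # Suppose that, before the roll, you can peek at a hidden state that
    # fixes the parity. Your distribution compresses to uniform over 3 faces:
    H_given_parity = math.log2(3)     # about 1.585 bits

    print(H_prior - H_given_parity)   # exactly 1.0 bit gained, no sampling needed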

However, the objective truth is just that the next toss will land heads. That's the only truth the experiment can objectively determine. Any other doubt that it might-have-counterfactually-landed-tails is subjective, due to a subjective lack of sufficient information to predict the outcome. We can formalize a correct procedure to convert prior information into a corresponding probability distribution, and we can get a unanimous consensus by giving everybody the same information, but the probability distribution is still subjective because it is a function of that prior information.


I only slightly understand, I'm sorry; I'm not educated enough to understand much of this.

Did you take stats at MIT? I'm going to go through their online material, because I am still very confused.


I appreciate your curiosity!

The best introduction that I can recommend is this type-written PDF from E.T. Jaynes, called "Probability Theory with Applications in Science and Engineering": https://bayes.wustl.edu/etj/science.pdf.html

It requires a lot of attention to read and follow the math, but it's worthwhile. Jaynes is a pretty passionate writer, and in his writing he's clearly battling against some enemies (who might be ghosts), but on the other hand this also makes for more entertaining reading and I find that's usually a benefit when it comes to a textbook.


I read through the first "lecture" yesterday. I'll devote some time for (hopefully) the rest today.

Thank you!


"Entropy is a property of matter that measures the degree of randomization or disorder at the microscopic level", at least when considering the second law.


Right, but the very interesting thing is it turns out that what's random to me might not be random to you! And the reason that "microscopic" is included is because that's a shorthand for "information you probably don't have about a system, because your eyes aren't that good, or even if they are, your brain ignored the fine details anyway."


Right but in chemistry class the way it’s taught via Gibbs free energy etc. makes it seem as if it’s an intrinsic property.


Entropy in physics is usually the Shannon entropy of the probability distribution over system microstates given known temperature and pressure. If the system is in equilibrium then this is objective.


Entropy in physics is usually either the Boltzmann or the Gibbs entropy, both named after physicists who died before Shannon was born.


That's not a problem, as the GP's post is trying to state a mathematical relation, not a historical attribution. Often newer concepts shed light on older ones. As Baez's article says, the Gibbs entropy is the Shannon entropy of an associated distribution (multiplied by the constant k).


It is a problem because all three come with baggage. Almost none of the things discussed in this thread are valid when discussing actual physical entropy, even though the equations are superficially similar. And then there are lots of people being confidently wrong because they assume that it's just one concept. It really is not.


I don't see how the connection is superficial. Even the classical macroscopic definition of entropy, ΔS = ∫ dQ/T, can be derived from the information theory perspective, as Baez shows in the article (using entropy-maximizing distributions and Lagrange multipliers). If you have a more specific critique, it would be good to discuss.
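
For instance, here's a small numerical sketch of that maximum-entropy step (Python/SciPy, with made-up energy levels): constrain the mean energy, maximize the Shannon entropy, and the Boltzmann form p_i ∝ exp(-βE_i) drops out.

    import numpy as np
    from scipy.optimize import minimize

    E = np.array([0.0, 1.0, 2.0, 3.0])    # illustrative energy levels
    E_mean = 1.2                          # imposed macroscopic constraint

    def neg_entropy(p):                   # maximizing H == minimizing -H
        return np.sum(p * np.log(p))

    cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
            {"type": "eq", "fun": lambda p: np.dot(p, E) - E_mean}]
    res = minimize(neg_entropy, x0=np.full(4, 0.25),
                   bounds=[(1e-9, 1.0)] * 4, constraints=cons)

    p = res.x
    beta = -np.polyfit(E, np.log(p), 1)[0]            # slope of log p vs E
    boltzmann = np.exp(-beta * E) / np.sum(np.exp(-beta * E))
    print(np.round(p, 4))
    print(np.round(boltzmann, 4))         # approximately the same distribution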


In classical physics there is no real objective randomness. Particles have a defined position and momentum and those evolve deterministically. If you somehow learned these then the Shannon entropy is zero. If entropy is zero then all kinds of things break down.

So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness, even though temperature does not really seem to be a quantum thing.


> If entropy is zero then all kinds of things break down.

Entropy is a macroscopic variable and if you allow microscopic information, strange things can happen! One can move from a high entropy macrostate to a low entropy macrostate if you choose the initial microstate carefully. But this is not a reliable process which you can reproduce experimentally, ie. it is not a thermodynamic process.

A thermodynamic process P is something which takes a macrostate A to a macrostate B, independent of which microstate a0, a1, a2, ... in A you started off with. If the process depended on the microstate, it wouldn't be something we would recognize, since we are looking from the macro perspective.


> Particles have a defined position and momentum

Which we don’t know precisely. Entropy is about not knowing.

> If you somehow learned these then the shannon entropy is zero.

Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space. (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

> So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness

Or you may study statistical mechanics :-)


> Which we don’t know precisely. Entropy is about not knowing.

No, it is not about not knowing. This is an instance where the intuition from Shannon's entropy does not translate to statistical physics.

It is about the number of possible microstates, which is completely different. In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

> Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space.

No, 0. In this case, there is a single state with p=1, and S = -k Σ p ln(p) = 0.

This is the same if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

The probability p of a microstate is always between 0 and 1, therefore p ln(p) is never positive and S is never negative.

You get the same using Boltzmann’s approach, in which case Ω = 1 and S = k ln(Ω) is also 0.

> (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

Gibbs’ entropy.

> Or you may study statistical mechanics

Indeed.


>>> Particles have a defined position and momentum [...] If you somehow learned these then the shannon entropy is zero.

>> Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space [and diverges to minus infinity if you define precisely the position and momentum of the particles and the volume in phase sphere goes to zero]

> [It's zero also] if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

> The probability p of an microstate is always between 0 and 1, therefore p ln(p) is always negative and S is always positive.

The points in the phase space are not "microstates" with probability between 0 and 1. It's a continuous distribution and if it collapses to a point (i.e. you somehow learned the exact positions and momentums) the density at that point is unbounded. The entropy is also unbounded and goes to minus infinity as the volume in phase space collapses to zero.

You can avoid the divergence by dividing the continuous phase space into discrete "microstates" but having a well-defined "microstate" corresponding to some finite volume in phase space is not the same as what was written above about "particles having a defined position and momentum" that is "somehow learned". The microstates do not have precisely defined positions and momentums. The phase space is not reduced to a single point in that case.

If the phase space is reduced to a single point I'd like to see your proof that S(ρ) = −k ∫ ρ(x) log ρ(x) dx = 0


I hadn't realized that "differential" entropy and Shannon entropy are actually different and incompatible, huh.

So the case I mentioned, where you know all the positions and momentums, has 0 Shannon entropy and -Inf differential entropy. And a typical distribution will instead have Inf Shannon entropy and finite differential entropy.

Wikipedia has some pretty interesting discussion about differential entropy vs. the limiting density of discrete points, but I can't claim to understand it or whether it could bridge the gap here.
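
The gap is easy to see numerically. A small sketch (Python): discretize the uniform distribution on [0, 1] into ever finer bins; the discrete Shannon entropy grows like the log of the bin count, while "discrete entropy + log(bin width)" (the differential entropy) stays finite.

    import numpy as np

    for n in [10, 100, 1000, 10_000]:
        width = 1.0 / n
        p = np.full(n, width)                         # probability of each bin
        H_discrete = -np.sum(p * np.log2(p))          # grows without bound
        h_differential = H_discrete + np.log2(width)  # stays at 0 for U[0, 1]
        print(n, round(H_discrete, 3), round(h_differential, 3))
    # If instead the distribution itself narrows toward a point, the
    # differential entropy heads to -inf, which is the divergence above.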


> So the case I mentioned, where you know all the positions and momentums has 0 shannon entropy

No, Shannon entropy is not applicable in that case.

https://en.wikipedia.org/wiki/Entropy_(statistical_thermodyn...

Quantum mechanics solves the issue of the continuity of the state space. However, as you probably know, in quantum mechanics all the positions and momentums cannot simultaneously have definite values.


> possible microstates

Conditional on the known macrostate. Because we don’t know the precise microstate - only which microstates are possible.

If your reasoning is that « experimental entropy can be measured so it’s not about that » then it’s not about macrostates and microstates either!


> In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

Enthalpy is also dependent on your choice of state variables, which is in turn dictated by which observables you want to make predictions about: whether two microstates are distinguishable, and thus whether they are part of the same macrostate, depends on the tools you have for distinguishing them.


A calorimeter does not care about anyone’s choice of state variables. Entropy is not only something that exists in abstract theoretical constructs, it is something we can get experimentally.


That's actually the conventional view; saying that info and stat-mech entropy are the same is the outlier position, most popularized by Jaynes.


If information-theoretical and statistical mechanics entropies are NOT the same (or at least, deeply connected) then what stops us from having a little guy[0] sort all the particles in a gas to extract more energy from them?

[0] https://en.wikipedia.org/wiki/Maxwell%27s_demon


Sounds like a non-sequitur to me; what are you implying about the Maxwell's demon thought experiment vs the comparison between Shannon and stat-mech entropy?


Yeah but distributions are just the accounting tools to keep track of your entropy. If you are missing one bit of information about a system, your understanding of the system is some distribution with one bit of entropy. Like the original comment said, the entropy is the number of bits needed to fill in the unknowns and bring the uncertainty down to zero. Your coin flips may be unknown in advance to you, and thus you model it as a 50/50 distribution, but in a deterministic universe the bits were present all along.


Trivial example: if you know the seed of a pseudo-random number generator, a sequence generated by it has very low entropy.

But if you don't know the seed, the entropy is very high.
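
A small sketch of that in Python (using the standard library PRNG and an arbitrary seed just for illustration):

    import math
    import random

    # If you know the seed, you can predict the whole sequence in advance:
    # from your point of view each output carries zero surprise.
    rng_yours  = random.Random(42)
    rng_actual = random.Random(42)          # the generator actually being run
    predicted = [rng_yours.randint(0, 9) for _ in range(5)]
    observed  = [rng_actual.randint(0, 9) for _ in range(5)]
    print(predicted == observed)            # True: 0 bits of entropy for you

    # Without the seed, the honest per-digit model is uniform over 0..9:
    print(math.log2(10))                    # ~3.32 bits of entropy per digit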


Theoretically, it's still only the entropy of the seed space + the space of times it could have been running in, right?


To shorten this for you with my own (identical) understanding: "entropy is just the name for the bits you don't have".

Entropy + Information = Total bits in a complete description.


It's an objective quantity, but you have to be very precise in stating what the quantity describes.

Unbroken egg? Low entropy. There's only one way the egg can exist in an unbroken state, and that's it. You could represent the state of the egg with a single bit.

Broken egg? High entropy. There are an arbitrarily-large number of ways that the pieces of a broken egg could land.

A list of the locations and orientations of each piece of the broken egg, sorted by latitude, longitude, and compass bearing? Low entropy again; for any given instance of a broken egg, there's only one way that list can be written.

Zip up the list you made? High entropy again; the data in the .zip file is effectively random, and cannot be compressed significantly further. Until you unzip it again...

Likewise, if you had to transmit the (uncompressed) list over a bandwidth-limited channel. The person receiving the data can make no assumptions about its contents, so it might as well be random even though it has structure. Its entropy is effectively high again.


Baez has an accompanying video, with slides:

https://m.youtube.com/watch?v=5phJVSWdWg4&t=17m

He illustrates the derivation of Shannon entropy with pictures of trees.


> it's a subjective quantity -- a property of the observer, not the observed

Shannon's entropy is a property of the source-channel-receiver system.


Can you explain this in more detail?

Entropy is calculated as a function of a probability distribution over possible messages or symbols. The sender might have a distribution P over possible symbols, and the receiver might have another distribution Q over possible symbols. Then the "true" distribution over possible symbols might be another distribution yet, call it R. The mismatch between these is what leads to various inefficiencies in coding, decoding, etc [1]. But both P and Q are beliefs about R -- that is, they are properties of observers.

[1] https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Co...


> he glosses over this

All of information theory is relative to the channel. This bit is well communicated.

What he glosses over is the definition of "channel", since it's obvious for electromagnetic communications.



Shannon entropy is subjective for Bayesians and objective for frequentists.


The entropy is objective if you completely define the communication channel, and subjective if you weave the definition away.


the subjectivity doesn't stem from the definition of the channel but from the model of the information source. what's the prior probability that you intended to say 'weave', for example? that depends on which model of your mind we are using. frequentists argue that there is an objectively correct model of your mind we should always use, and bayesians argue that it depends on our prior knowledge about your mind


(i mean, your information about what the channel does is also potentially incomplete, so the same divergence in definitions could arise there too, but the subjectivity doesn't just stem from the definition of the channel; and shannon entropy is a property that can be imputed to a source independent of any channel)



