What? I've never heard of this math. Do you mean those are literally the resulting operations in the general case, or are those approximate explanations, so we need to find more specific cases to make this true?
Max and plus in the space of random variables become product and convolution in the space of their distribution functions.
Distr(X + Y) = Distr(X) ° Distr(Y)
Distr(X ^ Y) = Distr(X) * Distr(Y)
where '^' denotes max and '°' denotes convolution.
Note that *, +, ° and ^ are commutative and associative, so they can be chained. One can also use their distributive properties. This is really the math of groups and rings.
However, one can and one does resort to approximations to compute the desired end results.
More specifically, people are often interested not in the distribution itself but in some statistic of it: the mean, the standard deviation, a tail percentile, etc. To compute those statistics from the exact distributions, approximations can be employed.
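Here is a quick numerical sketch of the max case (my own illustration, assuming n i.i.d. Exp(1) variables and plain numpy): the 99th percentile of the max, computed once from the product of CDFs and once by brute-force Monte Carlo.

    import numpy as np

    n, p = 10, 0.99
    # Product of CDFs: F_max(x) = (1 - exp(-x))**n, so the p-quantile is
    # x = -log(1 - p**(1/n)).
    analytic = -np.log(1.0 - p ** (1.0 / n))

    # Brute-force check: sample the max directly.
    rng = np.random.default_rng(0)
    mc = np.quantile(rng.exponential(size=(200_000, n)).max(axis=1), p)

    print(analytic, mc)  # the two should agree to roughly two decimals

The same trick gives any other tail percentile of the max without ever touching the joint distribution, provided the variables are independent.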
Max of variables = product of cumulative distribution functions.
Sum of variables = convolution of probability density functions.
So both of the equations you write down are correct, but only if you interpret "Distr" as meaning different things in the two cases.
[EDITED to add:] Provided the random variables in question are independent, as mentioned elsewhere in the discussion; if they aren't then none of this works.
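A small sanity check of both statements for two independent standard normals (just an illustrative sketch with numpy/scipy; the sum should come out N(0, sqrt(2)) and the product of CDFs should match an empirical CDF of the max):

    import numpy as np
    from scipy.stats import norm

    dx = 0.01
    x = np.arange(-10.0, 10.0, dx)

    # Sum of variables = convolution of probability density functions.
    pdf_sum = np.convolve(norm.pdf(x), norm.pdf(x)) * dx      # full convolution
    x_sum = 2 * x[0] + dx * np.arange(len(pdf_sum))           # grid for the sum
    print(np.max(np.abs(pdf_sum - norm.pdf(x_sum, scale=np.sqrt(2)))))  # tiny

    # Max of variables = product of cumulative distribution functions.
    cdf_max = norm.cdf(x) ** 2
    s = np.sort(np.random.default_rng(1).standard_normal((100_000, 2)).max(axis=1))
    ecdf = np.searchsorted(s, x, side="right") / len(s)
    print(np.max(np.abs(cdf_max - ecdf)))  # small, up to Monte Carlo noise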
The original post, to which I replied, is about the correspondence between summation of random variables and convolution of their distributions. Independence is sufficient for that.
I just carried that assumption of independence through into my own comment, thinking it was obvious to do that (carry over the assumptions).
Both ideas work for the cumulative distribution function, which in math is called simply the distribution function. I think he got confused by the fact that the convolution relation also works with densities (so he might have assumed it works with densities only and not with distribution functions).
I'm sorry, but I think you are just wrong about convolutions and cumulative distribution functions.
Let's take the simplest possible case: a "random" variable that's always equal to 0. Its cdf is a step function: 0 for negative values, 1 for positive values. (Use whatever convention you prefer for the value at 0.)
The sum of two such random variables is another with the same distribution, of course. So, what's the convolution of the cdfs?
Answer: it's not even well defined.
The convolution of functions f and g is the function h such that h(x) = integral over t of f(t) g(x-t) dt. The integral is over the whole of (in this case) the real numbers.
In this case f and g are both step functions as described above, so (using the convenient Iverson bracket notation for indicator functions) this is the integral over t of [t>0] [x-t>0] dt, i.e., of [0<t<x] dt, whose value is 0 for negative x and x for positive x.
This is not the cdf of any probability distribution, since it doesn't tend to 1 as x -> oo. In particular, it isn't the cdf of the right probability distribution, which, as mentioned above, would be the same step function as f and g.
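In display form, the same computation:

    \int_{-\infty}^{\infty} [t > 0]\,[x - t > 0]\,dt
      = \int_{-\infty}^{\infty} [0 < t < x]\,dt
      = \max(x, 0),

which grows without bound instead of tending to 1.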
If X,Y are independent with pdfs f,g and cdfs F,G then the cdf of X+Y is (not F conv G but) f conv G = F conv g.
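A quick numerical spot-check of that last identity for two independent standard normals (my own sketch, using scipy; the exact answer is norm.cdf(x / sqrt(2))):

    import numpy as np
    from scipy.stats import norm

    dx = 0.01
    t = np.arange(-10.0, 10.0, dx)
    x = 1.3  # arbitrary point at which to evaluate the cdf of X + Y

    # (f conv G)(x): pdf of X convolved with cdf of Y, evaluated at x.
    approx = np.sum(norm.pdf(t) * norm.cdf(x - t)) * dx
    exact = norm.cdf(x / np.sqrt(2))   # cdf of N(0, sqrt(2)) at x
    print(approx, exact)               # should agree to several decimals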
Oops, one thing in the above is completely wrong (I wrote it before thinking things through carefully, and then forgot to delete it).
It is not at all true that "it's not even well defined"; indeed, the following couple of paragraphs determine exactly what the thing in question is. It's not an actual cdf, but the problem isn't that it's ill-defined: it's well-defined but has the wrong shape to be a cdf.
Thank you for your welcome; I must have been lurking here for around 30 years or more (always changing accounts). Anyway, in this specific case, since M = Max(X,X) = X, you can't have F(M) = F(X)*F(X) = F(X) except when F(X) is in {0,1}, so the independence property is essential. Welcome, fellow Lisper (for the txr and related submission), math-inspired (this one and another related to statistical estimation), with OS-related interests (your HN account); OS are not my cup of tea, but awk is not bad.
In another post there are some comments about connections between topology and deep learning. I wonder if there is a notion similar to dimension in topology that would allow you to estimate the minimal size (number of parameters) of a neural network needed to reach a certain state (for example, the capacity for one-shot learning with high probability).
Yes, independence is absolutely an assumption that I (implicitly) made. It's essential for the convolution identity to hold as well; I just carried that assumption through.
We share an interest in AWK (*) then :) I don't know OSes at all. Did you imply I know Lisp? I enjoy Scheme, but have never used it in anger. Big fan of The Little Schemer series of books.
(*) Have to find that Weinberger face Google-NY t-shirt. Little treasures.
Regarding your dimensions comment, this is well understood for a single layer, that is, for logistic regression. Lehmann's book will have the necessary material. With multiple layers it gets complicated real fast.
The best performance estimates, as in, within the realm of being practically useful, largely come from two approaches: one from PAC-Bayesian bounds, the other from statistical physics (but these bounds are data-distribution dependent). The intrinsic dimension of the data plays a fundamental role there.
The recommended place to dig around is JMLR (journal of machine learning research).
Perhaps your txr submission suggests a Lisp flavor. The intrinsic dimension concept looks interesting, as does the VC dimension, but both concepts are very general. Perhaps the Lehmann book you mean is Elements of Large-Sample Theory.
I meant Lehmann's Theory of Point Estimation, but Elements of Large-Sample Theory is a good book too. The newer editions of TPE are a tad hefty in page count. The earlier editions would serve you fine.
The generic idea is that the smaller these dimensions, the easier the prediction problem. Intrinsic dimension is the one that comes closest to topology. VC is very combinatorial and gives the worst of worst-case bounds: for a typically sized dataset one ends up with an error probability estimate of less than 420. With PAC-Bayes the bounds are at least less than 1.0.