I wondered about the connection to actor-critic too; GANs have taken inspiration from RL, but so far they haven't given anything back, and offhand I don't know of anything like clipping in actor-critic. My thought, though, was that it is the *critic* which should be clipped, not the actor. The critic seems exactly analogous to the discriminator in a GAN, as it tries to judge the quality of what the other network produces (the action taken by the actor, or the image emitted by the generator). So perhaps the key experiment here would be to add clipping to the critic weights and see if it reduces the variance and the system as a whole learns faster?
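To make the experiment concrete, here is a minimal PyTorch-style sketch of what I have in mind: an ordinary critic regression update followed by the WGAN clipping step applied to the critic's weights. The network shape, optimizer, and targets here are just placeholders; the only part taken from WGAN is the final clamp, with the clip constant c=0.01 as in the paper.

    import torch
    import torch.nn as nn

    # hypothetical critic network for an actor-critic agent
    critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
    c = 0.01  # clip constant, borrowed from the WGAN recipe

    def critic_update(states, targets):
        # ordinary TD-style regression step for the critic
        values = critic(states).squeeze(-1)
        loss = (targets - values).pow(2).mean()
        critic_opt.zero_grad()
        loss.backward()
        critic_opt.step()
        # the WGAN trick: clamp every critic weight into [-c, c] after the step
        for p in critic.parameters():
            p.data.clamp_(-c, c)

The comparison would then just be the same agent with and without the clamp loop, looking at variance of the critic's value estimates and overall learning speed.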
I also wonder about the scale; with WGAN, the Wasserstein distance and losses can change dramatically depending on the exact model structure, and you seem to need to adjust the learning rate drastically (is that the implication of your mention of the constant being buried in alpha? I've noted elsewhere that WGAN seems to need aggressive tweaking of the learning rate, but so far no one else has mentioned it). One of the key ingredients is letting the loss vary over a wider range rather than taking the log of it; what might the equivalent be for actor-critic?
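For concreteness, this is roughly the loss-range difference I mean; a small sketch, with d_real/d_fake standing in for discriminator or critic scores on real and generated samples (the scores themselves are just random placeholders here):

    import torch
    import torch.nn.functional as F

    d_real = torch.randn(64) * 3   # hypothetical scores on real samples
    d_fake = torch.randn(64) * 3   # hypothetical scores on generated samples

    # Standard GAN discriminator loss: scores squashed through a sigmoid and logged,
    # so the loss is bounded and saturates once the discriminator is confident.
    gan_loss = -(F.logsigmoid(d_real) + F.logsigmoid(-d_fake)).mean()

    # WGAN critic loss: raw scores used directly, so it can take any real value,
    # and its scale depends on the critic architecture and the clipping constant.
    wgan_loss = -(d_real.mean() - d_fake.mean())

The unsquashed version is what lets the loss (and hence the effective gradient scale) wander so widely, which is why the learning rate seems so sensitive; the actor-critic analogue would presumably be whatever plays the role of that squashing in the critic's objective.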