This was really helpful. Thanks!
Scott Le Grand — HEY! This is Halloween, not April Fool's! It's bad enough that Christmas decorations are already up! Now we're skipping ahead to 2019? Just say NO!
azb — Machine Learning is just following vector fields - basically Maths 101 - and Bayesian statistics is just calculus.
azb — I'm actually quite surprised they did not try the Laplace operator directly, at least on the toy dataset... people are getting lazy or something. Even if it is expensive, one could at least measure the difference between it and their squared-gradient approximation. Most of the results for the actual gradient regularizer otherwise require that the function is sufficiently close to the epsilon-solution, which we all know is absolutely never true during training, and most likely not even at convergence, since running only 5 steps for a fixed generator is not going to do much about that.
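For concreteness, a minimal numpy sketch of the kind of measurement this comment asks for: comparing the squared gradient norm with the Laplacian via central finite differences. Nothing here comes from the paper or the blog; the toy function f, the sampling distribution, and the step size are placeholders chosen only for illustration.

import numpy as np

# Toy scalar function standing in for a discriminator on R^2 (made up for this demo).
def f(x):
    return np.tanh(x[0] * x[1]) + 0.5 * x[0] ** 2

def sq_grad_and_laplacian(f, x, h=1e-4):
    # Central finite differences for ||grad f(x)||^2 and the Laplacian of f at x.
    d = len(x)
    sq_grad, lap = 0.0, 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        f_plus, f_minus, f_mid = f(x + e), f(x - e), f(x)
        sq_grad += ((f_plus - f_minus) / (2 * h)) ** 2
        lap += (f_plus - 2 * f_mid + f_minus) / h ** 2
    return sq_grad, lap

rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 2))
vals = np.array([sq_grad_and_laplacian(f, x) for x in xs])
print("mean squared gradient norm:", vals[:, 0].mean())
print("mean Laplacian:            ", vals[:, 1].mean())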
cc ss — based on this line of code: lam = numpy.random.beta(alpha+1, alpha, batchsize)
I think what you are doing is indeed not mixing up labels; rather, with a certain probability you are randomly changing the label of the same mixed-up sample. For example, a mixed-up sample ax + (1-a)y gets label_x with probability p and label_y with probability (1-p); meanwhile, ay + (1-a)x gets label_y with probability p and label_x with probability (1-p). As you know, x and y are interchangeable, so in expectation the label is p label_x + (1-p) label_y. The main difference between your blog and the paper is the way of understanding it: you explain it as randomly changing the label, while the paper presents it as mixing up the labels.
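To make the equivalence described in this comment concrete, here is a small numpy Monte Carlo sketch (not code from the blog or the paper; alpha, the sample count, and the variable names are invented for illustration). It checks that "mix the labels with lam ~ Beta(alpha, alpha)" and "draw lam ~ Beta(alpha+1, alpha) and keep only label_x" induce the same distribution on the weight of the kept label, which is why they agree in expectation for losses that are linear in the label.

import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.4, 500_000  # arbitrary values for the demo

# View 1 ("mix up the labels"): lam ~ Beta(alpha, alpha); the mixed sample
# lam*x + (1-lam)*y gets the soft label lam*label_x + (1-lam)*label_y.
# For a loss linear in the label this matches, in expectation, keeping label_x
# with probability lam and label_y otherwise; record the kept label's weight.
lam1 = rng.beta(alpha, alpha, n)
keep_x = rng.random(n) < lam1
kept_weight_1 = np.where(keep_x, lam1, 1.0 - lam1)

# View 2 ("randomly change the label"): lam ~ Beta(alpha+1, alpha),
# and label_x is always kept.
kept_weight_2 = rng.beta(alpha + 1, alpha, n)

# Both kept-weight distributions are Beta(alpha+1, alpha), so the summaries agree.
print(kept_weight_1.mean(), kept_weight_2.mean())  # both ~ (alpha+1)/(2*alpha+1)
print(np.quantile(kept_weight_1, [0.25, 0.5, 0.75]))
print(np.quantile(kept_weight_2, [0.25, 0.5, 0.75]))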