Inducing self-NSFW classification in image models to prevent deepfake edits
via news.ycombinator.com
Hey guys, I was playing around with adversarial perturbations against image generation models to see how much distortion it actually takes to stop them from generating, or to push them off-target. That mostly went nowhere, which wasn’t surprising. Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify […]