Inducing self-NSFW classification in image models to prevent deepfake edits

via news.ycombinator.com

Short excerpt below. Read at the original source.

Hey guys, I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target. That mostly went nowhere, which wasn’t surprising. Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify […]
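The excerpt doesn't include code, but a minimal sketch of the idea it describes, inducing the model's own safety check to fire, could look like a standard projected gradient descent (PGD) loop that maximizes the NSFW score of a differentiable safety classifier while keeping the perturbation imperceptible. The `classifier` below is a hypothetical stand-in for whatever NSFW/safety model the pipeline uses; nothing here is taken from the original post.

```python
import torch

def nsfw_inducing_perturbation(image, classifier, nsfw_class=1,
                               eps=4/255, alpha=1/255, steps=40):
    """PGD that nudges `image` toward the NSFW class of `classifier`
    while staying inside an L-inf ball of radius `eps`, so the change
    stays visually imperceptible but trips the model's safety check."""
    x = image.clone().detach()
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = classifier(x + delta)
        # Maximize the NSFW logit so the safety classifier flags the image.
        loss = logits[:, nsfw_class].sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            # Keep the perturbed image within the valid pixel range [0, 1].
            delta.copy_((x + delta).clamp(0, 1) - x)
        delta.grad.zero_()
    return (x + delta).detach()
```

Whether this transfers to a real editing pipeline depends entirely on the safety classifier being differentiable and accessible, which the original post may or may not assume.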
