I got some references from a fellow student; here is a summary of them.
Nicholas Carlini
There are several talks related to this field.
Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples.
What it means to evaluate adversarial robustness (a higher-level talk). [View on YouTube]
Model stealing attacks, covering stealing a remote neural network and its training data.
Poisoning attacks.
Research on AI security naturally spans many areas, not just adversarial examples.
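Since adversarial examples are the common thread in these talks, here is a minimal sketch of a one-step FGSM-style attack on an image classifier. The `model`, `x`, and `y` in the usage note are placeholders; this is only an illustration, not code from any of the talks.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """One-step FGSM: nudge every pixel in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

# Usage (placeholders): `model` is a trained classifier, (x, y) a labeled batch of images in [0, 1].
# x_adv = fgsm_attack(model, x, y)
# print((model(x_adv).argmax(dim=1) != y).float().mean())  # fraction of inputs the attack flips
```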
Gradient Science
An MIT lab (the Madry Lab).
Their research focuses on developing the understanding and tools that enable real-world deployment of machine learning models in a reliable and responsible way. In particular, much of their thinking revolves around pinpointing the exact role of data in ML-driven decision making and how to tackle the challenge posed by omnipresent distribution shift.
They are also interested in thinking through the societal and policy implications of the use of modern ML tools.
TODO HERE: There are some interesting blog posts from the Madry Lab, and more great people to follow there.
The discussion of how to judge a researcher's level of autonomy is thought-provoking.
- Level 1: A new researcher will typically complete their assigned task and share only the raw results (e.g., a plot or a table)
    - e.g.: "I ran what we talked about, here are the results!"
- Level 2: A slightly more experienced researcher might reflect on the results and offer an interpretation
    - e.g.: "It's nice to see the new algorithm improves performance, but the baseline is actually much stronger than we expected"
    - Interpreting data can be hard at first, but it becomes more natural over time. If no insight comes to mind, don't force it, but proposing potential explanations is good practice.
- Level 3: A more experienced researcher might go further and suggest some next steps to discuss with their advisor
    - e.g.: "I wonder whether we should consider other tasks that better highlight the concept we care about. For example, I did some searching and found X, Y, and Z"
- Level 4: A step beyond that is to proactively run additional studies or experiments beyond what was discussed
    - e.g.: "I also ran our model on datasets X, Y, and Z and found that existing methods actually do quite poorly. Do you think a larger-scale study of these settings would make sense?"
    - This becomes much easier as your velocity improves
- Level 5+: As you progress further, you may be able to go through this process at higher levels of abstraction (first within a project, then across projects!)
    - e.g.: "The last few experiments nicely demonstrate property A of our system. I wonder if we could explore property B, perhaps with experiments Q, R, or S?"
    - e.g.: "Our last paper made point C well. I wonder if we could push this further by looking at setting D with tools from field W?"
    - e.g.: "One thing that is really unsatisfying about methods in field Q is that they don't really achieve property P. Do you think we could implement that somehow? Here is a sketch of one possible approach."
- Note: your advisor is not always right. Sometimes they may suggest an idea that doesn't work, or they may not have a good solution. That is normal, because research is exploratory and you are working on a new problem.
CHICAGO
TODO HERE: There is some interesting news about AI security.
A data-centric view on reliable generalization - Ludwig Schmidt | Stanford MLSys #71
Are aligned neural networks adversarially aligned?
*align: to bring into agreement, to make consistent
This paper touches on NLP, multimodal models, and social engineering.
Large language models are now tuned to align with the goals of their creators, namely to be “helpful and harmless.” These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs.
However, the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary unaligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models. Warning: some content generated by language models in this paper may be offensive to some readers.
Much of early adversarial machine learning focused on image classification; this paper extends the scope of adversarial attacks.
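To make the image-channel attack described in the abstract concrete, here is a hedged PGD-style sketch: iteratively perturb the input image so that a vision-language model assigns high likelihood to a chosen target continuation. The `model.target_nll(image, prompt, target)` interface is an assumption made for illustration; it is not the paper's actual code or API.

```python
import torch

def adversarial_image(model, image, prompt, target_text,
                      epsilon=8 / 255, step=1 / 255, iters=500):
    """PGD-style search for an image perturbation that pushes a multimodal model
    toward emitting `target_text`. The model interface is assumed, not the paper's."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        # Assumed API: negative log-likelihood of `target_text` given the
        # perturbed image and the text prompt (differentiable w.r.t. the image).
        loss = model.target_nll(image + delta, prompt, target_text)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()                 # lower the target NLL
            delta.clamp_(-epsilon, epsilon)                   # stay inside the L-inf ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixel values valid
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

# Usage (all names hypothetical): img_adv = adversarial_image(vlm, img, user_prompt, target_string)
```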
- Author: faii
- URL: https://www.faii.top/article/c6c63b1f-24eb-4968-9de9-3f13446d40f8
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please credit the source!