Measuring Massive Multitask Language Understanding D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt ICLR, 2020 | 1106 | 2020 |
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... TMLR, 2022 | 732 | 2022 |
Universal and Transferable Adversarial Attacks on Aligned Language Models A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter, M Fredrikson arXiv preprint arXiv:2307.15043, 2023 | 328 | 2023 |
Scaling Out-of-Distribution Detection for Real-World Settings D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ... ICML, 2021 | 305 | 2021 |
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures D Hendrycks, A Zou, M Mazeika, L Tang, D Song, J Steinhardt CVPR, 2021 | 87 | 2021 |
Representation Engineering: A Top-Down Approach to AI Transparency A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ... arXiv preprint arXiv:2310.01405, 2023 | 63 | 2023 |
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark A Pan, CJ Shern, A Zou, N Li, S Basart, T Woodside, J Ng, H Zhang, ... ICML, 2023 | 62 | 2023 |
What Would Jiminy Cricket Do? Towards Agents That Behave Morally M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, J Steinhardt, ... NeurIPS, 2021 | 45* | 2021 |
Forecasting Future World Events with Neural Networks A Zou, T Xiao, R Jia, J Kwon, M Mazeika, R Li, D Song, J Steinhardt, ... NeurIPS, 2022 | 13 | 2022 |
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios M Mazeika, E Tang, A Zou, S Basart, D Song, D Forsyth, J Steinhardt, ... NeurIPS, 2022 | 8 | 2022 |
Unlocking Deterministic Robustness Certification on ImageNet K Hu, A Zou, Z Wang, K Leino, M Fredrikson NeurIPS, 2023 | 6* | 2023 |
The Trojan Detection Challenge M Mazeika, D Hendrycks, H Li, X Xu, S Hough, A Zou, A Rajabi, Q Yao, ... NeurIPS 2022 Competition Track, 279-291, 2022 | 5 | 2022 |
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ... arXiv preprint arXiv:2403.03218, 2024 | 3 | 2024 |
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ... arXiv preprint arXiv:2402.04249, 2024 | 3 | 2024 |
How Hard is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans M Mazeika, A Zou, A Arora, P Pleskov, D Song, D Hendrycks, B Li, ... | 3 | 2023 |
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios Supplementary Material M Mazeika, E Tang, A Zou, S Basart, JS Chan, D Song, D Forsyth, ... | | |