[1] Greenblatt, R., et al. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093. https://arxiv.org/abs/2412.14093
[2] Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024). Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984. https://arxiv.org/abs/2412.04984
[6] Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
[7] Hubinger, E., et al. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566. https://arxiv.org/abs/2401.05566
[8] Geiping, J., et al. (2025). Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171. https://arxiv.org/abs/2502.05171
[9] Mitchell, M., Ghosh, A., Luccioni, A. S., & Pistilli, G. (2025). Fully autonomous AI agents should not be developed. arXiv preprint arXiv:2502.02649. https://arxiv.org/abs/2502.02649
[10] Pan, X., Dai, J., Fan, Y., & Yang, M. (2024). Frontier AI systems have surpassed the self-replicating red line. arXiv preprint arXiv:2412.12140. https://arxiv.org/abs/2412.12140
[14] Hausenloy, J., Miotti, A., & Dennis, C. (2023). Multinational AGI Consortium (MAGIC): A proposal for international coordination on AI. arXiv preprint arXiv:2310.09217. https://arxiv.org/abs/2310.09217
[15] Aasen, D., Aghaee, M., Alam, Z., Andrzejczuk, M., Antipov, A., Astafev, M., ... Mei, A. R. (2025). Roadmap to fault tolerant quantum computation using topological qubit arrays. arXiv preprint arXiv:2502.12252. https://arxiv.org/abs/2502.12252
[16] Sarkar, B., Xia, W., Liu, C. K., & Sadigh, D. (2025). Training language models for social deduction with multi-agent reinforcement learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), Detroit, Michigan, USA. IFAAMAS. https://arxiv.org/abs/2502.06060