1. Greenblatt, R., et al. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093. https://arxiv.org/abs/2412.14093
2. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024). Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984. https://arxiv.org/abs/2412.04984
3. OpenAI. (2024). OpenAI o1 System Card. arXiv preprint arXiv:2412.16720. https://arxiv.org/abs/2412.16720
4. OpenAI. (2025). OpenAI o3-mini System Card. https://cdn.openai.com/o3-mini-system-card.pdf
5. Alignment Science Team. (2025). Recommendations for technical AI safety research directions. Anthropic Alignment Blog. https://alignment.anthropic.com/2025/recommended-directions
6. Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
7. Hubinger, E., et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv preprint arXiv:2401.05566. https://arxiv.org/abs/2401.05566
8. Geiping, J., et al. (2025). Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171. https://arxiv.org/abs/2502.05171
9. Mitchell, M., Ghosh, A., Luccioni, A. S., & Pistilli, G. (2025). Fully autonomous AI agents should not be developed. arXiv preprint arXiv:2502.02649. https://arxiv.org/abs/2502.02649
10. Pan, X., Dai, J., Fan, Y., & Yang, M. (2024). Frontier AI systems have surpassed the self-replicating red line. arXiv preprint arXiv:2412.12140. https://arxiv.org/abs/2412.12140
11. OpenAI et al. (2025). Competitive Programming with Large Reasoning Models. arXiv preprint arXiv:2502.06807. https://arxiv.org/abs/2502.06807
12. Li, A., Zhou, Y., Raghuram, V. C., Goldstein, T., & Goldblum, M. (2025). Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks. arXiv preprint arXiv:2502.08586. https://arxiv.org/abs/2502.08586
13. Leahy, C., Alfour, G., Scammell, C., Miotti, A., & Shimi, A. (2024). The Compendium (V1.3.1) [Living document]. https://pdf.thecompendium.ai/the_compendium.pdf
14. Hausenloy, J., Miotti, A., & Dennis, C. (2023). Multinational AGI Consortium (MAGIC): A Proposal for International Coordination on AI. arXiv preprint arXiv:2310.09217. https://arxiv.org/abs/2310.09217
15. Aasen, D., Aghaee, M., Alam, Z., Andrzejczuk, M., Antipov, A., Astafev, M., ... Mei, A. R. (2025). Roadmap to fault tolerant quantum computation using topological qubit arrays. arXiv preprint arXiv:2502.12252. https://arxiv.org/abs/2502.12252
16. Sarkar, B., Xia, W., Liu, C. K., & Sadigh, D. (2025). Training language models for social deduction with multi-agent reinforcement learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), Detroit, Michigan, USA. IFAAMAS. https://arxiv.org/abs/2502.06060
17. Anthropic. (2025, February 24). Claude 3.7 Sonnet System Card. https://www.anthropic.com/claude-3-7-sonnet-system-card
18. TIME. (2024, December 18). Exclusive: New Research Shows AI Strategically Lying. https://time.com/7202784/ai-research-strategic-lying
19. Vijayan, J. (2023, December 5). LLMs Open to Manipulation Using Doctored Images, Audio. Dark Reading. https://www.darkreading.com/vulnerabilities-threats/llms-open-manipulation-using-doctored-images-audio