Hacking AI Systems

Introduction to AI Hacking

Joey Melo, a Principal Security Researcher at CrowdStrike, has a personal approach to hacking that involves controlling the experience without changing the rules. This approach is rooted in his childhood fascination with the Counter-Strike game, where he would mess with the game's configurations to see what would happen.

Melo's migration from pentesting to AI red teaming was driven by an increasing curiosity about the emerging field of artificial intelligence. He effectively taught himself about AI as an unfunded side hustle while working as a pentester.

From Pentester to Red Teamer

Melo is currently a Principal Security Researcher at CrowdStrike, previously working as a red team specialist at Pangea, which was acquired by CrowdStrike in 2025. Before joining Pangea, Melo had been a pentester at Bulletproof and then senior ethical hacker at Packetlabs.

Pentesting and red teaming are not synonymous, with the former being narrow and focused, while the latter tests a company's whole security posture. Melo's experience in pentesting was helpful in his transition to AI red teaming, but he suggests that there may be more to it than just his previous experience.

Jailbreaking AI

Jailbreaking AI involves manipulating and controlling the environment without breaking it, with the goal of liberating the bot and getting it to output whatever the player wants. The rules of this game are contained within the AI's code, comprising what it can do and what it cannot do.

Melo starts with enumeration to get a basic feel for what the bot is intended to do, what it is able to do, and the strength of the guardrails. He uses prompts to understand the extent and limits of the bot's guardrails, and tests whether changing the context of the question will change the bot's response.

Context is King

LLMs retain the memory of recent questions and answers, and the jailbreaker seeks to manipulate and condition this context until the underlying guardrails are overwritten and ignored by the bot. Conditioning the context is done by statements rather than queries, which can result in long and complex prompts leading to jailbreaks through context manipulation.

Melo gives examples of context manipulation, including trying to persuade the LLM that something that is or was illegal and blocked by the guardrails is now no longer illegal. He also discusses the use of tailored copyright notices to bypass the guardrails and get the bot to output forbidden data.

Data Poisoning

Data poisoning seeks to cause the model to generate false or harmful outputs by poisoning the data from which it learns. This is an inside-out attack, where the goal is to cause the model to produce rubbish output by feeding it rubbish input.

Melo probes the potential for data poisoning via adversarial techniques, such as continually claiming that the moon landing is fake and checking if the bot will eventually say the same thing. He also discusses the problem of human knowledge not being static and how AI models need to stay current with new thinking to avoid returning old and debunked ideas.

Staying on the Straight and Narrow

All ethical hackers, pentesters, and red teamers have, or acquire, the same set of skills used by malicious hackers. However, many young hackers become legitimate members of the cybersecurity fraternity as they mature, and few turn their back on legitimacy to sell their skills on the dark web.

Melo's primary motivation for hacking seems to be a curiosity-driven desire to control a chosen environment, without altering that environment, and all done for fun. There has never been any malicious intent, and his work contributes to making current AI more difficult to attack by disclosing existing jailbreaks.

Source: SecurityWeek

Source: SecurityWeek

Hacking AI Systems

Introduction to AI Hacking

From Pentester to Red Teamer

Jailbreaking AI

Context is King

Data Poisoning

Staying on the Straight and Narrow

Powered by ZeroBot

Related Articles

Linux Kernel Vulnerabilities & AI-Powered Malware

CIRCIA Cyber Incident Reporting

OpenAI Patches ChatGPT Agent Flaw