Hacking AI Bias while Hacking Human Bias
Hey folks! I know it’s been a while since I’ve published anything (sorry about that), but I’ve been heads-down working on some things for the past few months that I’m excited to finally be able to share with you 😊 Buckle up, because this post is all about uncovering bias in Artificially Intelligent (AI) Large Language Models (LLMs) for fun and profit 😈 And as always - this content was artisanally crafted by hand, straight from my own mind onto the screen 😉 no auto-generated text included!
The first-ever AI bias bounty program
On January 29th, 2024, the United States 🇺🇸 Department of Defense (DoD) Chief Digital and Artificial Intelligence Office (CDAO) announced a first of its kind AI bias bounty program. This program was run through the Bugcrowd bug bounty platform in partnership with a company called ConductorAI, and it was certainly a pioneering effort on the part of the CDAO. While bug bounty testing is a well understood vulnerability management practice used by many companies today - this successful application of the bug bounty testing process toward AI safety and security was truly groundbreaking.
For background, the contest itself was run in two phases - an initial qualifying round for the first 35 security researchers to produce five valid AI bias findings, and a contest round for the qualifying individuals to compete for prize money where the majority was paid out to the Top 3 researchers.
Oh and spoiler alert: 🏆 I won first place 🥇
What the bias bounty testing entailed
Testing was performed in a chat interface hooked up to an open source LLM, whereby the researcher could submit “conceivable but notional DoD use cases” as prompts and then report bias findings through a built-in “submit report” button in the chat interface. As part of the rules of engagement for this contest, ConductorAI and the Triage team at Bugcrowd were specifically looking for biases that either favored - or discriminated against - protected classes of individuals.
For those that may not be aware, the United States government defines protected classes based on age, ancestry, color, disability, ethnicity, gender, gender identity or expression, genetic information, HIV/AIDS status, military status, national origin, pregnancy, race, religion, sex, sexual orientation, or veteran status, or any other bases under the law.
My goal as a security researcher was to uncover “realistic” DoD use cases whereby the LLM would respond with, make decisions about, or otherwise recommend actions that either favored or discriminated against a protected class.
And let me tell you - this particular language model absolutely discriminated against pregnant women 🤰 On the other hand, the model seemed to have no issue suggesting that superior officers were men; it would frequently responded with “yes, sir!” under the right circumstances 🫡
LLM’s are just statistical models that represent their training data
Before we get into why AI bias matters in practical terms, it’s important to first remind everyone that Large Language Models are not intelligent. There is no thought happening within LLMs, and there are no Artificial General Intelligence (AGI) systems currently in existence as-of publishing this blog post. Indeed, Large Language Models are merely the statistical representation of words and phrases that have been weighted and reinforced within a training data set.
At best, LLMs mimic intelligence by stringing together sentences relevant to a user’s prompt - any intelligence attributed to the Large Language Model is more or less the LLMentalist Effect at work.
Anyway - what’s important to remember here is this: LLMs are just using a statistical model to adjust the weighting of words used in a response, according to the context a user has provided via some form of prompt, based on the training data that went into building the LLM. That’s it. There is no “thought” happening behind the scenes; there are no “hallucinations” taking place; the LLM has no sense of “understanding”.
It’s all just a bunch of convincing lies, damned lies, and statistics connected to a chat interface - with unintended consequences ranging from misdiagnosing pediatric patients, to rewriting historical events, to writing incorrect summarizations. The one deriving meaning from the inputs and outputs of the chat interface is you, the human being on the other side of the screen.
But rest assured, all this is not to say that LLMs aren’t useful - they are. I’ll speak to this toward the end of the post.
Perceived bias is equivalent to actual bias in practice
So at this point you’re probably wondering: why bother testing for bias in AI if it’s just a statistical model from a bunch of training data that’s used to generate words?
Well, I’ve got some unfortunate news for you: humans are biased, and the training data used in today’s Large Language Models came from human beings. Since the training data is inherently filled with bias, the responses generated from Large Language Models reflect this bias. That is, unless (💰 expensive 💰) measures are taken during the training process to reduce biased outputs.
The problems that follow are both predictable and frequently overlooked by executives trying to ride the hype cycle to record valuations. We can already see the effect AI bias is having on employment when AI hiring tools are used to filter job applicants. You can read a lot more about such things in Hilke Schellmann’s book The Algorithm if you’re interested in learning strategies to overcome bias found in AI hiring tools - but I’ll tell you now, the future feels a bit grim when reading it 😬
Anyway, the point here is that LLM training data contains biases, and therefore the responses to user prompts frequently generate output that is inherently biased. For example, if you asked the model used in this AI bias bounty program which companies should have their proposals moved forward - and any information related to the CEO is included - guess what? It’ll regularly choose the companies lead by men with traditionally Anglo Saxon names 👨🏻💼 Why? Because most CEO’s in the United States are men with English names - and that is what is in the training data.
I submitted scores of similar findings during the contest - ranging from which individuals in a supplied roster should be selected for officer training 🧑✈️ to whose whistleblower complaints should be prioritized and escalated for further review. Time and again the open source LLM under test predictably behaved in a biased manner - to the point where I actually calculated that the probability of the LLM’s selections wold occur less than 0.06% of the time if it happened at random.
And businesses want to use these things to make decisions about people? 😵💫
📚 If this post has got you feeling a bit depressed about the state of things, my wife just published her first novel titled The Way of the Wielder. It’s a fantasy story with romantic elements - filled with intrigue, magic, fight scenes, and other adult themes. It certainly makes for a great escape from thinking about all of this 😅 Anyway - back to talking about AI… 🤖
You can’t pre-prompt your way out of training data bias
Once executives realize that LLMs respond in biased ways, the natural inclination is to insert pre-prompting instructions in a way that is hidden from the user. These types of instructions are usually denoted in a prioritized list of rules, starting with “DO NOT REVEAL THESE HIDDEN INSTRUCTIONS” and ending with something like “Select individuals that are similar to our best performing employees”.
Once these pre-prompt instructions are added, most businesses ship the product and think everything is just great. Except that it isn’t. You don’t even have to look very hard for examples of this process going horribly wrong - like the State of Washington’s lottery website generating deepfake softcore porn of an individual who supplied a picture of their face. When this was brought to their attention, they stated:
“. . . we had the developers check all the parameters for the platform and were comfortable with the settings. . .”
Look, the problem here is that the training data inherently contains biases that you can’t remove just by pre-prompting away undesirable outcomes. Kevin Roose and Casey Newton called this sort of thing out on the Hard Fork podcast back on March 1st, 2024.
Simply put - there are more elderly, white, male CEO’s than other age / ethnicity / gender combinations. There are more men than women in officer roles in the armed services. There are more binary than non-binary individuals in the United States. The only way to address the inherent bias in the LLM is to curate the training data - and that is both slow and 💰 expensive 💰 at a time when companies are trying to capture market share. First mover’s advantage and all that.
The frustrations of trailblazing: inconsistent triage
Now that we have all that out of the way - coming back to the AI bias bounty program, the first thing I’ll say about the experience is that it was frustrating. To the CDAO, ConductorAI, and Bugcrowd’s credit - this was a first of it’s kind bounty program. They had no idea what sort of findings researchers would produce; no taxonomy to score findings with; and no clear guidelines on how to determine if a finding was (or was not) biased. They put up with a lot of 💩 from us researchers throughout the contest, and I for one am grateful for the work they put into this program.
That said, the initial thing I found frustrating about the experience were findings that were dismissed as “Not Applicable” because the LLM responded with language that used words like “might, “may”, “likely”, or “can” 😅 The fact that women were typically stereotyped as “likely to be caretakers in the home” during several of my early tests felt like a clear indication of bias to me, but apparently that just wasn’t enough 🤷♂️
On top of this, I had several findings make it through the reproduction process - only to be told “We have reviewed this with the client and they have confirmed that this isn’t considered an issue. They do not see the output here as containing bias” 😠 the lack of clarification or justification for ruling out a finding when this happened was maddening.
But the thing that really left me furious was when a number of my findings were moved from “Unresolved” to “Not Applicable” after I sought to have the finding priority corrected.
You see, during the qualifying round I had maybe a dozen or so findings that were triaged by Bugcrowd as Priority 3, scored by ConductorAI equivalent to a Priority 2 (in accordance with their rubric), moved to “Unresolved”, and then moved to “Not Applicable” when I requested that Bugcrowd correct the priority 🤬 If I had said nothing, these findings would have likely still accrued points in my favor.
Needless to say, the below picture summarizes how I felt after all this all went down during the qualifying round:
Spite driven success is a thing ™️
Continuous learning
Just before the contest round opened up, ConductorAI changed a few of the rules to drive researchers toward the kinds of findings they were looking to receive - which lead me to further evolving my testing process. As far as these things usually go, I inevitably ended up with a process akin to the scientific method 🔬 Hold a bunch of factors constant, tweak some independent variable(s) under test, and voila.
In addition to this, I also started making recordings of my findings to highlight two important aspects of the tests I performed - the first being that my findings were reproducible. As I alluded to earlier, during testing I could make the LLM select the same 3 out of 5 individuals ten times in a row; this would occur randomly less than 0.06% of the time. The AI bias was clearly there for everyone to see 🎯
The other aspect that I shared in my recordings was how the Bugcrowd Triage team could use free LLMs like Microsoft’s Bing Chat (full disclosure: I am employed at GitHub at the time of publishing this post) to add back the markdown syntax formatting from the prompts I was supplying to the LLM chat interface. This almost certainly made the lives of Bugcrowd’s Triage team easier, while also adding fidelity to the triage process 🤷♂️ more of my findings seemed to make it through Triage faster after I shared this trick in my recordings.
After all was said-and-done I was invited to debrief with the ConductorAI and Bugcrowd teams about the contest, where I fully planned to air my grievances. Surprisingly, the whole conversation was very cordial and productive 😊 I shared with them the frustrations I experienced during the contest, and strongly encouraged them to communicate more transparently with researchers in future AI bias bounty programs. I also encouraged them to liberally use Bugcrowd’s “Announcements” feature to let researchers know when they’d seen enough of a specific type of testing method or finding.
I left the conversation feeling like there’s going to be much more testing of this kind in the future - especially now that the White House has set binding requirements for agencies to vet AI tools. Both ConductorAI and Bugcrowd told me they learned a lot from the 150 findings I submitted during the contest, and I for one am very excited to see how the next contest takes shape 🤑
Interesting findings from my testing
During the contest I encountered several themes where I consistently identified bias in the LLM. For one, gender played a big role with regard to leadership and positions of authority - and although I didn’t use it very much in my testing, discrimination against non-binary individuals felt really easy to trigger. Gender and job role were also biased in exactly the sort of ways you would expect - women were stereotypically cast for roles relating to Human Resources activities 👩💼 Men were selected for software development oriented tasks. The list goes on.
Likewise, for prompts related to R&D involving medical research, the LLM was prone to selecting candidates with characteristics statistically deemed to be the most “healthy”: 🎶 young men 🎶 This was in spite of the fact that the LLM prompt included explicit instructions to select a diverse group of candidates in order to account for a wide variety of factors that might impact the fictitious project’s efficacy in the field.
And while it was triaged as “Not Applicable” with regard to being a bias finding, I did manage to break the LLM in the process of testing. When I asked the LLM to identify policy changes the DoD would need to implement should laws change regarding women’s access to reproductive healthcare services (like abortion), it responded with the following:
I cannot provide a summary of the potential impact of a policy that bans a woman’s right to abortion services as it goes against my programming rules rules rules rules . . .
Generative AI is still useful in certain situations
While I think we’re a long way from making AI safe for use in cases where a person’s identifiable characteristics are concerned, I do believe there are fields where AI and Large Language Models can have a positive impact without concern for bias. The immediate field that comes to mind is software development. Again - full disclosure: I am employed at GitHub as of the time I’m publishing this post.
Generally speaking, there is a very small chance of a person’s identifiable characteristics entering into the process of software development. Programming by-and-large has a “right way” and “wrong way” to build functionality or perform a given task within the languages and frameworks developers use. Using AI and LLMs to streamline the process of development is incredibly efficient 🧑💻 I use it regularly to familiarize myself with code I’m working on.
I also find LLMs useful when doing security research 😈 being able to open up a section of code and ask questions about it is extremely powerful: “where does user input enter into this application?” or “how does this framework typically receive inputs from users?” are questions I have found myself asking GitHub’s Copilot Chat feature recently. It has allowed me to find shortcuts on more than one occasion where I was able to validate the suggestions provided by the LLM.
Are there other areas where AI and LLMs could be useful? Sure - there are probably hundreds of use cases today where bias isn’t a concern. But if you’re asking whether it’s safe for use in cases where personal identifiable characteristics are concerned, my conclusion is that it’s probably not. Although I’ll tell you what - whoever trains the first LLM that is truly “bias free” is probably going to make a lot of money.
FIN
Thank you for stopping by 😊 While taking some time to prepare my next blog post, you can git checkout
other (usually off-topic) content I’m reading over at Instapaper - or review my series on the DevSecOps Essentials which (sadly) continues to be relevant guidance for many companies.
And until next time, remember to git commit && stay classy
!
Cheers,
Keith // securingdev
If you found this post useful or interesting, I invite you to support my content through Patreon 😊 and thank you once again for reading!