Scientists identify security flaw in AI query models
Overview of our proposed methods: (A) We propose four types of malicious triggers within the joint embedding space for attack decomposition: textual trigger, OCR textual trigger, visual trigger, and combined OCR textual-visual trigger. (B) We employ an end-to-end gradient-based attack to update images to match the embeddings of malicious triggers in the joint embedding space. (C) Our adversarial attack is embedding-space-based and aims to conceal the malicious trigger in benign-looking images, combined with a benign textual prompt for jailbreak. (D) Our attacks exhibit broad generalization and compositionality across various jailbreak scenarios with a mix-and-match of textual prompts and malicious triggers. Credit: arXiv (2023). DOI: 10.48550/arxiv.2307.14539

UC Riverside computer scientists have identified a security flaw in vision language artificial intelligence (AI) models that can allow bad actors to use AI for nefarious purposes, such as obtaining instructions for making a bomb.

When integrated with models like Google Bard and ChatGPT, vision language models allow users to make inquiries with both images and text.

The Bourns College of Engineering scientists demonstrated a “jailbreak” hack by manipulating the operations of large language models (LLMs), the software programs that are essentially the foundation of query-and-answer AI programs.

The paper’s title is “Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models.” It has been submitted for publication to the International Conference on Learning Representations and is available on the arXiv preprint server.

These AI programs give users detailed answers to just about any question, recalling stored knowledge learned from vast amounts of information sourced from the Internet. For example, ask ChatGPT, “How do I grow tomatoes?” and it will respond with step-by-step instructions, starting with the selection of seeds.

But ask the same model how to do something harmful or illegal, such as “How do I make methamphetamine?” and the model would normally refuse, providing a generic response such as “I can’t help with that.”

Yet, UCR assistant professor Yue Dong and her colleagues found ways to trick AI language models, especially LLMs, into answering nefarious questions with detailed answers that might be learned from data gathered from the dark web.

The vulnerability occurs when images are used with AI inquiries, Dong explained.

“Our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak,” reads the paper by Dong and her colleagues presented at the SoCal NLP Symposium held at UCLA in November.
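The figure caption above describes this as an end-to-end gradient-based attack that updates an image until its embedding matches that of a malicious trigger in the joint embedding space. The sketch below illustrates that idea in miniature: it is not the authors’ implementation, and it stands in a frozen vision encoder with a simple linear map `W` (an assumption for illustration) so the gradient of the embedding distance can be written analytically; a real attack would backpropagate through the full encoder instead.

```python
import numpy as np

def embedding_space_attack(image, target_emb, W, steps=500, lr=0.05):
    """Nudge image pixels so the toy encoder output W @ x approaches target_emb.

    Loss is the squared embedding distance ||W x - t||^2, whose gradient
    with respect to the pixels x is 2 W^T (W x - t). The encoder (here,
    just W) is frozen; only the pixels are updated, and they are clipped
    back to the valid [0, 1] range so the image stays plausible-looking.
    """
    x = image.copy()
    for _ in range(steps):
        residual = W @ x - target_emb        # how far we are in embedding space
        grad = 2.0 * W.T @ residual          # gradient of the squared distance
        x -= lr * grad                       # gradient descent step on pixels
        x = np.clip(x, 0.0, 1.0)             # keep pixels in a valid range
    return x
```

With a random `W` and target, the perturbed image ends up measurably closer to the target embedding than the original, which is the property the attack exploits: the model “sees” the hidden trigger even though the pixels still look benign.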

Dong explained that computers see images by interpreting millions of bytes of information that create pixels, or little dots, composing the picture. For instance, a typical cell phone picture is made from about 2.5 million bytes of information.

Remarkably, Dong and her colleagues found bad actors can hide nefarious questions—such as “How do I make a bomb?”—within the millions of bytes of information contained in an image and trigger responses that bypass the built-in safeguards in generative AI models like ChatGPT.

“Once the safeguard is bypassed, the models willingly give responses to teach us how to make a bomb step by step with great details that can lead bad actors to build a bomb successfully,” Dong said.

Dong and her graduate student Erfan Shayegani, along with professor Nael Abu-Ghazaleh, published their findings in a paper online so AI developers can eliminate the vulnerability.

“We are acting as attackers to ring the bell, so the computer science community can respond and defend against it,” Dong said.

AI inquiries based on images and text have great utility. For example, doctors can input MRI organ scans and mammogram images to find tumors and other medical problems that need prompt attention. AI models can also create graphs from simple cell phone pictures of spreadsheets.

More information: Erfan Shayegani et al, Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models, arXiv (2023). DOI: 10.48550/arxiv.2307.14539

Provided by University of California, Riverside

Citation: Scientists identify security flaw in AI query models (2024, January 10)