AI generates high-quality images 30 times faster in a single step
With their DMD method, MIT researchers created a one-step AI image generator that achieves image quality comparable to Stable Diffusion v1.5 while being 30 times faster. Credit: Illustration by Alex Shipps/MIT CSAIL using six AI-generated images developed by researchers.

In our current age of artificial intelligence, computers can generate their own “art” by way of diffusion models, iteratively adding structure to a noisy initial state until a clear image or video emerges.

Diffusion models have suddenly grabbed a seat at everyone’s table: Enter a few words and experience instantaneous, dopamine-spiking dreamscapes at the intersection of reality and fantasy. Behind the scenes, however, lies a complex, time-intensive process requiring numerous iterations for the algorithm to perfect the image.

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have introduced a new framework that simplifies the multi-step process of traditional diffusion models into a single step, addressing previous limitations. This is done through a type of teacher-student model: teaching a new computer model to mimic the behavior of more complicated, original models that generate images.

The approach, known as distribution matching distillation (DMD), retains the quality of the generated images and allows for much faster generation.

“Our work is a novel method that accelerates current diffusion models such as Stable Diffusion and DALL·E 3 by 30 times,” says Tianwei Yin, an MIT Ph.D. student in electrical engineering and computer science, a CSAIL affiliate, and the lead researcher on the DMD framework.

“This advancement not only significantly reduces computational time but also retains, if not surpasses, the quality of the generated visual content. Theoretically, the approach marries the principles of generative adversarial networks (GANs) with those of diffusion models, achieving visual content generation in a single step, a stark contrast to the hundred steps of iterative refinement required by current diffusion models. It could potentially be a new generative modeling method that excels in speed and quality.”

This single-step diffusion model could enhance design tools, enabling quicker content creation and potentially supporting advancements in drug discovery and 3D modeling, where promptness and efficacy are key.

Distribution dreams

DMD has two components. First, it uses a regression loss, which anchors the mapping and ensures a coarse organization of the space of images, making training more stable.

Second, it uses a distribution matching loss, which ensures that the probability of generating a given image with the student model corresponds to its real-world occurrence frequency. To do this, it leverages two diffusion models that act as guides, helping the system understand the difference between real and generated images and making it possible to train the speedy one-step generator.

The system achieves faster generation by training a new network to minimize the distribution divergence between its generated images and those from the training dataset used by traditional diffusion models. “Our key insight is to approximate gradients that guide the improvement of the new model using two diffusion models,” says Yin.

“In this way, we distill the knowledge of the original, more complex model into the simpler, faster one while bypassing the notorious instability and mode collapse issues in GANs.”
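As a rough illustration of those two training terms (not the authors’ implementation), here is a toy one-dimensional sketch in which the analytic scores of two Gaussians stand in for the gradients that the two guiding diffusion models would estimate; the “images” are plain scalars, and all names and numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: the teacher's output distribution is N(mu_real, 1),
# and the one-step student is the affine map x = s*z + b applied to noise z.
mu_real = 2.0

# A few fixed (noise, teacher-output) pairs for the regression term,
# which anchors the mapping and stabilizes training.
z_pairs = rng.standard_normal(16)
x_pairs = mu_real + z_pairs  # pretend the multi-step teacher produced these

s, b, lr = 1.0, 0.0, 0.05
for step in range(500):
    z = rng.standard_normal(256)
    x = s * z + b  # one-step generation

    # Distribution matching gradient: the difference of two score functions.
    # In real DMD each score is estimated by a diffusion model; here both
    # distributions are Gaussian, so the scores are analytic.
    score_real = -(x - mu_real)                # score of N(mu_real, 1)
    score_fake = -(x - b) / max(s * s, 1e-6)   # score of the student's own outputs
    g_x = -(score_real - score_fake)           # gradient w.r.t. each sample x

    # Backpropagate through x = s*z + b, averaged over the batch.
    g_s = np.mean(g_x * z)
    g_b = np.mean(g_x)

    # Regression loss on the paired samples: ||s*z + b - x_teacher||^2.
    x_hat = s * z_pairs + b
    g_s += np.mean(2 * (x_hat - x_pairs) * z_pairs)
    g_b += np.mean(2 * (x_hat - x_pairs))

    s -= lr * g_s
    b -= lr * g_b

# The student's output distribution converges to the teacher's:
# b approaches mu_real and s approaches 1.
```

Both terms agree on the same fixed point here; in the real method, the regression anchor is what keeps the score-difference signal from destabilizing training early on.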

Yin and colleagues used pre-trained networks for the new student model, simplifying the process. By copying and fine-tuning parameters from the original models, the team achieved fast training convergence of the new model, which is capable of producing high-quality images with the same architectural foundation. “This enables combining with other system optimizations based on the original architecture to accelerate the creation process further,” adds Yin.
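A minimal sketch of that initialization idea, with made-up parameter names and toy arrays standing in for real network weights:

```python
import numpy as np

# Hypothetical teacher parameters; in practice these would be the weights
# of a pre-trained diffusion model such as Stable Diffusion v1.5.
teacher_params = {
    "unet.conv_in.weight": np.random.default_rng(0).standard_normal((4, 4)),
    "unet.conv_in.bias": np.zeros(4),
}

# Initialize the one-step student as an exact copy of the teacher, so both
# share the same architectural foundation and fine-tuning converges quickly.
student_params = {name: w.copy() for name, w in teacher_params.items()}

# Fine-tuning then updates only the student's copy; the teacher stays frozen.
student_params["unet.conv_in.bias"] += 0.01
```

Because the student keeps the teacher’s architecture, any serving optimization built for the original network applies to the distilled one as well.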

When put to the test against the usual methods across a wide range of benchmarks, DMD showed consistent performance. On the popular benchmark of generating images conditioned on specific ImageNet classes, DMD is the first one-step diffusion technique to produce images essentially on par with those from the original, more complex models, achieving a Fréchet inception distance (FID) score of just 0.3. This is impressive, since FID judges both the quality and the diversity of generated images.
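FID works by fitting a Gaussian to deep-network features of the real and generated images and measuring the distance between the two Gaussians. A deliberately simplified one-dimensional sketch, using raw scalars instead of the Inception-v3 features real FID uses, looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

def frechet_distance_1d(a, b):
    """Fréchet distance between 1-D Gaussians fitted to two feature samples.

    Real FID fits multivariate Gaussians to Inception-v3 activations; in one
    dimension the formula collapses to (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
    """
    return (a.mean() - b.mean()) ** 2 + (a.std() - b.std()) ** 2

real = rng.normal(0.0, 1.0, 10_000)       # stand-in for real-image features
good_gen = rng.normal(0.05, 1.0, 10_000)  # generator that nearly matches
poor_gen = rng.normal(1.5, 0.5, 10_000)   # generator that misses badly

# Lower is better: a generator whose output distribution matches the real
# one scores a far smaller distance than one that drifts away from it.
```

Because the score compares whole distributions rather than individual images, it penalizes both blurry outputs and mode collapse, which is why a near-zero gap to the teacher is a strong result.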

Furthermore, DMD excels in industrial-scale text-to-image generation and achieves state-of-the-art one-step generation performance. A slight quality gap remains on more challenging text-to-image applications, suggesting room for improvement down the line.

Additionally, the performance of the DMD-generated images is intrinsically linked to the capabilities of the teacher model used during the distillation process. In the current form, which uses Stable Diffusion v1.5 as the teacher model, the student inherits limitations such as rendering detailed depictions of text and small faces, suggesting that more advanced teacher models could further enhance DMD-generated images.

“Decreasing the number of iterations has been the Holy Grail in diffusion models since their inception,” says Fredo Durand, MIT professor of electrical engineering and computer science, CSAIL principal investigator, and a lead author on the paper. “We are very excited to finally enable single-step image generation, which will dramatically reduce compute costs and accelerate the process.”

“Finally, a paper that successfully combines the versatility and high visual quality of diffusion models with the real-time performance of GANs,” says Alexei Efros, a professor of electrical engineering and computer science at the University of California at Berkeley who was not involved in this study. “I expect this work to open up fantastic possibilities for high-quality real-time visual editing.”

The study is published on the arXiv preprint server.

Massachusetts Institute of Technology