Tuesday, July 1, 2025
Social icon element need JNews Essential plugin to be activated.
No Result
View All Result
Digital Currency Pulse
  • Home
  • Crypto/Coins
  • NFT
  • AI
  • Blockchain
  • Metaverse
  • Web3
  • Exchanges
  • DeFi
  • Scam Alert
  • Analysis
Crypto Marketcap
Digital Currency Pulse
  • Home
  • Crypto/Coins
  • NFT
  • AI
  • Blockchain
  • Metaverse
  • Web3
  • Exchanges
  • DeFi
  • Scam Alert
  • Analysis
No Result
View All Result
Digital Currency Pulse
No Result
View All Result

Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

October 11, 2024
in Artificial Intelligence
Reading Time: 6 mins read
A A
0

[ad_1]

Producing correct and aesthetically interesting visible texts in text-to-image era fashions presents a big problem. Whereas diffusion-based fashions have achieved success in creating various and high-quality pictures, they usually wrestle to provide legible and well-placed visible textual content. Frequent points embody misspellings, omitted phrases, and improper textual content alignment, notably when producing non-English languages similar to Chinese language. These limitations limit the applicability of such fashions in real-world use circumstances like digital media manufacturing and promoting, the place exact visible textual content era is important.

Present strategies for visible textual content era sometimes embed textual content instantly into the mannequin’s latent house or impose positional constraints throughout picture era. Nevertheless, these approaches include limitations. Byte Pair Encoding (BPE), generally used for tokenization in these fashions, breaks down phrases into subwords, complicating the era of coherent and legible textual content. Furthermore, the cross-attention mechanisms in these fashions will not be totally optimized, leading to weak alignment between the generated visible textual content and the enter tokens. Options similar to TextDiffuser and GlyphDraw try to unravel these issues with inflexible positional constraints or inpainting methods, however this usually results in restricted visible range and inconsistent textual content integration. Moreover, most present fashions solely deal with English textual content, leaving gaps of their capacity to generate correct texts in different languages, particularly Chinese language.

Researchers from Xiamen College, Baidu Inc., and Shanghai Synthetic Intelligence Laboratory launched two core improvements: enter granularity management and glyph-aware coaching. The combined granularity enter technique represents whole phrases as an alternative of subwords, bypassing the challenges posed by BPE tokenization and permitting for extra coherent textual content era. Moreover, a brand new coaching regime was launched, incorporating three key losses: (1) consideration alignment loss, which reinforces the cross-attention mechanisms by bettering text-to-token alignment; (2) native MSE loss, which ensures the mannequin focuses on crucial textual content areas throughout the picture; and (3) OCR recognition loss, designed to drive accuracy within the generated textual content. These mixed methods enhance each the visible and semantic features of textual content era whereas sustaining the standard of picture synthesis.

This method makes use of a latent diffusion framework with three foremost elements: a Variational Autoencoder (VAE) for encoding and decoding pictures, a UNet denoiser to handle the diffusion course of, and a textual content encoder to deal with enter prompts. To counter the challenges posed by BPE tokenization, the researchers employed a combined granularity enter technique, treating phrases as entire models quite than subwords. An OCR mannequin can be built-in to extract glyph-level options, refining the textual content embeddings utilized by the mannequin.

The mannequin is educated utilizing a dataset comprising 240,000 English samples and 50,000 Chinese language samples, filtered to make sure high-quality pictures with clear and coherent visible textual content. Each SD-XL and SDXL-Turbo spine fashions have been utilized, with coaching performed over 10,000 steps at a studying price of 2e-5.

This resolution exhibits vital enhancements in each textual content era accuracy and visible attraction. Precision, recall, and F1 scores for English and Chinese language textual content era notably surpass these of current strategies. For instance, OCR precision reaches 0.360, outperforming different baseline fashions like SD-XL and LCM-LoRA. The tactic generates extra legible, visually interesting textual content and integrates it extra seamlessly into pictures. Moreover, the brand new glyph-aware coaching technique permits multilingual assist, with the mannequin successfully dealing with Chinese language textual content era—an space the place prior fashions fall brief. These outcomes spotlight the mannequin’s superior capacity to provide correct and aesthetically coherent visible textual content, whereas sustaining the general high quality of the generated pictures throughout completely different languages.

In conclusion, the tactic developed right here advances the sphere of visible textual content era by addressing crucial challenges associated to tokenization and cross-attention mechanisms. The introduction of enter granularity management and glyph-aware coaching permits the era of correct, aesthetically pleasing textual content in each English and Chinese language. These improvements improve the sensible purposes of text-to-image fashions, notably in areas requiring exact multilingual textual content era.

Take a look at the Paper. All credit score for this analysis goes to the researchers of this mission. Additionally, don’t overlook to comply with us on Twitter and be a part of our Telegram Channel and LinkedIn Group. For those who like our work, you’ll love our publication.. Don’t Neglect to affix our 50k+ ML SubReddit

[Upcoming Event- Oct 17 202] RetrieveX – The GenAI Knowledge Retrieval Convention (Promoted)

Aswin AK is a consulting intern at MarkTechPost. He’s pursuing his Twin Diploma on the Indian Institute of Know-how, Kharagpur. He’s obsessed with knowledge science and machine studying, bringing a robust tutorial background and hands-on expertise in fixing real-life cross-domain challenges.

[ad_2]

Source link

Tags: BackbonecontrolEmpoweringGenerationGlyphAwareGranularityInputmodelsTextTrainingVisual
Previous Post

Crypto whale loses $35M on Blast Network in phishing attack

Next Post

Files Cross-Appeal Against the SEC

Next Post
Files Cross-Appeal Against the SEC

Files Cross-Appeal Against the SEC

A Look at the Growing Crypto Exchange

A Look at the Growing Crypto Exchange

Worldcoin Taps Dune to Bring On-Chain Data Access to Its Users

Worldcoin Taps Dune to Bring On-Chain Data Access to Its Users

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Social icon element need JNews Essential plugin to be activated.

CATEGORIES

  • Analysis
  • Artificial Intelligence
  • Blockchain
  • Crypto/Coins
  • DeFi
  • Exchanges
  • Metaverse
  • NFT
  • Scam Alert
  • Web3
No Result
View All Result

SITEMAP

  • About us
  • Disclaimer
  • DMCA
  • Privacy Policy
  • Terms and Conditions
  • Cookie Privacy Policy
  • Contact us

Copyright © 2024 Digital Currency Pulse.
Digital Currency Pulse is not responsible for the content of external sites.

No Result
View All Result
  • Home
  • Crypto/Coins
  • NFT
  • AI
  • Blockchain
  • Metaverse
  • Web3
  • Exchanges
  • DeFi
  • Scam Alert
  • Analysis
Crypto Marketcap

Copyright © 2024 Digital Currency Pulse.
Digital Currency Pulse is not responsible for the content of external sites.