Abstract
Renal cell carcinoma tumor images serve critical functions in a range of fields. Their use is limited, however, in scenarios that require images of tumors with specific characteristics, which are often difficult to obtain because of rarity or privacy concerns. Although general tumor synthesis has proven successful as a means of acquiring data, some domains demand precise control over, and accurate depiction of, tumor characteristics. Our study addresses these limitations by integrating RENAL score guidelines into the synthesis process, enabling clinically instructed tumor synthesis tailored to specific medical demands. We introduce a generative framework that first decodes a segmentation mask from the textual outputs of a multimodal large language model, using clinical descriptions derived from RENAL score descriptors. The decoded mask is then passed to a latent diffusion model, which transforms a healthy volume into a tumor-bearing one. Our results show strong alignment between the textual queries and the generated tumors, and the synthetic tumors closely resemble those from other synthetic and real-world sources.