Graded Quiz: Integrating Visual and Video Modalities - Build Multimodal Generative AI Applications (IBM RAG and Agentic AI Professional Certificate) Answers 2025
Question 1
What should Alex focus on for accurate Sora video generation?
❌ Ensure video under 2 minutes
❌ High-resolution monitor
❌ Length of text prompt
✅ Crafting detailed and structured text prompts
Explanation:
Sora relies heavily on prompt detail—well-structured prompts produce accurate, controlled video outputs.
Question 2
What should Chris include in a Sora video prompt?
❌ Font style, music
❌ Color scheme, alignment
❌ Duration, frame rate, resolution
✅ Scene context, visual details, and motion
Explanation:
Clear scene description + visual details + camera/motion cues help Sora generate high-quality videos.
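The three ingredients above can be sketched as a small prompt builder. This is a hypothetical helper (the function name and field labels are illustrative, not part of any Sora API); it only shows how scene context, visual details, and motion cues combine into one structured prompt string.

```python
def build_video_prompt(scene: str, visuals: str, motion: str) -> str:
    """Assemble a structured video-generation prompt from its three parts."""
    return f"Scene: {scene}. Visuals: {visuals}. Camera/motion: {motion}."

prompt = build_video_prompt(
    scene="a quiet harbor at dawn",
    visuals="soft golden light, mist over the water, wooden fishing boats",
    motion="slow dolly forward along the pier",
)
print(prompt)
```

Keeping the three parts explicit makes prompts easier to iterate on: you can tweak the motion cue without rewriting the scene description.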
Question 3
Which stage converts an image into a model-ready format?
❌ Data augmentation
❌ Multimodal LLM processing
❌ Image validation alone
✅ Input processing
Explanation:
Input processing handles preprocessing—resizing, normalizing, encoding—to prepare the image for the model.
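A minimal sketch of the resizing and normalizing steps, using plain Python lists instead of a real image library (a production pipeline would use something like Pillow or torchvision; the fixed length and 0-255 range here are illustrative assumptions):

```python
def preprocess_pixels(pixels: list[int], target_len: int = 4) -> list[float]:
    """Toy 'input processing': pad/crop to a fixed size, then normalize
    0-255 pixel values into the 0.0-1.0 range a model typically expects."""
    fixed = (pixels + [0] * target_len)[:target_len]  # pad or crop to size
    return [p / 255.0 for p in fixed]                 # normalize to [0, 1]

print(preprocess_pixels([255, 128, 0]))
```

The key point matches the quiz answer: the raw image is reshaped into a fixed, numeric, model-ready format before anything else happens.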
Question 4
What combines visual features + text embeddings?
❌ Language generation component
❌ Image validator
✅ Multimodal fusion layer
❌ Visual encoder
Explanation:
The fusion layer merges image embeddings and text embeddings into one representation for captioning.
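A toy illustration of that merge, assuming the simplest possible fusion (concatenation). Real fusion layers use learned projections or cross-attention, but the output is likewise a single joint representation:

```python
def fuse(image_emb: list[float], text_emb: list[float]) -> list[float]:
    """Concatenate image and text embeddings into one joint vector.
    Stand-in for a learned multimodal fusion layer."""
    return image_emb + text_emb

joint = fuse([0.1, 0.2, 0.3], [0.9, 0.8])
print(len(joint))  # one vector carrying both modalities
```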
Question 5
Which step prepares images by encoding them?
❌ Initialize model
❌ API setup
❌ Create message
✅ Encoding images to bytes for LLM processing
Explanation:
Images must be converted into encoded tensor/byte format before the model can process them.
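A common concrete form of this step is base64-encoding the raw bytes so the image can travel inside a JSON request to a multimodal LLM API. A minimal sketch with the standard library (the fake PNG payload is illustrative):

```python
import base64

def encode_image_bytes(raw: bytes) -> str:
    """Base64-encode raw image bytes for transport in a JSON payload."""
    return base64.b64encode(raw).decode("ascii")

fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8  # PNG signature + dummy data
encoded = encode_image_bytes(fake_png)
print(encoded[:12])
```

The round trip is lossless: decoding the base64 string recovers the original bytes exactly.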
Question 6
Which component rejects invalid images?
❌ Language generation
❌ Text embedding
❌ Fusion layer
✅ Image validation step during preprocessing
Explanation:
The validation phase checks format, size, safety, and feasibility—rejecting unusable images early.
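The format and size checks can be sketched with magic-byte signatures and a size cap (the 5 MB limit is an illustrative assumption; real validators also check dimensions and content safety):

```python
PNG_SIG = b"\x89PNG\r\n\x1a\n"
JPEG_SIG = b"\xff\xd8\xff"
MAX_BYTES = 5 * 1024 * 1024  # illustrative 5 MB cap

def validate_image(raw: bytes) -> bool:
    """Reject payloads that are too large or not recognizably PNG/JPEG."""
    if len(raw) > MAX_BYTES:
        return False
    return raw.startswith(PNG_SIG) or raw.startswith(JPEG_SIG)

print(validate_image(PNG_SIG + b"\x00\x00"))  # valid PNG header
print(validate_image(b"not an image"))        # rejected early
```

Rejecting unusable inputs here saves a wasted (and possibly billed) call to the model downstream.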
Question 7
Which three stages describe the multimodal captioning pipeline?
❌ Dataset labeling, caption selection
❌ Tokenization, grammar checking
❌ Editing, tagging
✅ Input processing → image validation & encoding → multimodal LLM processing
Explanation:
This is the standard workflow: prepare input → validate/encode → generate caption.
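The three stages can be composed into one pipeline function. Everything here is a stand-in (the stage names mirror the quiz answer; the final stage fakes the LLM call rather than invoking a real model):

```python
import base64

def input_processing(raw: bytes) -> bytes:
    """Stage 1: prepare the input (resize/normalize in a real system)."""
    return raw

def validate_and_encode(raw: bytes) -> str:
    """Stage 2: reject empty payloads, then base64-encode for transport."""
    if not raw:
        raise ValueError("invalid image")
    return base64.b64encode(raw).decode("ascii")

def llm_caption(encoded: str) -> str:
    """Stage 3: stand-in for the multimodal LLM captioning call."""
    return f"caption for {len(encoded)}-char payload"

def caption_pipeline(raw: bytes) -> str:
    return llm_caption(validate_and_encode(input_processing(raw)))

print(caption_pipeline(b"\x89PNG fake bytes"))
```

Each stage has one job, which is exactly why the quiz frames the pipeline as three distinct steps rather than one monolithic call.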
🧾 Summary Table
| Q No. | Correct Answer | Key Concept |
|---|---|---|
| 1 | Detailed & structured prompts | Sora prompt quality |
| 2 | Scene context + visual details + motion | Video generation prompt design |
| 3 | Input processing | Image → model format |
| 4 | Multimodal fusion layer | Combine image + text embeddings |
| 5 | Encode images to bytes | Preprocessing for Llama 4 |
| 6 | Image validation step | Rejecting images |
| 7 | Processing → encoding → LLM | Multimodal caption pipeline |