Graded Quiz: Integrating Visual and Video Modalities — Build Multimodal Generative AI Applications (IBM RAG and Agentic AI Professional Certificate) Answers 2025

Question 1

What should Alex focus on for accurate Sora video generation?

❌ Ensure video under 2 minutes
❌ High-resolution monitor
❌ Length of text prompt
✅ Crafting detailed and structured text prompts

Explanation:
Sora relies heavily on prompt detail—well-structured prompts produce accurate, controlled video outputs.


Question 2

What should Chris include in a Sora video prompt?

❌ Font style, music
❌ Color scheme, alignment
❌ Duration, frame rate, resolution
✅ Scene context, visual details, and motion

Explanation:
A clear scene description, concrete visual details, and camera/motion cues help Sora generate high-quality videos.
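As a sketch, the three elements named in the answer can be assembled into one structured prompt string. All wording below is invented for illustration; it is not from the course:

```python
# Illustrative only: building a structured Sora prompt from the three
# elements the answer names -- scene context, visual details, and motion.
prompt_parts = {
    "scene_context": "A quiet coastal village at sunrise",
    "visual_details": "warm golden light, fishing boats with peeling red paint, mist over the water",
    "motion": "slow aerial dolly shot gliding toward the harbor",
}

prompt = (
    f"{prompt_parts['scene_context']}. "
    f"{prompt_parts['visual_details']}. "
    f"Camera: {prompt_parts['motion']}."
)
print(prompt)
```

Keeping the parts separate makes it easy to vary one element (say, the camera motion) while holding the scene constant across generations.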


Question 3

Which stage converts an image into a model-ready format?

❌ Data augmentation
❌ Multimodal LLM processing
❌ Image validation alone
✅ Input processing

Explanation:
Input processing handles preprocessing—resizing, normalizing, encoding—to prepare the image for the model.
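A minimal, library-free sketch of that preprocessing: resize a grayscale image (a 2D list of pixel values) to a fixed size, then normalize pixels to [0, 1]. Real pipelines use libraries such as Pillow or torchvision; the 4×4 image and 2×2 target size here are made up for illustration:

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize of a 2D list of pixel values."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def normalize(image, max_val=255):
    """Scale pixel values into the [0, 1] range most models expect."""
    return [[px / max_val for px in row] for row in image]

image = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [10, 70, 130, 250],
    [10, 70, 130, 250],
]

# Resize to the model's expected input size, then normalize.
model_ready = normalize(resize_nearest(image, 2, 2))
print(model_ready)  # 2x2 grid of floats in [0, 1]
```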


Question 4

What combines visual features + text embeddings?

❌ Language generation component
❌ Image validator
✅ Multimodal fusion layer
❌ Visual encoder

Explanation:
The fusion layer merges image embeddings and text embeddings into one representation for captioning.
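A toy sketch of one common fusion strategy, concatenating an image embedding with a text embedding into one joint vector. The 4-dimensional vectors are invented for illustration; production fusion layers are learned (e.g., projection or cross-attention layers):

```python
image_embedding = [0.2, 0.7, 0.1, 0.9]   # from the visual encoder
text_embedding = [0.5, 0.3, 0.8, 0.4]    # from the text embedding model

# Concatenation fusion: the joint vector is what the language
# generation component consumes to produce a caption.
fused = image_embedding + text_embedding
print(len(fused))  # → 8
```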


Question 5

Which step prepares images by encoding them?

❌ Initialize model
❌ API setup
❌ Create message
✅ Encoding images to bytes for LLM processing

Explanation:
Images must be converted into an encoded byte format (commonly base64) before the model can process them.
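A minimal sketch of that encoding step using the standard library: read raw image bytes and base64-encode them, the payload format many multimodal chat APIs accept. The 8 signature bytes below stand in for a real image file:

```python
import base64

# In practice you would read a real file: open("photo.png", "rb").read()
png_bytes = bytes.fromhex("89504e470d0a1a0a")  # PNG file signature

# Base64-encode, then decode to an ASCII string for the API payload.
encoded = base64.b64encode(png_bytes).decode("ascii")
print(encoded)  # → iVBORw0KGgo=
```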


Question 6

Which component rejects invalid images?

❌ Language generation
❌ Text embedding
❌ Fusion layer
✅ Image validation step during preprocessing

Explanation:
The validation phase checks format, size, safety, and feasibility—rejecting unusable images early.
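An illustrative validation check (not the course's exact code): reject inputs whose bytes lack a known file signature or exceed a size cap. The 5 MB limit is a made-up example:

```python
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}
MAX_BYTES = 5 * 1024 * 1024  # illustrative 5 MB cap

def validate_image(data: bytes) -> str:
    """Return the detected format, or raise ValueError for unusable input."""
    if len(data) > MAX_BYTES:
        raise ValueError("image too large")
    for sig, fmt in SIGNATURES.items():
        if data.startswith(sig):
            return fmt
    raise ValueError("unsupported or corrupt image format")

print(validate_image(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # → png
```

Rejecting bad inputs here, before encoding, avoids wasting a model call on an image that can never produce a caption.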


Question 7

Which three stages describe the multimodal captioning pipeline?

❌ Dataset labeling, caption selection
❌ Tokenization, grammar checking
❌ Editing, tagging
✅ Input processing → image validation & encoding → multimodal LLM processing

Explanation:
This is the standard workflow: prepare input → validate/encode → generate caption.
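The three stages can be chained into one skeletal pipeline. `call_multimodal_llm` below is a hypothetical stub standing in for a real model call:

```python
import base64

def validate(data: bytes) -> bytes:
    """Stage 1/2: reject anything that is not a PNG in this sketch."""
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):
        raise ValueError("not a PNG image")
    return data

def encode(data: bytes) -> str:
    """Stage 2: base64-encode the validated bytes for the API payload."""
    return base64.b64encode(data).decode("ascii")

def call_multimodal_llm(encoded_image: str) -> str:
    """Stage 3: hypothetical stub for the real captioning model call."""
    return f"caption for payload of {len(encoded_image)} base64 chars"

def caption_pipeline(data: bytes) -> str:
    return call_multimodal_llm(encode(validate(data)))

print(caption_pipeline(b"\x89PNG\r\n\x1a\n" + b"\x00" * 9))
```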


🧾 Summary Table

Q  Correct Answer                            Key Concept
1  Detailed & structured prompts             Sora prompt quality
2  Scene context + visual details + motion   Video-generation prompt design
3  Input processing                          Image → model-ready format
4  Multimodal fusion layer                   Combining image + text embeddings
5  Encode images to bytes                    Preprocessing for Llama 4
6  Image validation step                     Rejecting invalid images
7  Processing → encoding → LLM               Multimodal captioning pipeline