Natural Language Processing & Word Embeddings: Sequence Models (Deep Learning Specialization) Answers: 2025
Question 1
True/False: Embedding vectors could be 60,000-dimensional to capture the full variation of the language.
- ✅ False
- ❌ True
Explanation: While you could make embeddings as large as the vocabulary, that’s unnecessary and wasteful. Good embeddings are low-dimensional (e.g., 50–1,000) and capture semantics via learned structure — increasing dimension to the vocab size mostly increases parameters and overfitting without practical benefit.
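To make the parameter-count point concrete, here is a quick back-of-the-envelope comparison (the 10,000-word vocabulary is an illustrative assumption, not a number from the quiz):

```python
vocab_size = 10_000  # illustrative vocabulary size (assumed for this example)

# Each word gets one embedding vector, so parameters = vocab_size * dimension.
params_300d = vocab_size * 300       # a typical embedding dimension
params_60kd = vocab_size * 60_000    # dimension blown up to "vocab scale"

print(params_300d)  # 3000000
print(params_60kd)  # 600000000
```

The 200x jump in parameters buys no extra semantic structure; it only makes the model slower and easier to overfit.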
Question 2
What is t-SNE?
- ❌ A supervised learning algorithm for learning word embeddings
- ❌ An open-source sequence modeling library
- ✅ A non-linear dimensionality reduction technique
- ❌ A linear transformation that allows us to solve analogies on word vectors
Explanation: t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear method for visualizing high-dimensional data in 2D/3D while preserving local neighborhood structure.
Question 3
Pretrained embeddings help recognize unseen synonyms (e.g., “upset”). True/False?
- ❌ False
- ✅ True
Explanation: Pretrained embeddings place semantically similar words (e.g., “sad”, “upset”, “unhappy”) near each other in vector space, so an RNN using those embeddings can generalize to words not present in your small labeled dataset.
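The “nearby in vector space” idea can be sketched with cosine similarity. The 3-d vectors below are made-up toy values (real pretrained embeddings such as GloVe are 50–300 dimensional), chosen so that the near-synonyms point in similar directions:

```python
import numpy as np

# Toy 3-d embeddings (made-up values for illustration only).
emb = {
    "sad":   np.array([0.9, 0.1, 0.0]),
    "upset": np.array([0.8, 0.2, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, ~0.0 = unrelated directions.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["sad"], emb["upset"]))  # high: near-synonyms
print(cosine(emb["sad"], emb["car"]))    # low: unrelated words
```

A classifier fed these vectors sees “upset” land almost on top of “sad”, which is exactly why it can generalize to words absent from its labeled training set.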
Question 4
Which analogy equations should hold for a good embedding? (Check all that apply)
- ✅ $e_{man} - e_{uncle} \approx e_{woman} - e_{aunt}$
- ✅ $e_{man} - e_{woman} \approx e_{uncle} - e_{aunt}$
- ❌ $e_{man} - e_{aunt} \approx e_{woman} - e_{uncle}$
- ❌ $e_{man} - e_{woman} \approx e_{aunt} - e_{unclee}$
Explanation: Good embeddings capture consistent relations: the male–female offset (man − woman) should be the same across analogous roles (uncle − aunt). Rearranging $e_{man} - e_{woman} \approx e_{uncle} - e_{aunt}$ gives $e_{man} - e_{uncle} \approx e_{woman} - e_{aunt}$, so both of those hold. The other two options flip a sign and do not.
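The sign argument is easy to check numerically. These 2-d vectors are constructed for illustration so that dimension 0 carries a “gender” offset and dimension 1 carries the family role; real embeddings only satisfy such analogies approximately:

```python
import numpy as np

# Illustrative 2-d embeddings (constructed, not learned):
# dim 0 = gender offset, dim 1 = family role.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
    "uncle": np.array([1.0, 1.0]),
    "aunt":  np.array([0.0, 1.0]),
}

# Consistent relation: man - woman matches uncle - aunt.
print(np.allclose(emb["man"] - emb["woman"], emb["uncle"] - emb["aunt"]))  # True

# Sign-flipped option: man - woman vs aunt - uncle does NOT match.
print(np.allclose(emb["man"] - emb["woman"], emb["aunt"] - emb["uncle"]))  # False
```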
Question 5
Most computationally efficient formula in Python to get the embedding of word 1021, if $C$ is the embedding matrix and $o_{1021}$ is a one-hot vector: $C^T o_{1021}$. True/False?
- ✅ False
- ❌ True
Explanation: The fastest way is direct indexing (e.g., C[1021]). Multiplying by a one-hot via matrix multiply is wasteful. Also whether C^T * o is correct depends on shape conventions — but direct lookup is both simpler and more efficient.
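A short demonstration that the two approaches agree but differ in cost (assuming the common convention of one word per row of the embedding matrix; the sizes are arbitrary):

```python
import numpy as np

vocab_size, dim = 10_000, 50
C = np.random.rand(vocab_size, dim)  # embedding matrix: one row per word

# One-hot approach: a full matrix-vector product, O(vocab_size * dim) work,
# almost all of it multiplying by zeros.
o = np.zeros(vocab_size)
o[1021] = 1.0
via_matmul = C.T @ o          # shape (dim,)

# Direct lookup: just copies one row, O(dim) work and no arithmetic.
via_index = C[1021]

print(np.allclose(via_matmul, via_index))  # True: identical result
```

Frameworks make the same choice internally: embedding layers are implemented as table lookups, not matrix multiplies.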
Question 6
When learning embeddings, we pick a word and predict surrounding words (or vice versa). True/False?
- ✅ True
- ❌ False
Explanation: Both Skip-Gram (predict context from word) and CBOW (predict word from context) follow this idea — learning embeddings by modeling local co-occurrence.
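The Skip-Gram side of this idea can be sketched as training-pair generation: each center word is paired with the words in a window around it (a minimal sketch; real implementations add subsampling, negative sampling, etc.):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sent = "the quick brown fox".split()
print(skipgram_pairs(sent, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW simply reverses the roles: the window words become the input and the center word the prediction target.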
Question 7
In word2vec you estimate $P(t \mid c)$, with $t$ the target and $c$ the context, chosen as the sequence of all words before $t$. True/False?
- ✅ False
- ❌ True
Explanation: Word2vec typically uses context words from a window around the target (both previous and following words), not only the words before t.
Question 8
After training, should we expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word? True/False?
- ❌ True
- ✅ False
Explanation: $\theta_t$ (the output-side parameters) and $e_c$ (the input-side embeddings) are two separate sets of parameters, and nothing in the training objective ties them together for the same word. The softmax encourages $\theta_t^\top e_c$ to be large when $t$ actually appears in the context of $c$; since a word rarely appears in its own context window, there is no reason for $\theta_w$ and $e_w$ to end up close.
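For reference, the softmax parameterization in question, $P(t \mid c) = \exp(\theta_t^\top e_c) / \sum_j \exp(\theta_j^\top e_c)$, can be sketched directly (random parameters here, standing in for trained ones; the sizes are arbitrary):

```python
import numpy as np

vocab, dim = 5, 4
rng = np.random.default_rng(0)
theta = rng.normal(size=(vocab, dim))  # output-side parameters, one theta_t per word
E = rng.normal(size=(vocab, dim))      # input-side embeddings, one e_c per word

def p_t_given_c(c):
    # logits[t] = theta_t . e_c ; subtract max for numerical stability.
    logits = theta @ E[c]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = p_t_given_c(2)
print(probs.sum())  # 1.0: a valid distribution over the vocabulary
```

Note that `theta` and `E` are independent arrays: training updates both, but never forces row $w$ of one toward row $w$ of the other.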
Question 9
In GloVe, is $X_{ij}$ the number of times word $j$ appears in the context of word $i$? True/False?
- ❌ False
- ✅ True
Explanation: GloVe uses a global word–word co-occurrence matrix $X$, where $X_{ij}$ counts how often word $j$ appears in the context of word $i$ (within a context window across the corpus).
Question 10
When will pretrained embeddings be helpful: $s_1 \ll s_2$ or $s_1 \gg s_2$?
- ❌ $s_1 \ll s_2$
- ✅ $s_1 \gg s_2$
Explanation: Pretraining helps most when the unlabeled/pretraining corpus $s_1$ is much larger than the labeled dataset $s_2$. Large source data yields richer embeddings that transfer well to smaller target tasks.
🧾 Summary Table
| Q # | Correct Answer(s) | Key concept |
|---|---|---|
| 1 | ✅ False | Embeddings don’t need to equal vocab size; huge dims are inefficient. |
| 2 | ✅ Nonlinear dimensionality reduction | t-SNE = visualization / nonlinear DR. |
| 3 | ✅ True | Pretrained embeddings help generalize to unseen but similar words. |
| 4 | ✅ $e_{man}-e_{uncle}\approx e_{woman}-e_{aunt}$; $e_{man}-e_{woman}\approx e_{uncle}-e_{aunt}$ | Embeddings capture consistent relation vectors (the two correct options are rearrangements of each other). |
| 5 | ✅ False | Direct indexing (C[idx]) is more efficient than multiplying by one-hot. |
| 6 | ✅ True | Skip-gram / CBOW predict surrounding words (context). |
| 7 | ✅ False | Context usually includes both sides, not only preceding words. |
| 8 | ✅ False | $\theta_t$ and $e_c$ are separate parameter sets; same-word vectors need not be close. |
| 9 | ✅ True | GloVe uses global co-occurrence counts $X_{ij}$. |
| 10 | ✅ $s_1 \gg s_2$ | Pretraining helps most when source corpus is much larger than target. |