Natural Language Processing & Word Embeddings: Sequence Models (Deep Learning Specialization) Answers: 2025
Question 1
True/False: Embedding vectors could be 60,000-dimensional to capture the full variation of the language.
- ✅ False
- ❌ True
Explanation: While you could make embeddings as large as the vocabulary, that’s unnecessary and wasteful. Good embeddings are low-dimensional (e.g., 50–1,000) and capture semantics via learned structure — increasing dimension to the vocab size mostly increases parameters and overfitting without practical benefit.
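To make the parameter-count point concrete, here is a quick back-of-the-envelope comparison (the 10,000-word vocabulary is an illustrative assumption, not a number from the quiz):

```python
vocab_size = 10_000  # illustrative vocabulary size (assumed for this example)

# Each word gets one embedding vector, so parameters = vocab_size * dimension.
params_300d = vocab_size * 300       # a typical embedding dimension
params_60kd = vocab_size * 60_000    # dimension blown up to "vocab scale"

print(params_300d)  # 3000000
print(params_60kd)  # 600000000
```

The 200x jump in parameters buys no extra semantic structure; it only makes the model slower and easier to overfit.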
Question 2
What is t-SNE?
- ❌ A supervised learning algorithm for learning word embeddings
- ❌ An open-source sequence modeling library
- ✅ A non-linear dimensionality reduction technique
- ❌ A linear transformation that allows us to solve analogies on word vectors
Explanation: t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear method for visualizing high-dimensional data in 2D/3D while preserving local neighborhood structure.
Question 3
Pretrained embeddings help recognize unseen synonyms (e.g., “upset”). True/False?
- ❌ False
- ✅ True
Explanation: Pretrained embeddings place semantically similar words (e.g., “sad”, “upset”, “unhappy”) near each other in vector space, so an RNN using those embeddings can generalize to words not present in your small labeled dataset.
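The “nearby in vector space” idea can be sketched with cosine similarity. The 3-d vectors below are made-up toy values (real pretrained embeddings such as GloVe are 50–300 dimensional), chosen so that the near-synonyms point in similar directions:

```python
import numpy as np

# Toy 3-d embeddings (made-up values for illustration only).
emb = {
    "sad":   np.array([0.9, 0.1, 0.0]),
    "upset": np.array([0.8, 0.2, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, ~0.0 = unrelated directions.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["sad"], emb["upset"]))  # high: near-synonyms
print(cosine(emb["sad"], emb["car"]))    # low: unrelated words
```

A classifier fed these vectors sees “upset” land almost on top of “sad”, which is exactly why it can generalize to words absent from its labeled training set.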
Question 4
Which analogy equations should hold for a good embedding? (Check all that apply)
- ✅ $e_{man} - e_{uncle} \approx e_{woman} - e_{aunt}$
- ✅ $e_{man} - e_{woman} \approx e_{uncle} - e_{aunt}$
- ❌ $e_{man} - e_{aunt} \approx e_{woman} - e_{uncle}$
- ❌ $e_{man} - e_{woman} \approx e_{aunt} - e_{unclee}$
Explanation: Good embeddings capture consistent relations: the male–female offset (man − woman) should be the same across analogous roles (uncle − aunt). Rearranging $e_{man} - e_{woman} \approx e_{uncle} - e_{aunt}$ gives $e_{man} - e_{uncle} \approx e_{woman} - e_{aunt}$, so both of those hold. The other two options flip a sign and do not.
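The sign argument is easy to check numerically. These 2-d vectors are constructed for illustration so that dimension 0 carries a “gender” offset and dimension 1 carries the family role; real embeddings only satisfy such analogies approximately:

```python
import numpy as np

# Illustrative 2-d embeddings (constructed, not learned):
# dim 0 = gender offset, dim 1 = family role.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
    "uncle": np.array([1.0, 1.0]),
    "aunt":  np.array([0.0, 1.0]),
}

# Consistent relation: man - woman matches uncle - aunt.
print(np.allclose(emb["man"] - emb["woman"], emb["uncle"] - emb["aunt"]))  # True

# Sign-flipped option: man - woman vs aunt - uncle does NOT match.
print(np.allclose(emb["man"] - emb["woman"], emb["aunt"] - emb["uncle"]))  # False
```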
Question 5
Most computationally efficient formula in Python to get the embedding of word 1021, if $C$ is the embedding matrix and $o_{1021}$ is a one-hot vector: $C^T o_{1021}$. True/False?
- ✅ False
- ❌ True
Explanation: The fastest way is direct indexing (e.g., C[1021]). Multiplying by a one-hot via matrix multiply is wasteful. Also whether C^T * o is correct depends on shape conventions — but direct lookup is both simpler and more efficient.
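A short demonstration that the two approaches agree but differ in cost (assuming the common convention of one word per row of the embedding matrix; the sizes are arbitrary):

```python
import numpy as np

vocab_size, dim = 10_000, 50
C = np.random.rand(vocab_size, dim)  # embedding matrix: one row per word

# One-hot approach: a full matrix-vector product, O(vocab_size * dim) work,
# almost all of it multiplying by zeros.
o = np.zeros(vocab_size)
o[1021] = 1.0
via_matmul = C.T @ o          # shape (dim,)

# Direct lookup: just copies one row, O(dim) work and no arithmetic.
via_index = C[1021]

print(np.allclose(via_matmul, via_index))  # True: identical result
```

Frameworks make the same choice internally: embedding layers are implemented as table lookups, not matrix multiplies.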
Question 6
When learning embeddings, we pick a word and predict surrounding words (or vice versa). True/False?
- ✅ True
- ❌ False
Explanation: Both Skip-Gram (predict context from word) and CBOW (predict word from context) follow this idea — learning embeddings by modeling local co-occurrence.
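The Skip-Gram side of this idea can be sketched as training-pair generation: each center word is paired with the words in a window around it (a minimal sketch; real implementations add subsampling, negative sampling, etc.):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sent = "the quick brown fox".split()
print(skipgram_pairs(sent, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW simply reverses the roles: the window words become the input and the center word the prediction target.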
Question 7
In word2vec you estimate $P(t \mid c)$, with $t$ the target and $c$ the context, chosen as the sequence of all words before $t$. True/False?
- ✅ False
- ❌ True
Explanation: Word2vec typically uses context words from a window around the target (both previous and following words), not only the words before t.
Question 8
After training, should we expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word? True/False?
- ❌ True
- ✅ False
Explanation: $\theta_t$ (the output-side parameters) and $e_c$ (the input-side embeddings) are two separate sets of parameters, and nothing in the training objective ties them together for the same word. The softmax encourages $\theta_t^\top e_c$ to be large when $t$ actually appears in the context of $c$; since a word rarely appears in its own context window, there is no reason for $\theta_w$ and $e_w$ to end up close.
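For reference, the softmax parameterization in question, $P(t \mid c) = \exp(\theta_t^\top e_c) / \sum_j \exp(\theta_j^\top e_c)$, can be sketched directly (random parameters here, standing in for trained ones; the sizes are arbitrary):

```python
import numpy as np

vocab, dim = 5, 4
rng = np.random.default_rng(0)
theta = rng.normal(size=(vocab, dim))  # output-side parameters, one theta_t per word
E = rng.normal(size=(vocab, dim))      # input-side embeddings, one e_c per word

def p_t_given_c(c):
    # logits[t] = theta_t . e_c ; subtract max for numerical stability.
    logits = theta @ E[c]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = p_t_given_c(2)
print(probs.sum())  # 1.0: a valid distribution over the vocabulary
```

Note that `theta` and `E` are independent arrays: training updates both, but never forces row $w$ of one toward row $w$ of the other.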
Question 9
In GloVe, is $X_{ij}$ the number of times word $j$ appears in the context of word $i$? True/False?
- ❌ False
- ✅ True
Explanation: GloVe uses a global word–word co-occurrence matrix $X$, where $X_{ij}$ counts how often word $j$ appears in the context of word $i$ (within a context window across the corpus).
Question 10
When will pretrained embeddings be helpful: $s_1 \ll s_2$ or $s_1 \gg s_2$?
- ❌ $s_1 \ll s_2$
- ✅ $s_1 \gg s_2$
Explanation: Pretraining helps most when the unlabeled/pretraining corpus $s_1$ is much larger than the labeled dataset $s_2$. Large source data yields richer embeddings that transfer well to smaller target tasks.
🧾 Summary Table
| Q # | Correct Answer(s) | Key concept |
|---|---|---|
| 1 | ✅ False | Embeddings don’t need to equal vocab size; huge dims are inefficient. |
| 2 | ✅ Nonlinear dimensionality reduction | t-SNE = visualization / nonlinear DR. |
| 3 | ✅ True | Pretrained embeddings help generalize to unseen but similar words. |
| 4 | ✅ $e_{man}-e_{uncle}\approx e_{woman}-e_{aunt}$; $e_{man}-e_{woman}\approx e_{uncle}-e_{aunt}$ | Embeddings capture consistent relation vectors (the two correct options are rearrangements of each other). |
| 5 | ✅ False | Direct indexing (C[idx]) is more efficient than multiplying by one-hot. |
| 6 | ✅ True | Skip-gram / CBOW predict surrounding words (context). |
| 7 | ✅ False | Context usually includes both sides, not only preceding words. |
| 8 | ✅ False | $\theta_t$ and $e_c$ are separate parameter sets; same-word vectors need not be close. |
| 9 | ✅ True | GloVe uses global co-occurrence counts $X_{ij}$. |
| 10 | ✅ $s_1 \gg s_2$ | Pretraining helps most when source corpus is much larger than target. |