
Natural Language Processing & Word Embeddings: Sequence Models (Deep Learning Specialization) Answers: 2025

Question 1

True/False: Embedding vectors could be 60,000-dimensional so as to capture the full range of variation and meaning of the words.

  • False

  • ❌ True

Explanation: While you could make embeddings as large as the vocabulary, that’s unnecessary and wasteful. Good embeddings are low-dimensional (e.g., 50–1,000) and capture semantics via learned structure — increasing dimension to the vocab size mostly increases parameters and overfitting without practical benefit.


Question 2

What is t-SNE?

  • ❌ A supervised learning algorithm for learning word embeddings

  • ❌ An open-source sequence modeling library

  • A non-linear dimensionality reduction technique

  • ❌ A linear transformation that allows us to solve analogies on word vectors

Explanation: t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear method for visualizing high-dimensional data in 2D/3D while preserving local neighborhood structure.
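A minimal sketch of typical t-SNE usage, assuming scikit-learn is installed (the toy 50-dimensional vectors stand in for real word embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is available

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 50))  # 10 toy "word embeddings", 50-dimensional

# Project nonlinearly down to 2D; perplexity must be smaller than the sample count.
coords = TSNE(n_components=2, perplexity=5.0, random_state=0).fit_transform(vectors)
print(coords.shape)  # one 2D point per word, ready for a scatter plot
```

The output coordinates preserve local neighborhoods, not global distances, which is why t-SNE is a visualization tool rather than a general-purpose linear transform.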


Question 3

Pretrained embeddings help recognize unseen synonyms (e.g., “upset”). True/False?

  • ❌ False

  • True

Explanation: Pretrained embeddings place semantically similar words (e.g., “sad”, “upset”, “unhappy”) near each other in vector space, so an RNN using those embeddings can generalize to words not present in your small labeled dataset.
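A numpy sketch of this generalization, using made-up 3-dimensional vectors chosen by hand so the sentiment words cluster (real pretrained embeddings would be loaded, not hard-coded):

```python
import numpy as np

# Toy embeddings, hand-picked so that negative-sentiment words lie close together.
emb = {
    "sad":     np.array([0.9, 0.1, 0.0]),
    "unhappy": np.array([0.8, 0.2, 0.1]),
    "upset":   np.array([0.85, 0.15, 0.05]),  # unseen in the small labeled dataset
    "pizza":   np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "upset" sits far closer to the training words "sad"/"unhappy" than to "pizza",
# so a model trained on these embeddings can generalize to it.
print(cosine(emb["upset"], emb["sad"]) > cosine(emb["upset"], emb["pizza"]))  # True
```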


Question 4

Which analogy equations should hold for a good embedding? (Check all that apply)

  • e_{man} − e_{uncle} ≈ e_{woman} − e_{aunt}

  • e_{man} − e_{woman} ≈ e_{uncle} − e_{aunt}

  • ❌ e_{man} − e_{aunt} ≈ e_{woman} − e_{uncle}

  • ❌ e_{man} − e_{woman} ≈ e_{aunt} − e_{uncle}

Explanation: Good embeddings capture consistent relation vectors: the male–female difference (man − woman) should be the same across analogous roles (uncle − aunt). The first two equations are algebraic rearrangements of that one relation, so both should hold; the last two flip the gender direction on one side and therefore should not.
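The relation can be checked numerically. Here is a sketch with hand-constructed 4-dimensional vectors (two "gender" axes plus two "role" axes, invented so the analogy holds exactly):

```python
import numpy as np

emb = {
    "man":   np.array([1.0, 0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 1.0, 1.0, 0.0]),
    "uncle": np.array([1.0, 0.0, 0.0, 1.0]),
    "aunt":  np.array([0.0, 1.0, 0.0, 1.0]),
}

# e_man - e_woman and e_uncle - e_aunt both equal the "male minus female" direction.
diff1 = emb["man"] - emb["woman"]
diff2 = emb["uncle"] - emb["aunt"]
print(np.allclose(diff1, diff2))  # True

# Solving "man is to uncle as woman is to ?" by vector arithmetic:
query = emb["uncle"] - emb["man"] + emb["woman"]
answer = min(emb, key=lambda w: float(np.linalg.norm(emb[w] - query)))
print(answer)  # aunt
```

With real embeddings the match is approximate, so the analogy is answered by a nearest-neighbor search (usually by cosine similarity) rather than an exact equality.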


Question 5

True/False: If C is the embedding matrix and o_{1021} is a one-hot vector, the most computationally efficient formula in Python to get the embedding of word 1021 is C^T * o_{1021}.

  • False

  • ❌ True

Explanation: The fastest way is direct indexing (e.g., C[1021]). Multiplying by a one-hot via matrix multiply is wasteful. Also whether C^T * o is correct depends on shape conventions — but direct lookup is both simpler and more efficient.
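A quick numpy check of this point (shapes assumed here: C is vocab × dim, one embedding per row): the one-hot product and direct indexing give the same vector, but indexing skips the full matrix multiply.

```python
import numpy as np

vocab, dim = 10000, 300
rng = np.random.default_rng(0)
C = rng.normal(size=(vocab, dim))    # embedding matrix, one row per word

o = np.zeros(vocab)
o[1021] = 1.0                        # one-hot vector for word 1021

# C^T @ o performs ~vocab*dim multiply-adds just to select one row...
via_matmul = C.T @ o
# ...while direct indexing is a constant-time row lookup.
via_index = C[1021]

print(np.allclose(via_matmul, via_index))  # True
```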


Question 6

When learning embeddings, we pick a word and predict surrounding words (or vice versa). True/False?

  • True

  • ❌ False

Explanation: Both Skip-Gram (predict context from word) and CBOW (predict word from context) follow this idea — learning embeddings by modeling local co-occurrence.
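The Skip-Gram side of this can be sketched in plain Python: every position emits (target, context) training pairs from a symmetric window around the target (the window size and sentence below are arbitrary illustration choices):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs from a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)
print(("brown", "the") in pairs and ("brown", "jumps") in pairs)  # True
```

CBOW uses the same pairs in the opposite direction: the window words jointly predict the center word.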


Question 7

In word2vec you estimate P(t ∣ c), with t the target and c the context, where the context is chosen as the sequence of all words before t. True/False?

  • False

  • ❌ True

Explanation: Word2vec typically uses context words from a window around the target (both previous and following words), not only the words before t.


Question 8

After training, should we expect θ_t to be very close to e_c when t and c are the same word? True/False?

  • ❌ True

  • False

Explanation: θ_t (an output, or "softmax", parameter vector) and e_c (an input embedding) are trained as two separate sets of parameters. The objective pushes θ_t^⊤ e_c to be large when word t actually appears in the context of word c, which is rarely the case for t = c, so there is no reason to expect the two vectors to be very close for the same word.
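The softmax formulation referred to here can be sketched in numpy, with θ and e kept as two separate parameter matrices (random, hence untrained and purely illustrative):

```python
import numpy as np

vocab, dim = 1000, 50
rng = np.random.default_rng(0)
theta = rng.normal(size=(vocab, dim))  # output ("softmax") vectors, one per word
e = rng.normal(size=(vocab, dim))      # input embedding vectors, one per word

def p_t_given_c(c):
    """P(t | c) = softmax over all targets t of theta_t . e_c."""
    logits = theta @ e[c]
    logits -= logits.max()             # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

probs = p_t_given_c(c=42)
print(np.isclose(probs.sum(), 1.0))   # a proper distribution over the vocabulary
```

Keeping theta and e as distinct matrices mirrors how word2vec is actually parameterized; in practice this full softmax is also too expensive for large vocabularies, motivating negative sampling or hierarchical softmax.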


Question 9

In GloVe, is X_{ij} the number of times word j appears in the context of word i? True/False?

  • ❌ False

  • True

Explanation: GloVe uses a global word–word co-occurrence matrix X, where X_{ij} counts how often word j appears in the context of word i (within a context window across the corpus).
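A sketch of how such a count matrix might be built from a toy corpus with a symmetric context window (the window size and sentence are arbitrary choices for illustration):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """X[i][j] = number of times word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for i, wi in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[wi][tokens[j]] += 1
    return X

corpus = "ice is cold and steam is hot".split()
X = cooccurrence_counts(corpus, window=2)
print(X["is"]["cold"])  # how often "cold" occurs in the context of "is"
```

With a symmetric window the matrix is symmetric (X_{ij} = X_{ji}); GloVe then fits word vectors so that their dot products approximate the logarithms of these counts.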


Question 10

When will pretrained word embeddings (trained on a corpus of s_1 words) be helpful for a task with a labeled dataset of s_2 words: s_1 ≪ s_2 or s_1 ≫ s_2?

  • ❌ s_1 ≪ s_2

  • s_1 ≫ s_2

Explanation: Pretraining helps most when the unlabeled/pretraining corpus s_1 is much larger than the labeled dataset s_2. Large source data yields richer embeddings that transfer well to smaller target tasks.


🧾 Summary Table

Q #  Correct Answer(s)  Key concept
1  ✅ False  Embeddings don’t need to equal vocab size; huge dims are inefficient.
2  ✅ Nonlinear dimensionality reduction  t-SNE = visualization / nonlinear DR.
3  ✅ True  Pretrained embeddings help generalize to unseen but similar words.
4  e_{man} − e_{woman} ≈ e_{uncle} − e_{aunt} (equivalently e_{man} − e_{uncle} ≈ e_{woman} − e_{aunt})  Embeddings capture consistent relation vectors.
5  ✅ False  Direct indexing (C[idx]) is more efficient than multiplying by a one-hot.
6  ✅ True  Skip-gram / CBOW predict surrounding words (context).
7  ✅ False  Context usually includes both sides, not only preceding words.
8  ✅ False  θ_t and e_c are separate parameter sets; they need not end up close.
9  ✅ True  GloVe uses global co-occurrence counts X_{ij}.
10  s_1 ≫ s_2  Pretraining helps most when the source corpus is much larger than the target.