8  Working with Text Data

Text needs to be converted to numbers before neural networks can process it. This chapter covers tokenization, embeddings, and building text classification pipelines.

8.1 Text Preprocessing Pipeline

  1. Tokenization: Split text into words/subwords
  2. Vocabulary: Map each token to a unique integer
  3. Padding: Make all sequences the same length
  4. Embedding: Convert integers to dense vectors
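
A minimal end-to-end sketch of these four steps in PyTorch, using a toy whitespace tokenizer and a hand-built vocabulary (real pipelines use the tokenizers shown in the next section):

import torch
import torch.nn as nn

texts = ["deep learning is awesome", "i love nlp"]

# 1. Tokenization: naive whitespace split
tokenized = [t.split() for t in texts]

# 2. Vocabulary: map each token to a unique integer (0 reserved for padding)
vocab = {"<pad>": 0}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
indexed = [[vocab[tok] for tok in tokens] for tokens in tokenized]

# 3. Padding: pad every sequence to the length of the longest one
max_len = max(len(seq) for seq in indexed)
padded = [seq + [0] * (max_len - len(seq)) for seq in indexed]
batch = torch.tensor(padded)                      # shape: [2, max_len]

# 4. Embedding: convert integers to dense vectors
embedding = nn.Embedding(len(vocab), 8, padding_idx=0)
print(embedding(batch).shape)                     # [2, max_len, 8]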

8.2 Tokenization

PyTorch (torchtext):

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")

text = "Deep learning is awesome! It's changing the world."
tokens = tokenizer(text.lower())

print(f"Original: {text}")
print(f"Tokens: {tokens}")
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "Deep learning is awesome",
    "It's changing the world",
    "I love machine learning"
]

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)

print("Word index:", list(tokenizer.word_index.items())[:10])
print("\nSequences:")
for text, seq in zip(texts, sequences):
    print(f"{text}{seq}")

8.3 Word Embeddings

Word embeddings convert words to dense vectors that capture semantic meaning. Words with similar meanings have similar vectors.

Word2Vec example:

"king" - "man" + "woman" ≈ "queen"
"paris" - "france" + "japan" ≈ "tokyo"

PyTorch:

import torch
import torch.nn as nn

# Embedding layer
vocab_size = 10000
embedding_dim = 128

embedding = nn.Embedding(vocab_size, embedding_dim)

# Example: embed a sentence
sentence_indices = torch.tensor([45, 123, 789, 23])  # Word indices
embedded = embedding(sentence_indices)

print(f"Input indices: {sentence_indices}")
print(f"Embedded shape: {embedded.shape}")  # [4, 128]
print(f"First word vector (first 10 dims): {embedded[0, :10]}")
from tensorflow.keras import layers

vocab_size = 10000
embedding_dim = 128

embedding_layer = layers.Embedding(vocab_size, embedding_dim)

# Example: embed a sentence
sentence_indices = tf.constant([[45, 123, 789, 23]])  # Batch of 1 sentence
embedded = embedding_layer(sentence_indices)

print(f"Input indices: {sentence_indices}")
print(f"Embedded shape: {embedded.shape}")  # [1, 4, 128]
print(f"First word vector (first 10 dims): {embedded[0, 0, :10]}")

8.4 Complete Text Classification Pipeline

PyTorch:

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, 128, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, _) = self.lstm(embedded)
        return self.fc(hidden[-1])

model = TextClassifier(vocab_size=10000, embedding_dim=128, num_classes=3)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

TensorFlow/Keras:

from tensorflow import keras

model = keras.Sequential([
    layers.Embedding(10000, 128, input_length=100),
    layers.LSTM(128),
    layers.Dense(3, activation='softmax')
])

model.summary()
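
To train this model it still needs a loss and an optimizer; a minimal sketch (hyperparameters are illustrative, and padded_sequences/labels are placeholders for your prepared data):

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer class labels
              metrics=['accuracy'])
# model.fit(padded_sequences, labels, epochs=5, batch_size=32)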

8.5 Using Pre-trained Embeddings (GloVe)

Pre-trained embeddings like GloVe or Word2Vec capture general language semantics from massive corpora.

PyTorch:

# Load GloVe embeddings (conceptual example)
# Download from: https://nlp.stanford.edu/projects/glove/

def load_glove_embeddings(glove_file, word_to_idx, embedding_dim=100):
    embeddings = torch.randn(len(word_to_idx), embedding_dim)
    # Load from file and populate embeddings matrix
    # (Implementation details omitted for brevity)
    return embeddings

# Use pre-trained embeddings
# embedding.weight.data = load_glove_embeddings(...)
# embedding.weight.requires_grad = False  # Freeze if desired

print("✅ Pre-trained embeddings loaded (conceptual)")
import numpy as np

# Load GloVe embeddings (conceptual example)
def load_glove_embeddings(glove_file, tokenizer, embedding_dim=100):
    embeddings_index = {}
    # Load GloVe file (implementation omitted)

    # Create embedding matrix
    vocab_size = len(tokenizer.word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, i in tokenizer.word_index.items():
        if word in embeddings_index:
            embedding_matrix[i] = embeddings_index[word]

    return embedding_matrix

# Use pre-trained embeddings
# embedding_matrix = load_glove_embeddings(...)
# embedding_layer = layers.Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)

print("✅ Pre-trained embeddings loaded (conceptual)")

8.6 Text Augmentation Techniques

Common techniques:

  • Synonym replacement: “happy” → “joyful”
  • Random insertion: Insert random words at random positions
  • Random swap: Swap the positions of two words
  • Back-translation: Translate to another language and back
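
A minimal sketch of two of these techniques in plain Python. The synonym table is a hypothetical placeholder; in practice a thesaurus such as WordNet or an augmentation library is used:

import random

SYNONYMS = {"happy": ["joyful", "glad"], "awesome": ["great", "amazing"]}  # toy placeholder

def synonym_replacement(tokens, p=0.3):
    """Replace each token with a random synonym with probability p."""
    return [random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p else t
            for t in tokens]

def random_swap(tokens, n_swaps=1):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

tokens = "deep learning is awesome".split()
print(synonym_replacement(tokens))
print(random_swap(tokens))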

8.7 Summary

  • Tokenization converts text to sequences of integers
  • Embeddings map integers to dense semantic vectors
  • Pre-trained embeddings (GloVe, Word2Vec) transfer language knowledge
  • Complete pipeline: tokenize → embed → LSTM → classify

8.8 What’s Next?

Chapter 9: Training Deep Networks - optimizers, learning rates, and advanced training techniques!