9  Training Deep Networks

In Chapter 3, we briefly introduced optimizers and learning rates. Now let’s dive deeper into the optimization strategies that make or break deep learning models.

9.1 Optimizers: Beyond Basic SGD

9.1.1 SGD (Stochastic Gradient Descent)

The basic update rule: weights = weights - learning_rate * gradient

Problems: the learning rate is fixed for every parameter, convergence is slow, and training can get stuck in local minima.
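
As a quick point of reference, here is a minimal PyTorch sketch of plain SGD; the tiny model and random data are placeholders, not from the text:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)  # plain SGD, no momentum
criterion = nn.CrossEntropyLoss()

inputs, targets = torch.randn(8, 10), torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()  # weights <- weights - learning_rate * gradient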

9.1.2 SGD with Momentum

Adds “momentum” accumulated from previous updates, which smooths the update direction and helps escape local minima:

# PyTorch
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Keras
from tensorflow import keras
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

9.1.3 Adam (Adaptive Moment Estimation)

The most popular choice. Adam adapts the learning rate for each parameter individually, using running estimates of the first and second moments of the gradients.

# PyTorch
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# Keras
optimizer = keras.optimizers.Adam(learning_rate=0.001)

9.1.4 AdamW (Adam with Weight Decay)

Adam with decoupled weight decay, which regularizes more effectively than applying L2 weight decay inside Adam’s adaptive update:

# PyTorch
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# Keras (keras.optimizers.experimental.AdamW on older TF releases)
optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)

9.2 Learning Rate Schedules

The learning rate is usually the most important hyperparameter: too high and training becomes unstable, too low and convergence is painfully slow.

A common strategy: start relatively high and decrease over time.

9.2.1 Step Decay

# PyTorch: StepLR multiplies the LR by gamma every step_size epochs
import torch.optim.lr_scheduler as lr_scheduler

optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# In training loop:
# for epoch in range(epochs):
#     train(...)
#     scheduler.step()  # Decay LR every 10 epochs

# Keras: ExponentialDecay with staircase=True applies the decay in discrete steps
initial_lr = 0.001
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_lr,
    decay_steps=1000,
    decay_rate=0.9,
    staircase=True
)
optimizer = keras.optimizers.Adam(lr_schedule)

9.2.2 Cosine Annealing

A smooth decrease following a cosine curve:

# PyTorch
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)

# Keras
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=1000
)

9.2.3 Learning Rate Warmup

Start with a very low learning rate and gradually increase it to the target value over the first few epochs; this stabilizes early training, especially for transformers:

# Pseudocode
if epoch < warmup_epochs:
    lr = initial_lr * (epoch / warmup_epochs)
else:
    lr = initial_lr * decay_schedule
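
A runnable version of this idea uses PyTorch’s LambdaLR to combine linear warmup with cosine decay; warmup_epochs, total_epochs, and the placeholder model below are illustrative assumptions:

import math
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

warmup_epochs = 5    # illustrative
total_epochs = 50    # illustrative

def warmup_then_cosine(epoch):
    # Linear warmup over the first warmup_epochs...
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    # ...then cosine decay toward zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = nn.Linear(10, 2)  # placeholder model
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_cosine)
# Call scheduler.step() once per epoch, after the optimizer updates.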

9.3 Batch Size Impact

| Batch Size      | Memory   | Speed   | Generalization |
|-----------------|----------|---------|----------------|
| Small (8-32)    | Low      | Slow    | Better         |
| Medium (64-128) | Moderate | Fast    | Good           |
| Large (256+)    | High     | Fastest | May overfit    |

Smart strategy:

- Start with the largest batch size your GPU can handle
- If overfitting, reduce the batch size
- If underfitting, increase the batch size (if memory allows)
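
In code, the batch size is simply a data-loader setting. A minimal PyTorch sketch, where the random dataset is a placeholder assumption:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1,000 samples, 10 features, binary labels
train_dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# Start large and adjust: smaller if overfitting, larger if underfitting
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)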

9.4 Gradient Clipping

Prevents exploding gradients (common in RNNs):

# PyTorch: clip the global gradient norm before the optimizer step
import torch.nn.utils as nn_utils

# In training loop:
# loss.backward()
# nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()

# Keras: pass clipnorm directly to the optimizer
optimizer = keras.optimizers.Adam(clipnorm=1.0)

9.5 Mixed Precision Training

Use float16 arithmetic in place of float32 where it is numerically safe, which typically gives a 2-3x speedup on modern GPUs:

# PyTorch: automatic mixed precision with autocast and a gradient scaler
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

# In training loop:
# with autocast():
#     output = model(inputs)
#     loss = criterion(output, targets)
#
# scaler.scale(loss).backward()   # scale the loss to avoid float16 underflow
# scaler.step(optimizer)
# scaler.update()

# Keras: set a global mixed-precision policy before building the model
from tensorflow.keras import mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

9.6 Complete Training Loop with Best Practices

The PyTorch loop below combines the techniques from this chapter: AdamW with weight decay, cosine annealing, gradient clipping, and early stopping. The Keras version that follows achieves the same with callbacks.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
import torch.nn.utils as nn_utils

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Optimizer with weight decay
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate scheduler
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Loss function
criterion = nn.CrossEntropyLoss()

# Training loop
best_val_loss = float('inf')
patience = 5
patience_counter = 0

for epoch in range(50):
    # Training
    model.train()
    train_loss = 0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()

        # Gradient clipping
        nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            val_loss += loss.item()

    train_loss /= len(train_loader)
    val_loss /= len(val_loader)

    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

    # LR scheduling
    scheduler.step()

    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}, lr={scheduler.get_last_lr()[0]:.6f}")

# --- Keras equivalent: callbacks for early stopping, LR reduction, and checkpointing ---
callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True)
]

# Compile with optimizer
optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train with callbacks
history = model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=50,
    validation_split=0.2,
    callbacks=callbacks
)

9.7 Summary

Optimizer choices:

- Adam/AdamW: default choice, works well ~90% of the time
- SGD + momentum: sometimes better final accuracy, needs more tuning

Learning rate:

- Start around 0.001 (Adam) or 0.01 (SGD)
- Use a schedule such as StepLR or cosine annealing
- Add warmup for transformers

Batch size:

- GPU: 32-128 (depends on memory)
- CPU: 8-32

Advanced:

- Gradient clipping for RNNs
- Mixed precision for a 2-3x speedup
- Early stopping to prevent overfitting

9.8 What’s Next?

Chapter 10: Regularization techniques to prevent overfitting!