9 Training Deep Networks
In Chapter 3, we briefly introduced optimizers and learning rates. Now let’s dive deep into optimization strategies that make or break deep learning models.
9.1 Optimizers: Beyond Basic SGD
9.1.1 1. SGD (Stochastic Gradient Descent)
Basic: weights = weights - learning_rate * gradient
Problems: a fixed learning rate, slow convergence, and a tendency to get stuck in local minima
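To make the update rule concrete, here is roughly what a single plain-SGD step does to each parameter in PyTorch (a minimal sketch; model is assumed to exist and loss.backward() to have been called already):
import torch
learning_rate = 0.01
# Conceptually, one SGD step updates every parameter in place:
with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad  # weights = weights - learning_rate * gradient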
9.1.2 2. SGD with Momentum
Adds “momentum” from previous updates, which helps escape local minima:
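Update rule (one common formulation; frameworks differ slightly in the details):
velocity = momentum * velocity - learning_rate * gradient
weights = weights + velocity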
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
9.1.3 3. Adam (Adaptive Moment Estimation)
Most popular! Adapts learning rate for each parameter.
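Under the hood, Adam keeps exponential moving averages of the gradient and of the squared gradient and uses them to scale each parameter's step. Sketched as pseudocode (g is the current gradient, t the step count, eps a small constant for numerical stability):
# Pseudocode
m = beta1 * m + (1 - beta1) * g         # running mean of gradients
v = beta2 * v + (1 - beta2) * g**2      # running mean of squared gradients
m_hat = m / (1 - beta1**t)              # bias correction for the zero initialization
v_hat = v / (1 - beta2**t)
weights = weights - lr * m_hat / (v_hat**0.5 + eps)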
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
optimizer = keras.optimizers.Adam(learning_rate=0.001)
9.1.4 4. AdamW (Adam with Weight Decay)
Adam + better regularization: weight decay is applied directly to the weights (decoupled from the gradient update):
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
optimizer = keras.optimizers.experimental.AdamW(learning_rate=0.001, weight_decay=0.01)
9.2 Learning Rate Schedules
The learning rate is the most important hyperparameter. Too high = unstable, too low = slow.
Strategy: Start high, decrease over time.
9.2.1 1. Step Decay
import torch.optim.lr_scheduler as lr_scheduler
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# In training loop:
# for epoch in range(epochs):
#     train(...)
#     scheduler.step()  # decay LR every 10 epochs
initial_lr = 0.001
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_lr,
    decay_steps=1000,
    decay_rate=0.9,
    staircase=True  # decay in discrete steps, like StepLR
)
optimizer = keras.optimizers.Adam(lr_schedule)
9.2.2 2. Cosine Annealing
Smooth decrease following cosine curve:
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=1000
)
9.2.3 3. Learning Rate Warmup
Start with very low LR, gradually increase:
# Pseudocode
if epoch < warmup_epochs:
    lr = initial_lr * (epoch / warmup_epochs)
else:
    lr = initial_lr * decay_schedule
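A runnable version of this idea in PyTorch can be built with lr_scheduler.LambdaLR, which scales the base learning rate by whatever multiplier the supplied function returns. This is a minimal sketch under assumed values for warmup_epochs and total_epochs, combining linear warmup with the cosine decay from the previous subsection; optimizer is the one defined earlier:
import math
import torch.optim.lr_scheduler as lr_scheduler

warmup_epochs, total_epochs = 5, 50

def warmup_then_cosine(epoch):
    # LambdaLR multiplies the base LR by the value returned here
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs               # linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1 + math.cos(math.pi * progress))      # cosine decay

scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_cosine)
# Call scheduler.step() once per epoch, as in the StepLR example above.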
9.3 Batch Size Impact
| Batch Size | Memory | Speed | Generalization |
|---|---|---|---|
| Small (8-32) | Low | Slow | Better |
| Medium (64-128) | Moderate | Fast | Good |
| Large (256+) | High | Fastest | Often worse |
Smart strategy:
- Start with the largest batch size your GPU can handle
- If overfitting, reduce batch size
- If underfitting, increase batch size (if possible)
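If the batch size you want does not fit in GPU memory, gradient accumulation simulates a larger effective batch by summing gradients over several small batches before calling optimizer.step(). A minimal sketch, assuming model, optimizer, criterion, and train_loader as in the full training loop later in this chapter; accumulation_steps is an illustrative value:
accumulation_steps = 4  # effective batch size = DataLoader batch size * 4

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps  # keep the gradient scale comparable
    loss.backward()                                          # gradients accumulate until zero_grad()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()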
9.4 Gradient Clipping
Prevents exploding gradients (common in RNNs):
import torch.nn.utils as nn_utils
# In training loop:
# loss.backward()
# nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()
print("Gradient clipping: max_norm=1.0")optimizer = keras.optimizers.Adam(clipnorm=1.0)
print("Gradient clipping: clipnorm=1.0")9.5 Mixed Precision Training
Use float16 instead of float32 for 2-3x speedup:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
# In training loop:
# with autocast():
#     output = model(inputs)
#     loss = criterion(output, targets)
#
# scaler.scale(loss).backward()
# scaler.step(optimizer)
# scaler.update()
print("✅ Mixed precision training enabled")
print("Speedup: 2-3x on modern GPUs")from tensorflow.keras import mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
print("✅ Mixed precision training enabled")
print("Speedup: 2-3x on modern GPUs")9.6 Complete Training Loop with Best Practices
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
import torch.nn.utils as nn_utils
# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Optimizer with weight decay
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# Learning rate scheduler
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
# Loss function
criterion = nn.CrossEntropyLoss()
# Training loop
best_val_loss = float('inf')
patience = 5
patience_counter = 0
for epoch in range(50):
    # Training
    model.train()
    train_loss = 0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        # Gradient clipping
        nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)
    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            val_loss += loss.item()
    val_loss /= len(val_loader)
    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
    # LR scheduling
    scheduler.step()
    print(f"Epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}, lr={scheduler.get_last_lr()[0]:.6f}")
# Callbacks for best practices
callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True)
]
# Compile with optimizer
optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train with callbacks
history = model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=50,
    validation_split=0.2,
    callbacks=callbacks
)
9.7 Summary
Optimizer choices:
- Adam/AdamW: Default choice, works 90% of the time
- SGD+Momentum: Sometimes better final accuracy, needs tuning
Learning rate:
- Start: 0.001 (Adam) or 0.01 (SGD)
- Use scheduling: StepLR or CosineAnnealing
- Warmup for transformers
Batch size:
- GPU: 32-128 (depends on memory)
- CPU: 8-32
Advanced:
- Gradient clipping for RNNs
- Mixed precision for 2-3x speedup
- Early stopping to prevent overfitting
9.8 What’s Next?
Chapter 10: Regularization techniques to prevent overfitting!