import matplotlib.pyplot as plt
train_losses = [0.8, 0.5, 0.3, 0.2, 0.15, 0.12, 0.10]
val_losses = [0.7, 0.45, 0.35, 0.4, 0.45, 0.48, 0.50]
plt.figure(figsize=(10, 4))
plt.plot(train_losses, label='Training Loss', marker='o')
plt.plot(val_losses, label='Validation Loss', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("⚠️ Overfitting detected!")
print("Training loss decreases but validation loss increases after epoch 2")

11 Model Evaluation & Debugging
11.1 Learning Curves
Learning curves plot training and validation metrics over time. They reveal whether your model is overfitting, underfitting, or fitting just right.
import matplotlib.pyplot as plt
# After training
# history = model.fit(...)
# Plot learning curves
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

11.2 Diagnosing Problems
11.2.1 Underfitting
Symptoms:
- Low training accuracy (< 80%)
- Training and validation accuracy both low

Solutions:
- Increase model capacity (more layers/neurons)
- Train longer
- Reduce regularization
11.2.2 Overfitting
Symptoms:
- High training accuracy (> 95%)
- Low validation accuracy (< 80%)
- Large gap between train and val

Solutions:
- Add regularization (dropout, weight decay)
- Get more data
- Data augmentation
- Early stopping
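One of the listed fixes, early stopping, needs no framework support; here is a minimal sketch of the patience logic (the same idea behind `keras.callbacks.EarlyStopping`):

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch at which training should stop: the first epoch
    where validation loss has not improved for `patience` epochs."""
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop; the best weights were at best_epoch
    return len(val_losses) - 1

# Validation losses from the curve at the start of this chapter:
print(early_stopping_epoch([0.7, 0.45, 0.35, 0.4, 0.45, 0.48, 0.50]))  # -> 4
```

In practice you would also restore the weights saved at the best epoch, which is what checkpointing (Section 11.5) is for.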
11.2.3 Just Right
Symptoms:
- High training accuracy (~95%)
- High validation accuracy (~92%)
- Small gap between train and val
Action: You’re done! 🎉
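The three diagnoses above can be collapsed into a quick heuristic check. The thresholds below are the rough guideline numbers from this section, not universal constants:

```python
def diagnose(train_acc, val_acc):
    """Rough fit diagnosis from final train/validation accuracy."""
    if train_acc < 0.80:
        return "underfitting"   # both accuracies low
    if train_acc - val_acc > 0.10:
        return "overfitting"    # large train/val gap
    return "just right"

print(diagnose(0.99, 0.75))  # -> overfitting
print(diagnose(0.70, 0.68))  # -> underfitting
print(diagnose(0.95, 0.92))  # -> just right
```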
11.3 TensorBoard Visualization
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/experiment_1')
# In training loop:
# writer.add_scalar('Loss/train', train_loss, epoch)
# writer.add_scalar('Loss/val', val_loss, epoch)
# writer.add_scalar('Accuracy/train', train_acc, epoch)
# writer.close()
# View in terminal: tensorboard --logdir=runs
print("✅ TensorBoard logging setup")
print("Run: tensorboard --logdir=runs")

from tensorflow import keras

tensorboard_callback = keras.callbacks.TensorBoard(log_dir='logs/')
# model.fit(..., callbacks=[tensorboard_callback])
# View in terminal: tensorboard --logdir=logs
print("✅ TensorBoard logging setup")
print("Run: tensorboard --logdir=logs")

11.4 Confusion Matrix
For classification, see where your model makes mistakes:
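Row i of a confusion matrix counts samples whose true class is i, so the diagonal entry divided by the row sum gives recall, and divided by the column sum gives precision. A pure-Python sketch of that arithmetic (sklearn's `classification_report` reports the same numbers):

```python
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

# cm[i][j] = count of samples with true class i predicted as class j
n_classes = 3
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

for c in range(n_classes):
    recall = cm[c][c] / sum(cm[c])                    # diagonal / row sum
    precision = cm[c][c] / sum(row[c] for row in cm)  # diagonal / column sum
    print(f"class {c}: precision={precision:.2f}, recall={recall:.2f}")
```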
from sklearn.metrics import confusion_matrix
import seaborn as sns
# After inference
# y_true = ...
# y_pred = ...
# Example data
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

from sklearn.metrics import confusion_matrix
import seaborn as sns
# After inference
# y_pred = model.predict(x_test)
# y_pred_classes = np.argmax(y_pred, axis=1)
# Example data
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred_classes = [0, 2, 2, 2, 1, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred_classes)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

11.5 Model Checkpointing
Save best model during training:
# In training loop
if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': val_loss,
    }, 'checkpoint.pth')
    print(f"✅ Model saved at epoch {epoch}")

# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

checkpoint_callback = keras.callbacks.ModelCheckpoint(
'best_model.h5',
monitor='val_loss',
save_best_only=True,
verbose=1
)
# model.fit(..., callbacks=[checkpoint_callback])
# Load checkpoint
model = keras.models.load_model('best_model.h5')

11.6 Hyperparameter Tuning
Key hyperparameters to tune:
1. Learning rate (most important!)
2. Batch size
3. Number of layers
4. Neurons per layer
5. Dropout rate
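Why the learning rate tops the list: even on a toy objective, too high diverges and too low barely moves. A sketch minimizing f(w) = w² with plain gradient descent (a stand-in for a real training loss):

```python
def sgd_final_loss(lr, steps=50):
    """Minimize f(w) = w**2 by gradient descent from w = 1."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return w * w

for lr in [1.1, 0.4, 0.01]:
    print(f"lr={lr}: final loss {sgd_final_loss(lr):.2e}")
```

With lr=1.1 the updates overshoot and the loss explodes; lr=0.4 converges rapidly; lr=0.01 is stable but far from the minimum after 50 steps.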
Simple grid search:
learning_rates = [0.1, 0.01, 0.001, 0.0001]
batch_sizes = [16, 32, 64, 128]

for lr in learning_rates:
    for bs in batch_sizes:
        train_model(lr=lr, batch_size=bs)
        # Track results

11.7 Common Issues & Fixes
| Problem | Symptom | Solution |
|---|---|---|
| NaN Loss | Loss becomes NaN | Lower learning rate, check data |
| Exploding Gradients | Loss spikes | Gradient clipping, lower LR |
| Slow Convergence | Loss plateaus early | Increase LR, check data normalization |
| No Learning | Loss doesn’t change | Check loss function, verify data flow |
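The gradient-clipping fix from the table is just rescaling by the global norm. Real code would call the framework's built-in (e.g. `torch.nn.utils.clip_grad_norm_`), but the arithmetic fits in a few lines:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print([round(g, 6) for g in clipped])  # -> [0.6, 0.8]
```

Clipping caps the size of each update step, which tames the loss spikes described above without changing the gradient's direction.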
11.8 Summary
- Learning curves diagnose overfitting/underfitting
- TensorBoard visualizes training in real-time
- Confusion matrix shows classification errors
- Checkpointing saves best models
- Hyperparameter tuning improves performance
11.9 What’s Next?
Chapter 12: Building real-world projects and deploying models!