Dimensionality Reduction Adventures with Animal Faces

Summary This assignment guides students from the fundamentals of PCA to hands-on implementation and applied analysis on real-world image data. Students explore PCA notation, implement PCA using SVD, and validate their results against scikit-learn. They then apply PCA to a subset of the Animal Faces dataset, evaluating reconstruction quality and interpreting learned components. A challenging but low-stakes extension invites students to implement a CNN autoencoder, compare it with PCA, and experiment with different design choices.
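
As a concrete illustration of the core workflow, here is a minimal sketch, assuming NumPy toy data rather than the assignment's starter code, of PCA via SVD checked against scikit-learn (principal components are unique only up to sign, so absolute values are compared):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))           # toy data: 200 samples, 50 features
    k = 10                                   # number of principal components

    X_centered = X - X.mean(axis=0)          # PCA operates on centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                      # top-k principal directions (rows)
    Z = X_centered @ components.T            # reduced representation
    X_hat = Z @ components + X.mean(axis=0)  # reconstruction from k components

    pca = PCA(n_components=k)
    Z_sk = pca.fit_transform(X)
    # Components match only up to sign, so compare absolute values.
    assert np.allclose(np.abs(Z), np.abs(Z_sk), atol=1e-6)
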
Topics Dimensionality reduction, Principal Component Analysis (PCA), Singular Value Decomposition (SVD), reconstruction error, explained variance, model interpretation, autoencoders
Audience Introductory AI / Machine Learning (ML) students at the undergraduate or early graduate level, including students from multidisciplinary backgrounds who want to apply ML methods in their domains
Difficulty Moderate. Suitable for students with basic linear algebra and Python knowledge. Core PCA exercises can be completed in 3–5 hours; the optional autoencoder extension is more open-ended.
Strengths
  • Clear bridge from theory → implementation → real data.
  • Highly visual: reconstructions, variance plots, component interpretation.
  • Emphasizes reasoning and interpretation, not just coding.
  • Low-stakes, challenging extension on autoencoders.
  • Adoptable & reproducible: starter code, utilities, autograder support for warm-up exercises, and both an instructor version (with solutions) and a student version (without solutions).
  • Naturally extensible: can be adapted to other datasets and dimensionality reduction methods.
Weaknesses
  • Focuses primarily on PCA. Broader coverage (e.g., NMF, LSA for text data) and comparing and contrasting these methods could enrich it further.
  • Some students may need additional support with linear algebra foundations.
  • This assignment is designed to give a high-level understanding of PCA to a diverse audience so that they can confidently use it in their own applications. The focus is on building intuition for how and when PCA is useful, not on programming or on deeper mathematical details such as the structure of the covariance matrix or why its eigenvectors are meaningful.
Dependencies
  • Prerequisites: linear algebra (matrix multiplication, eigenvectors), fundamentals of PCA, and Python programming.
  • Data: Animal Faces dataset (subset/processed for the assignment).
  • Packages: scikit-learn, NumPy, matplotlib, PyTorch (for the extension).
  • An environment file (environment.yml) is provided for reproducibility.
Variants
  • A valuable extension is to compare different feature representations for downstream tasks such as image clustering or classification. Students can contrast raw flattened pixels, PCA-reduced representations, and features extracted from pretrained CNNs (e.g., ResNet-50, EfficientNet, or Vision Transformers), and reason about why certain representations lead to better performance; a feature-extraction sketch appears after this list.
  • Apply PCA to different datasets and evaluate when it is an appropriate or inappropriate choice. This helps students reflect on PCA's underlying assumptions—such as the linear-subspace model—and understand situations involving nonlinear manifolds where PCA may struggle.
  • Explore PCA variants, such as Robust PCA, Kernel PCA, or randomized PCA.
  • Investigate how many components are "enough". Students can compare different methods for choosing the number of components (explained variance, scree plots, cross-validation for downstream tasks) and analyze how these methods behave in different contexts; see the explained-variance sketch after this list.
  • Use PCA as a preprocessing step for supervised learning models and evaluate how it affects performance. Students can compare models trained on raw data versus PCA-transformed data and reason about which models benefit most from dimensionality reduction and why; the pipeline sketch after this list illustrates the comparison.
  • Use PCA for outlier detection by leveraging reconstruction error. Students can analyze whether the method identifies meaningful anomalies in a real-world dataset and discuss its strengths and limitations; see the reconstruction-error sketch after this list.
  • Another extension is to experiment with autoencoders by varying architectural choices (number of convolutional blocks, bottleneck size, skip connections), training configurations (epochs, batch size, learning rate), and regularization techniques, then evaluating reconstruction quality and downstream utility; a minimal architecture sketch follows this list.
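
For the feature-representation variant above, a minimal sketch, assuming torchvision's pretrained ResNet-50 as the feature extractor (batch and image sizes are illustrative):

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights

    weights = ResNet50_Weights.DEFAULT
    model = resnet50(weights=weights)  # downloads pretrained weights on first use
    model.fc = nn.Identity()           # drop the classification head
    model.eval()

    x = torch.rand(8, 3, 224, 224)     # stand-in batch; real images should go
                                       # through weights.transforms() first
    with torch.no_grad():
        features = model(x)            # 8 x 2048 feature vectors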
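
For the "how many components" variant, a sketch using cumulative explained variance (scikit-learn's digits dataset stands in for the assignment data):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X = load_digits().data                         # 1797 x 64 image data
    cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    k_90 = int(np.searchsorted(cumvar, 0.90)) + 1  # smallest k reaching 90%
    print(f"{k_90} components explain 90% of the variance")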
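
For PCA as a preprocessing step, a sketch comparing raw and PCA-transformed features in a scikit-learn pipeline (the dataset, classifier, and component count are illustrative choices):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)
    pca_pipe = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
    raw_clf = LogisticRegression(max_iter=1000)
    print("PCA features:", cross_val_score(pca_pipe, X, y, cv=5).mean())
    print("raw features:", cross_val_score(raw_clf, X, y, cv=5).mean())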
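
For the outlier-detection variant, a sketch that flags the samples with the largest reconstruction error; the low-rank synthetic data and injected corruptions are assumptions for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20))  # approx. rank-5 data
    X += 0.05 * rng.normal(size=(500, 20))                    # small noise
    X[:5] += 3.0 * rng.normal(size=(5, 20))                   # corrupt five rows

    pca = PCA(n_components=5).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    errors = np.linalg.norm(X - X_hat, axis=1)  # per-sample reconstruction error
    print(np.argsort(errors)[-5:])              # corrupted rows have the largest errors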
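
For the autoencoder extension, a minimal PyTorch sketch; the 3x64x64 input size, layer widths, and bottleneck dimension are assumptions, not the assignment's specification:

    import torch
    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        def __init__(self, bottleneck=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(),
                nn.Linear(32 * 16 * 16, bottleneck),                   # compressed code
            )
            self.decoder = nn.Sequential(
                nn.Linear(bottleneck, 32 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (32, 16, 16)),
                nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = ConvAutoencoder()
    x = torch.rand(8, 3, 64, 64)                 # dummy batch of images in [0, 1]
    loss = nn.functional.mse_loss(model(x), x)   # reconstruction loss
    loss.backward()

Varying the bottleneck width here plays the same role as varying the number of PCA components, which makes the comparison between the two methods direct.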


Where to find the assignment materials

The assignment materials include a student version without solutions; an instructor version with solutions is available to instructors upon request. The repository structure is shown below:

            ├── environment.yml                # Conda environment with required packages
            ├── index.html                     # Summary page for the assignment
            ├── README.md                      # Instructions and overview
            ├── LICENSE.md                     # License information
            └── student/                       # Student version (no solutions)
                ├── data/
                │   └── animal_faces.pkl       # Processed dataset; download from https://github.com/kvarada/EAAI26-PCA-assignment-data/blob/main/animal_faces.pkl.zip
                ├── img/
                ├── model_assignment_PCA.html  # Student notebook (HTML version)
                ├── model_assignment_PCA.ipynb # Student notebook (.ipynb version)
                └── utils.py                   # Helper functions