Architectural scaling
=====================

MaldiDeepKit's default hyperparameters are calibrated for the MaldiAMRKit
default layout: ``bin_width = 3`` over the 2000-20000 Da range, giving
``input_dim = 6000``. Two things can change that layout:

- the user picks a different **bin width** (wider or finer bins), and
- the user **trims** the spectrum to a narrower m/z range, shrinking
  ``input_dim`` independently of ``bin_width``.

These two axes affect different architectural concerns, so MaldiDeepKit
scales each knob with whichever quantity it actually depends on. This
page documents that semantic split and the :meth:`from_spectrum`
factory that applies it.

Why the defaults look the way they do
-------------------------------------

Default conv-kernel width (CNN first layer, ResNet stem) and patch size
(Transformer) balance two forces:

- **Local feature extraction.** A small kernel or patch lets early
  layers capture short-range correlations between adjacent bins without
  collapsing distinct local patterns into a single value.
- **Global feature integration.** Keeping the kernel / patch small
  leaves enough depth (for conv stacks) or token-sequence length (for
  attention and SSM) for deeper components to integrate long-range
  structure.

The specific reference values (``kernel_size=7``, ``patch_size=4``) are
inherited from well-tested image-domain defaults (ResNet-18 stem,
ViT patch).

Empirically, these values also align cleanly with the **peak width
distribution** on DRIAMS spectra at ``bin_width=3``. We measured the
full width at half maximum (FWHM) of all detected peaks on 50 random
spectra per species-drug pair (``scipy.signal.find_peaks`` with
prominence threshold, ``scipy.signal.peak_widths`` at ``rel_height=0.5``):

.. list-table:: Peak FWHM on DRIAMS (Da, ``bin_width=3``)
   :header-rows: 1
   :widths: 30 12 10 10 10 10 10

   * - Pair
     - Spectra
     - P10
     - P25
     - Median
     - P75
     - P90
   * - *S. aureus* / oxacillin
     - 4874
     - 6.6
     - 9.0
     - **11.8**
     - 15.1
     - **21.3**
   * - *E. coli* / ceftriaxone
     - 8080
     - 6.3
     - 8.7
     - **11.8**
     - 15.4
     - **21.5**
   * - *K. pneumoniae* / ceftriaxone
     - 5512
     - 6.3
     - 8.6
     - **11.8**
     - 15.2
     - **21.1**

The distribution is strikingly consistent across the three species: the
median peak is **~12 Da** wide and the 90th-percentile peak is **~21 Da**.
This is *why* the image-domain defaults happen to be a natural fit:

- ``patch_size=4`` (12 Da at ``bin_width=3``) ≈ one median peak per
  token, so each Transformer token captures roughly one peak's worth
  of shape.
- ``kernel_size=7`` (21 Da) ≈ 90th-percentile peak FWHM, so the first
  CNN / ResNet conv fully contains ~90% of peaks along with one or two
  bins of shoulder on either side.

This is **empirical observation, not a priori design**. The defaults
were picked for the local-vs-global balance first; the alignment with
DRIAMS peak widths emerged from the measurement above. Different
instruments, different preprocessing, or different bin widths will
shift the distribution - use :meth:`from_spectrum` to scale.

The semantic split
------------------

When the spectrum layout changes, each architectural knob is driven
by the quantity it physically depends on:

.. list-table::
   :header-rows: 1
   :widths: 28 26 46

   * - Knob
     - Driven by
     - Why
   * - CNN ``kernel_size``
     - ``bin_width``
     - Kernels aggregate per-bin information density. A finer bin grid
       carries less information per bin, so a wider kernel in bins is
       needed to gather the same local structure; a coarser grid calls
       for a smaller kernel.
   * - ResNet ``stem_kernel_size``
     - ``bin_width``
     - Same argument as for the CNN: the first layer's job is to
       aggregate adjacent bin values, which is a per-bin-density
       concern.
   * - Transformer ``patch_size``
     - (none: scale-agnostic)
     - The Transformer uses a learned positional embedding sized to
       whatever token count the patch embedding produces, so any
       ``input_dim`` works with the default ``patch_size=4`` without
       tuning. ``from_spectrum`` just forwards ``input_dim``.

The MLP has no spectral-layout-dependent knob either: its first layer
is a single ``Linear(input_dim, hidden_dim)`` that handles any input
size identically.

Auto-scaling: ``from_spectrum``
-------------------------------

Each classifier exposes a ``from_spectrum`` classmethod that applies
the semantic split automatically:

.. code-block:: python

   Classifier.from_spectrum(bin_width: int, input_dim: int, **overrides)

Both parameters describe the spectrum layout. Each classifier uses the
relevant one internally:

- ``MaldiCNNClassifier`` and ``MaldiResNetClassifier`` scale their conv
  kernel using ``bin_width``.
- ``MaldiTransformerClassifier`` and ``MaldiMLPClassifier`` are
  architecturally scale-agnostic and just record ``input_dim`` for
  shape validation; ``from_spectrum`` forwards overrides untouched.

Any keyword in ``**overrides`` wins over the auto-scaled default.

.. code-block:: python

   from maldideepkit import (
       MaldiCNNClassifier,
       MaldiTransformerClassifier,
   )

   # Reference layout: kernel_size=7, patch_size=4
   cnn = MaldiCNNClassifier.from_spectrum(bin_width=3, input_dim=6000)
   tr = MaldiTransformerClassifier.from_spectrum(bin_width=3, input_dim=6000)

   # Trim the spectrum (bin_width unchanged, input_dim halved).
   # CNN kernel stays at 7 (bin density unchanged).
   # Transformer unchanged (scale-agnostic).
   cnn_trim = MaldiCNNClassifier.from_spectrum(bin_width=3, input_dim=3000)
   tr_trim = MaldiTransformerClassifier.from_spectrum(bin_width=3, input_dim=3000)

   # Coarser binning over the full range.
   # CNN kernel drops to 5 (fewer bins carry more info per bin).
   cnn_coarse = MaldiCNNClassifier.from_spectrum(bin_width=6, input_dim=3000)

   # Fine bins, trimmed.
   # CNN kernel up to 21 (fine bins -> larger kernel to aggregate).
   cnn_fine = MaldiCNNClassifier.from_spectrum(bin_width=1, input_dim=4000)

   # Override the auto-choice explicitly.
   cnn_custom = MaldiCNNClassifier.from_spectrum(
       bin_width=3, input_dim=6000, kernel_size=(11, 7, 5, 3),
   )

Scaling helpers
---------------

The underlying helper is public and lives under
``maldideepkit._bin_scaling``:

- :func:`maldideepkit._bin_scaling.scale_odd_kernel` returns an odd
  kernel size closest to ``reference_kernel * reference_bin_width /
  bin_width``, clamped ``>= 3``. Reference: kernel 7 at
  ``bin_width=3``. Used by CNN and ResNet kernels.

The Transformer's ``patch_size`` is deliberately *not* scaled by
layout: a learned positional embedding sizes itself to whatever token
count the patch embedding produces, so the same ``patch_size=4`` works
across bin widths and trimmed m/z ranges.

Per-block configurability
-------------------------

For reviewer-driven ablations, :class:`~maldideepkit.MaldiCNNClassifier`
accepts a scalar *or* a per-block tuple for both ``kernel_size`` and
``pool_size``:

.. code-block:: python

   # Scalar: broadcast to every block
   MaldiCNNClassifier(channels=(32, 64, 128, 128), kernel_size=7)

   # Tuple: per-block progression
   MaldiCNNClassifier(
       channels=(32, 64, 128, 128),
       kernel_size=(11, 7, 5, 3),
       pool_size=(2, 2, 2, 4),
   )

:class:`~maldideepkit.MaldiResNetClassifier` exposes
``stem_kernel_size``, ``stem_stride``, ``block_kernel_size``, and
``use_stem_pool`` as explicit keyword arguments. Defaults are
peak-friendly for MALDI-TOF (``stem_stride=1``, ``block_kernel_size=7``,
``use_stem_pool=False``) and deviate from the literal ResNet-18
backbone. To reproduce the literal configuration, set
``stem_stride=2``, ``block_kernel_size=3``, ``use_stem_pool=True``.

Input-dim sensitivity of the flat dense head
--------------------------------------------

Only :class:`~maldideepkit.MaldiCNNClassifier` has a flat dense head
whose input width scales linearly with ``input_dim``. Given four
``pool_size=2`` blocks and ``channels[-1]=128``, the flat head width is
``128 * input_dim / 16``. For ``input_dim=6000`` this is already
48,000 units; at ``input_dim=18000`` it grows to 144,000. If this
matters, prefer :class:`~maldideepkit.MaldiResNetClassifier` or
:class:`~maldideepkit.MaldiTransformerClassifier` - both use a pooled
head whose width is independent of ``input_dim``.