Irregular Time-Series Papers

Time Series as Images: Vision Transformer for Irregularly Sampled Time Series (Code)

There are numerous algorithms (LSTM, TCN, Transformer) for time series modeling, but they are designed mainly for regular intervals and fixed-size numerical inputs. Models have been developed for irregular time series, but they are highly specialized, requiring substantial prior knowledge and effort in model architecture selection and algorithm design.

  • Main idea:
    • Transform irregularly sampled multivariate time series into line graphs
    • Organize them into an RGB image format
    • Fine-tune a pre-trained vision transformer for classification using those images
  • Outperforms SoTA methods specifically designed for irregularly sampled time series.
  • Has strong robustness to missing observations. Surpasses the previous leading solution by 42% in absolute F1 score when half of the variables are masked in the test set.

  • Approach
    • Transform multivariate time series into a concatenated line graph image
    • Use a pre-trained vision transformer as an image classifier.

Data = $\{(S_i, y_i) \mid i = 1, \ldots, N\}$, where $N$ is the number of samples, $S_i$ is a data sample (it can contain at most $D$ types of observations, some of which may have no observations at all), and $y_i \in \{1, \ldots, C\}$, where $C$ is the number of classes.
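
A minimal sketch of how one such sample could be represented in Python; the field names are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class IrregularSample:
    """One sample S_i with up to D variables, each observed at its own times.
    Field names are illustrative, not from the paper."""
    times: dict[int, list[float]]   # variable index d -> observation timestamps
    values: dict[int, list[float]]  # variable index d -> observed values
    label: int                      # y_i in {1, ..., C}

# Variable 0 has three observations, variable 2 has one, and variable 1 has
# none at all -- the setting explicitly allows variables with no observations.
sample = IrregularSample(
    times={0: [0.0, 1.3, 4.2], 2: [2.1]},
    values={0: [36.6, 37.1, 38.0], 2: [80.0]},
    label=1,
)
```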

  • Image creation
    • Plot the line graph for each variable (The scales of each line graph $g_{i,d}$ are kept the same across different time series $S_i$)
    • Used a grid size of $l \times l$ or $l \times (l+1)$, depending on the maximum number of variables present
    • Any grid not occupied by a line graph is kept empty
    • They studied the effects of different markers, line thicknesses, line types, the order of variables in the grid, and the colors used to represent the lines of different variables (a rendering sketch follows this list)
  • Vision Transformer for Time series modeling
    • Vision Transformer (ViT) - originally adapted from NLP
    • Input image is split into fixed-sized patches, and each patch is linearly embedded and augmented with position embedding.
    • The paper used Swin Transformer to reduce computational complexity.
    • They note that the Swin Transformer transfers knowledge obtained from pre-training on natural images to their synthetic time-series line graphs: performance dropped drastically without pre-training.
  • Tasks:
    • Sepsis prediction on the P19 dataset
    • Mortality prediction on the P12 dataset
    • Physical activity classification on the PAM dataset
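
A minimal sketch of the line-graph rendering, assuming matplotlib; the grid layout, figure size, and styling here are illustrative rather than the paper's exact settings:

```python
import math
import matplotlib.pyplot as plt

def series_to_image(times, values, path="ts_image.png"):
    """Plot each variable's irregular observations as a line graph and tile
    the graphs into one RGB image; cells without a variable stay empty."""
    D = len(times)                                  # number of variables
    rows = math.floor(math.sqrt(D)) or 1
    cols = math.ceil(D / rows)                      # l x l or l x (l+1) grid
    fig, axes = plt.subplots(rows, cols, figsize=(4, 4), squeeze=False)
    for d, ax in enumerate(axes.flat):
        ax.set_axis_off()                           # keep only the line graph
        if d < D:
            ax.plot(times[d], values[d], marker="*", linewidth=1)
    fig.savefig(path, dpi=56)                       # 4 in x 56 dpi = 224 px
    plt.close(fig)
```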

Static features such as demographic information and weight were first converted to natural-language sentences, encoded using a text encoder (RoBERTa-base), and then concatenated with the image embeddings obtained from the vision transformer to perform classification.
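
A sketch of that static-feature pathway with Hugging Face `transformers`; the sentence template and the fusion head here are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_encoder = AutoModel.from_pretrained("roberta-base")

def encode_static(age, gender, weight):
    # Template is illustrative; the paper's exact phrasing may differ.
    sentence = f"The patient is a {age}-year-old {gender} weighing {weight} kg."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**inputs)
    return out.last_hidden_state[:, 0]      # <s> token embedding, shape (1, 768)

# Fuse with the image embedding produced by the (Swin) vision transformer.
image_emb = torch.randn(1, 768)             # placeholder for the ViT output
fused = torch.cat([image_emb, encode_static(63, "female", 72)], dim=-1)
logits = torch.nn.Linear(fused.shape[-1], 2)(fused)  # e.g. C = 2 for mortality
```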


ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling (Code) (Summary)

  • Main contributions:
    • Incorporates a continuous-time mechanism into the Transformer's attention calculation, capturing the continuity of the underlying system behind the irregularly sampled time-series data.
    • Proposes a reparameterization technique that allows the continuous-time attention over different time ranges to be executed in parallel.
    • Shows that many existing Transformer variants can be recovered as special cases of ContiFormer, making it a general framework.
    • Outperforms existing models in time-series interpolation, classification, and prediction.
  • How?
    • Takes as input an irregular time series $X$ and the sampling times $\omega$, and outputs a latent continuous trajectory that captures the dynamic change of the underlying system.

Continuous-Time Multi-Head Attention Mechanism

  • Transform the input irregular time series $X$ into queries, keys, and values: $Q = [Q_1; Q_2; \ldots; Q_N]$ (queries), $K = [K_1; K_2; \ldots; K_N]$ (keys), $V = [V_1; V_2; \ldots; V_N]$ (values)
  • Use ODEs to define latent trajectories for each observation.
  • Define keys and values as:
\[k_i(t_i) = K_i, \qquad k_i(t) = k_i(t_i) + \int_{t_i}^{t} f\left( \tau, k_i(\tau); \theta_k \right) d\tau \tag{1}\] \[v_i(t_i) = V_i, \qquad v_i(t) = v_i(t_i) + \int_{t_i}^{t} f\left( \tau, v_i(\tau); \theta_v \right) d\tau\]

where $ t \in [t_1, t_N] $, and $ k_i(\cdot), v_i(\cdot) \in \mathbb{R}^d $ represent ordinary differential equations (ODEs) for the $ i $-th observation with parameters $ \theta_k $ and $ \theta_v $. The function $ f(\cdot): \mathbb{R}^{d+1} \rightarrow \mathbb{R}^d $ controls the change in dynamics.
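
A toy numerical sketch of equation (1), using forward Euler with a hand-picked linear vector field standing in for the learned $f(\cdot; \theta_k)$ (the paper solves these ODEs with proper solvers):

```python
import numpy as np

def evolve_key(K_i, t_i, t, f, n_steps=100):
    """Forward-Euler solve of eq. (1):
    k_i(t) = K_i + integral_{t_i}^{t} f(tau, k_i(tau)) dtau."""
    k, tau = np.array(K_i, dtype=float), t_i
    h = (t - t_i) / n_steps
    for _ in range(n_steps):
        k = k + h * f(tau, k)
        tau += h
    return k

# Toy linear dynamics in place of the learned network f(.; theta_k).
A = -0.1 * np.eye(4)
k_t = evolve_key(np.ones(4), t_i=0.0, t=2.0, f=lambda tau, k: A @ k)
```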

  • A closed-form continuous-time interpolation function with knots at $ t_1, \ldots, t_N $, satisfying $q(t_i) = Q_i$, is used as an approximation of the underlying query process.

  • Scaled Dot Product

\[\alpha_i(t) = \frac{1}{t - t_i} \int_{t_i}^{t} q(\tau) \cdot k_i(\tau)^\top d\tau \tag{3}\]
  • Expected Value
\[\hat{v}_i(t) = \mathbb{E}_{\tau \sim [t_i, t]}[v_i(\tau)] = \frac{1}{t - t_i} \int_{t_i}^{t} v_i(\tau) \, d\tau \tag{5}\]
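
Both quantities are time averages over $[t_i, t]$, so once $q$, $k_i$, $v_i$ are available as callables they can be approximated with any quadrature rule; a sketch on a uniform grid, where the averages reduce to sample means:

```python
import numpy as np

def alpha_and_vhat(q, k_i, v_i, t_i, t, n=64):
    """Approximate eq. (3) and eq. (5). Both are time averages over [t_i, t],
    so on a uniform grid of tau values they reduce to sample means."""
    taus = np.linspace(t_i, t, n)
    q_vals = np.stack([q(tau) for tau in taus])    # (n, d)
    k_vals = np.stack([k_i(tau) for tau in taus])  # (n, d)
    v_vals = np.stack([v_i(tau) for tau in taus])  # (n, d)
    alpha = np.mean((q_vals * k_vals).sum(axis=-1))   # eq. (3)
    v_hat = v_vals.mean(axis=0)                       # eq. (5)
    return alpha, v_hat
```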

Multi-Head Attention

Given the predefined queries, keys, and values in continuous-time space, the continuous-time attention at query time $ t $ is:

\[\text{CT-ATTN}(Q, K, V, \omega)(t) = \sum_{i=1}^{N} \hat{\alpha}_i(t) \cdot \hat{v}_i(t)\]

Where:

\[\hat{\alpha}_i(t) = \frac{\exp\left( \alpha_i(t) / \sqrt{d_k} \right)}{\sum_{j=1}^{N} \exp\left( \alpha_j(t) / \sqrt{d_k} \right)}\]

To incorporate multiple heads:

\[\text{CT-MHA}(Q, K, V, \omega)(t) = \text{Concat}(\text{head}^{(1)}(t), \ldots, \text{head}^{(H)}(t)) W^O \tag{7}\]

Each head is defined as:

\[\text{head}^{(h)}(t) = \text{CT-ATTN}(Q W_Q^{(h)}, K W_K^{(h)}, V W_V^{(h)}, \omega)(t)\]

Where $ W^O, W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} $ are learnable projection matrices, and $ h \in [1, H] $ is the head index.
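
Numerically, once the scores $\alpha_i(t)$ and expected values $\hat{v}_i(t)$ have been computed (e.g., with the quadrature sketch above), the attention output at a single query time is just a softmax-weighted sum:

```python
import numpy as np

def ct_attention(alphas, v_hats, d_k):
    """alphas: (N,) scores alpha_i(t); v_hats: (N, d) expected values v_hat_i(t).
    Returns CT-ATTN(Q, K, V, omega)(t) for one query time t."""
    scaled = alphas / np.sqrt(d_k)
    weights = np.exp(scaled - scaled.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ v_hats                   # (d,)
```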

  • Continuous-Time Transformer Layer
\[\tilde{z}^{l}(t) = \text{LN}\left( \text{CT-MHA}(X^l, X^l, X^l, \omega^l)(t) + x^l(t) \right) \tag{8}\] \[z^l(t) = \text{LN}\left( \text{FFN}(\tilde{z}^l(t)) + \tilde{z}^l(t) \right)\]

Where:

  • $ z^l(t) $ is the output from the $ l $-th ContiFormer layer at time $ t $
  • $ x^l(t) $ is a continuous interpolation of the discrete input $ X^l $
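
A discrete-time PyTorch sketch of equation (8), assuming the CT-MHA output and the interpolated input have already been sampled at a set of reference times (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ContiFormerLayerSketch(nn.Module):
    """Eq. (8) evaluated at discrete reference times: `attn_out` and `x` are
    the CT-MHA output and the interpolated input sampled at those times."""
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, attn_out, x):
        z_tilde = self.ln1(attn_out + x)                  # residual around CT-MHA
        return self.ln2(self.ffn(z_tilde) + z_tilde)      # residual around FFN
```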

Sampling Process

To stack ContiFormer layers:

  • Reference time points are chosen for each layer’s output
  • These points can either be input timestamps or task-specific time points
  • They are used to discretize the continuous outputs from each layer

Complexity Analysis

To approximate attention integrals efficiently:

  • Reparameterize the time domain to a fixed interval $[-1, 1]$
  • Apply Gauss-Legendre quadrature for fast numerical integration
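
A numpy sketch of this: after mapping each $[t_i, t]$ onto $[-1, 1]$, one fixed set of Gauss-Legendre nodes and weights can be reused for every $(i, t)$ pair, which is what makes parallel evaluation possible:

```python
import numpy as np

nodes, weights = np.polynomial.legendre.leggauss(8)   # fixed nodes on [-1, 1]

def gl_integral(g, t_i, t):
    """Approximate integral_{t_i}^{t} g(tau) dtau after mapping [t_i, t] -> [-1, 1]."""
    half = 0.5 * (t - t_i)
    taus = half * nodes + 0.5 * (t + t_i)   # change of variables
    return half * sum(w * g(tau) for w, tau in zip(weights, taus))

# E.g. the inner-product integral of eq. (3), up to the 1/(t - t_i) factor:
approx = gl_integral(lambda tau: np.sin(tau) * np.cos(tau), 0.0, 1.5)
```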

Experiment Setup

  • Use natural cubic splines to interpolate the query function into continuous time (see the spline sketch after the task list below)

  • Tasks:

    • Interpolation and extrapolation of time series
    • Irregular time series classification
    • Event prediction (Marked Temporal Point Processes)
    • Regular time series forecasting
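
A minimal spline sketch with `scipy`; `bc_type="natural"` gives a natural cubic spline with knots at the observation times, so $q(t_i) = Q_i$ holds exactly:

```python
import numpy as np
from scipy.interpolate import CubicSpline

t_obs = np.array([0.0, 0.4, 1.1, 2.0, 3.5])   # irregular timestamps t_1 .. t_N
Q = np.random.randn(5, 8)                     # queries Q_1 .. Q_N with d = 8

# Natural cubic spline interpolation: q(t_i) = Q_i exactly, and q(t) is
# available in closed form for any t in [t_1, t_N].
q = CubicSpline(t_obs, Q, axis=0, bc_type="natural")
q_mid = q(0.7)                                # query the process between knots
```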

Modeling a Continuous-Time Function

  • 300 two-dimensional spirals were generated (a generation sketch follows this list)
  • Each spiral was sampled at 150 evenly spaced time points
  • 50 irregular time points were randomly sampled per spiral
  • ContiFormer significantly outperformed both Transformer and Latent ODE on this task
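
A sketch of how such spiral data could be generated; the exact parameterization used in the paper may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 150)             # 150 evenly spaced time points

def make_spiral(clockwise: bool) -> np.ndarray:
    r = 0.5 + 0.25 * t                         # radius grows with time
    sign = -1.0 if clockwise else 1.0
    return np.stack([r * np.cos(sign * t), r * np.sin(sign * t)], axis=-1)

spirals = np.stack([make_spiral(rng.random() < 0.5) for _ in range(300)])
# Keep 50 randomly chosen time points per spiral to make the sampling irregular.
keep = [np.sort(rng.choice(150, size=50, replace=False)) for _ in range(300)]
irregular = np.stack([s[i] for s, i in zip(spirals, keep)])   # (300, 50, 2)
```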

Irregular Time Series Classification

  • 20 datasets from the UEA Time Series Classification Archive were used
  • 30%, 50%, and 70% of observations were randomly dropped to simulate irregularity (see the masking sketch after this list)
  • ContiFormer outperformed all baselines under all three drop ratios
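
A sketch of that masking step, assuming a regularly sampled array of shape $(T, D)$ and NaNs for dropped entries:

```python
import numpy as np

def drop_observations(x, drop_ratio, rng):
    """Randomly mask a fraction of the (time, variable) entries of a regularly
    sampled series x of shape (T, D) to simulate irregular sampling."""
    mask = rng.random(x.shape) >= drop_ratio   # True = observation kept
    return np.where(mask, x, np.nan), mask

rng = np.random.default_rng(0)
x_irr, mask = drop_observations(rng.standard_normal((100, 3)), 0.5, rng)
```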

Predicting Irregular Event Sequences (MTPP)

  • Synthetic Dataset
    • 10 event types, each with 3 properties
    • Predict the occurrence time of the next event and the type of the next event
  • Neonate
    • Predict when the next seizure will occur
  • Traffic (PeMS)
    • Predict when a traffic spike or drop will occur
    • Predict whether the change is upward or downward
  • MIMIC
    • Likely task: predict the time and type of the next clinical event
  • BookOrder
    • Predict the time of the next stock trade (buy/sell)
  • StackOverflow
    • Predict the type of badge a user will receive and when

Dataset link: Google Drive

Regular Time Series Forecasting

  • Datasets: ETT, Exchange, Weather, and ILI
  • ContiFormer performs competitively with other state-of-the-art models on regularly sampled time series as well