Irregular Time-Series Papers
Time Series as Images: Vision Transformer for Irregularly Sampled Time Series (Code)
There are numerous algorithms (LSTM, TCN, Transformer) for time series modeling, but they mainly target regularly sampled intervals and fixed-size numerical inputs. Models have been developed for irregular time series, but they are highly specialized and require substantial prior knowledge and effort in model architecture selection and algorithm design.
- Main idea:
- Transform irregularly sampled multivariate time series into line graphs
- Organize them into an RGB image format
- Fine-tune a pre-trained vision transformer for classification using those images
- Has superior performance over SoTA methods specifically designed for irregularly sampled time series.
- Has strong robustness to missing observations; it surpasses the previous leading solution by 42% in absolute F1 score when half of the variables are masked in the test set.
- Approach
- Transform multivariate time series into a concatenated line graph image
- Use a pre-trained vision transformer as an image classifier.
Data = $\{(S_i, y_i) \mid i = 1, \ldots, N\}$, where $N$ is the number of samples, $S_i$ is a data sample (it can contain at most $D$ types of observations, some of which may have no observation), and $y_i \in \{1, \ldots, C\}$, where $C$ is the number of classes.
- Image creation
- Plot the line graph for each variable (The scales of each line graph $g_{i,d}$ are kept the same across different time series $S_i$)
- Used grid size of $l \times l$ or $l \times (l+1)$ based on the maximum number of variables present
- Any grid not occupied by a line graph is kept empty
- They studied the effects of different markers, line thicknesses, line types, the order of variables in the graph, and the colors used to represent the lines of different variables.
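A minimal rendering sketch of this image-creation step (not the authors' code; it assumes matplotlib, and the function name, image size, and marker choice are illustrative):

```python
# Render one irregularly sampled multivariate series as a grid of line-graph
# subplots and save it as an RGB image (a sketch, not the paper's implementation).
import math
import numpy as np
import matplotlib.pyplot as plt

def series_to_image(times, values, variables, out_path="sample.png",
                    y_limits=None, img_size=384):
    """times/values/variables are parallel 1-D arrays: each entry is one
    observation (timestamp, value, variable index). y_limits maps a variable
    index to a fixed (min, max) so scales stay identical across samples."""
    times, values, variables = map(np.asarray, (times, values, variables))
    num_vars = int(variables.max()) + 1
    rows = int(math.floor(math.sqrt(num_vars)))        # grid of roughly l x l
    cols = int(math.ceil(num_vars / rows))             # or l x (l+1) cells
    dpi = 100
    fig, axes = plt.subplots(rows, cols,
                             figsize=(img_size / dpi, img_size / dpi), dpi=dpi)
    axes = np.atleast_1d(axes).ravel()
    for d in range(num_vars):
        mask = variables == d
        axes[d].plot(times[mask], values[mask], marker="*", linewidth=1)
        if y_limits is not None:                       # fixed per-variable scale
            axes[d].set_ylim(*y_limits[d])
    for ax in axes:                                    # no ticks; unused cells stay empty
        ax.axis("off")
    fig.savefig(out_path)                              # RGB image fed to the ViT
    plt.close(fig)
```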
- Vision Transformer for Time series modeling
- Vision Transformer (ViT) - originally adapted from the Transformer architecture in NLP
- Input image is split into fixed-size patches, and each patch is linearly embedded and augmented with a position embedding.
- The paper used Swin Transformer to reduce computational complexity.
- They note that the Swin Transformer can transfer knowledge obtained from pre-training on natural images to their synthetic time-series line graphs, since performance dropped drastically without pre-training.
- Tasks:
- Sepsis prediction on the P19 dataset
- Mortality prediction on the P12 dataset
- Physical activity classification on the PAM dataset
Static features such as demographic information and weight were first converted to natural-language sentences, encoded with a text encoder (RoBERTa-base), and then concatenated with the image embeddings obtained from the vision transformer to perform classification.
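A hedged sketch of this late-fusion step (RoBERTa-base is from the paper, but the classifier, dimensions, and example sentence here are illustrative assumptions):

```python
# Encode static features as a sentence with RoBERTa-base, concatenate with the
# vision-transformer image embedding, and classify. Not the authors' exact code.
import torch
import torch.nn as nn
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base")

class FusionClassifier(nn.Module):
    def __init__(self, image_dim, text_dim=768, num_classes=2):
        super().__init__()
        self.head = nn.Linear(image_dim + text_dim, num_classes)

    def forward(self, image_emb, text_emb):
        # image_emb: (B, image_dim) from the vision transformer
        # text_emb:  (B, text_dim) pooled RoBERTa representation
        return self.head(torch.cat([image_emb, text_emb], dim=-1))

sentence = ["The patient is a 64 year old female weighing 70 kg."]  # illustrative
tokens = tokenizer(sentence, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state[:, 0]  # <s> token embedding

image_emb = torch.randn(1, 1024)          # stand-in for the Swin/ViT embedding
logits = FusionClassifier(image_dim=1024)(image_emb, text_emb)
```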
ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling (Code) (Summary)
- Main contributions:
- Incorporate a continuous-time mechanism into the attention calculation of the Transformer; this captures the continuity of the underlying system behind irregularly sampled time-series data.
- Proposes a reparameterization technique that allows for the execution of the continuous-time attention in different time ranges in parallel.
- Provides a general framework in which other Transformer variants can be seen as special cases of ContiFormer.
- Outperforms existing models in time-series interpolation, classification, and prediction.
- How?
- Takes as input an irregular time series $X$ and the sampled times $\omega$, and outputs a latent continuous trajectory that captures the dynamic change of the underlying system.
Continuous-Time Multi-Head Attention Mechanism
- Transform the input irregular time series $X$ into queries, keys, and values: $Q = [Q_1; Q_2; \ldots; Q_N]$, $K = [K_1; K_2; \ldots; K_N]$, $V = [V_1; V_2; \ldots; V_N]$
- Use ODEs to define latent trajectories for each observation.
- Define keys and values as latent trajectories driven by ODEs, with the observation as the initial condition:
\[k_i(t) = K_i + \int_{t_i}^{t} f\big(\tau, k_i(\tau); \theta_k\big)\, d\tau, \qquad v_i(t) = V_i + \int_{t_i}^{t} f\big(\tau, v_i(\tau); \theta_v\big)\, d\tau\]
where $t \in [t_1, t_N]$, and $k_i(\cdot), v_i(\cdot) \in \mathbb{R}^d$ are the ODE-defined trajectories for the $i$-th observation with parameters $\theta_k$ and $\theta_v$. The function $f(\cdot): \mathbb{R}^{d+1} \rightarrow \mathbb{R}^d$ controls the change in dynamics.
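A numerical sketch of evolving one key trajectory (assuming the torchdiffeq package and an illustrative MLP for $f$; not the authors' implementation):

```python
# Evolve a key K_i from its observation time t_i along dk/dt = f(t, k; theta_k)
# to an arbitrary query time, using torchdiffeq's ODE solver.
import torch
import torch.nn as nn
from torchdiffeq import odeint

class Dynamics(nn.Module):
    """f(t, k): R^{d+1} -> R^d, controls how the latent key changes over time."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, d), nn.Tanh(), nn.Linear(d, d))

    def forward(self, t, k):
        t_feat = t * torch.ones(k.shape[:-1] + (1,))   # broadcast the scalar time
        return self.net(torch.cat([k, t_feat], dim=-1))

d = 8
f_k = Dynamics(d)                       # parameters theta_k
K_i = torch.randn(1, d)                 # key of the i-th observation, taken at t_i
t_i, t_query = 0.3, 0.9
# integrate from t_i to the query time; the last slice is k_i(t_query)
k_traj = odeint(f_k, K_i, torch.tensor([t_i, t_query]))
k_i_at_query = k_traj[-1]
```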
- A closed-form continuous-time interpolation function $q(\cdot)$ with knots at $t_1, \ldots, t_N$, satisfying $q(t_i) = Q_i$, is used as an approximation of the underlying query process.
- Scaled dot product: the attention score $\alpha_i(t)$ generalizes the query–key dot product to the continuous-time trajectories
- Expected value: the aggregated value $\hat{v}_i(t)$ is defined as an expectation of the value trajectory $v_i(\cdot)$ over time
Multi-Head Attention
Given the predefined queries, keys, and values in continuous-time space, the continuous-time attention at query time $t$ is:
\[\text{CT-ATTN}(Q, K, V, \omega)(t) = \sum_{i=1}^{N} \hat{\alpha}_i(t) \cdot \hat{v}_i(t)\]
where
\[\hat{\alpha}_i(t) = \frac{\exp\left( \alpha_i(t) / \sqrt{d_k} \right)}{\sum_{j=1}^{N} \exp\left( \alpha_j(t) / \sqrt{d_k} \right)}\]
To incorporate multiple heads:
\[\text{CT-MHA}(Q, K, V, \omega)(t) = \text{Concat}\left(\text{head}^{(1)}(t), \ldots, \text{head}^{(H)}(t)\right) W^O\]
Each head is defined as:
\[\text{head}^{(h)}(t) = \text{CT-ATTN}(Q W_Q^{(h)}, K W_K^{(h)}, V W_V^{(h)}, \omega)(t)\]
where $W^O, W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ are learnable projection matrices and $h \in [1, H]$ is the head index.
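A minimal numerical sketch of CT-ATTN evaluated at a single query time $t$ (stand-in tensors; in the model the scores and values would come from the trajectory integrals above):

```python
# Softmax the scores alpha_i(t)/sqrt(d_k) over the N observations, then take
# the weighted sum of the value trajectories v_hat_i(t).
import torch

def ct_attention_at_t(alpha_t, v_hat_t, d_k):
    # alpha_t: (N,) scores alpha_i(t); v_hat_t: (N, d) values v_hat_i(t)
    weights = torch.softmax(alpha_t / d_k ** 0.5, dim=0)      # \hat{alpha}_i(t)
    return (weights.unsqueeze(-1) * v_hat_t).sum(dim=0)       # sum_i a_i(t) v_i(t)

N, d_k = 5, 8
alpha_t = torch.randn(N)          # stand-in attention scores at time t
v_hat_t = torch.randn(N, d_k)     # stand-in value trajectories at time t
out_t = ct_attention_at_t(alpha_t, v_hat_t, d_k)   # latent state at query time t
```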
- Continuous-Time Transformer Layer
Each layer combines the continuous-time multi-head attention with a position-wise feed-forward network, residual connections, and layer normalization, where:
- $ z^l(t) $ is the output from the $ l $-th ContiFormer layer at time $ t $
- $ x^l(t) $ is a continuous interpolation of the discrete input $ X^l $
Sampling Process
To stack ContiFormer layers:
- Reference time points are chosen for each layer’s output
- These points can either be input timestamps or task-specific time points
- They are used to discretize the continuous outputs from each layer
Complexity Analysis
To approximate attention integrals efficiently:
- Reparameterize the time domain to a fixed interval $[-1, 1]$
- Apply Gauss-Legendre quadrature for fast numerical integration
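A small sketch of the quadrature idea (using NumPy's Gauss-Legendre nodes; the node count is an arbitrary choice): mapping $[a, b]$ onto the fixed interval $[-1, 1]$ lets integrals over different time ranges reuse the same node grid.

```python
# Approximate an integral over [a, b] by reparameterizing to [-1, 1] and
# applying Gauss-Legendre quadrature.
import numpy as np

def gauss_legendre_integral(f, a, b, num_nodes=16):
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)  # nodes in [-1, 1]
    t = 0.5 * (b - a) * nodes + 0.5 * (b + a)                    # map to [a, b]
    return 0.5 * (b - a) * np.sum(weights * f(t))

# e.g. integrate cos over [0, pi/2]; the exact answer is 1
approx = gauss_legendre_integral(np.cos, 0.0, np.pi / 2)
```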
Experiment Setup
- Use natural cubic splines to interpolate the query function into continuous time (a minimal spline sketch follows this list)
- Tasks:
- Interpolation and extrapolation of time series
- Irregular time series classification
- Event prediction (Marked Temporal Point Processes)
- Regular time series forecasting
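A minimal sketch of the natural-spline interpolation of the queries (using SciPy's `CubicSpline`; the data here is a random stand-in):

```python
# Fit a natural cubic spline per feature dimension with knots at the
# observation times, so q(t) can be evaluated at arbitrary continuous times.
import numpy as np
from scipy.interpolate import CubicSpline

t_obs = np.array([0.0, 0.4, 0.9, 1.7, 2.0])      # irregular observation times
Q = np.random.randn(len(t_obs), 8)               # discrete queries Q_1..Q_N

q = CubicSpline(t_obs, Q, axis=0, bc_type="natural")    # q(t_i) = Q_i at the knots
q_dense = q(np.linspace(t_obs[0], t_obs[-1], 100))      # trajectory, shape (100, 8)
```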
Modeling a Continuous-Time Function
- 300 two-dimensional spirals were generated
- Each spiral was sampled at 150 evenly spaced time points
- 50 irregular time points were then randomly sampled per spiral (see the generation sketch after this list)
- ContiFormer significantly outperformed both Transformer and Latent ODE on this task
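An illustrative generation sketch (the spiral shape parameters and noise level are assumptions, not the paper's exact settings):

```python
# Build 2-D spirals at 150 evenly spaced time points, then randomly keep 50
# points per spiral to create the irregularly sampled version.
import numpy as np

rng = np.random.default_rng(0)
num_spirals, num_points, num_keep = 300, 150, 50

t = np.linspace(0, 4 * np.pi, num_points)
spirals, samples = [], []
for _ in range(num_spirals):
    a, b = rng.uniform(0.5, 1.5, size=2)          # per-spiral shape parameters
    x = a * t * np.cos(t) + 0.05 * rng.standard_normal(num_points)
    y = b * t * np.sin(t) + 0.05 * rng.standard_normal(num_points)
    spirals.append(np.stack([x, y], axis=1))       # (150, 2) regular trajectory
    idx = np.sort(rng.choice(num_points, size=num_keep, replace=False))
    samples.append((t[idx], spirals[-1][idx]))     # 50 irregular observations
```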
Irregular Time Series Classification
- 20 datasets from the UEA Time Series Classification Archive were used
- 30%, 50%, and 70% of observations were randomly dropped to simulate irregularity (a minimal dropping sketch follows this list)
- ContiFormer outperformed all baselines at all three drop ratios
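A minimal dropping sketch (an assumed helper, not from the paper) for simulating irregular sampling from a regular series:

```python
# Randomly drop a fraction of the observations (drop ratios 0.3 / 0.5 / 0.7).
import numpy as np

def drop_observations(times, values, drop_ratio, seed=0):
    rng = np.random.default_rng(seed)
    n = len(times)
    keep = np.sort(rng.choice(n, size=int(round(n * (1 - drop_ratio))), replace=False))
    return times[keep], values[keep]

times = np.arange(100, dtype=float)
values = np.random.randn(100, 3)
t_irr, v_irr = drop_observations(times, values, drop_ratio=0.5)
```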
Predicting Irregular Event Sequences (MTPP)
- Synthetic Dataset
- 10 event types, each with 3 properties
- Predict the occurrence time of the next event and the type of the next event
- Neonate
- Predict when the next seizure will occur
- Traffic (PeMS)
- Predict when a traffic spike or drop will occur
- Predict whether the change is upward or downward
- MIMIC
- Likely task: predict the time and type of the next clinical event
- BookOrder
- Predict the time of the next stock trade (buy/sell)
- StackOverflow
- Predict the type of badge a user will receive and when
Dataset link: Google Drive
Regular Time Series Forecasting
- Datasets: ETT, Exchange, Weather, and ILI
- ContiFormer performs competitively with other state-of-the-art models on regularly sampled time series as well