ChronoEpilogi
__init__(data: pd.DataFrame, target: str | tuple[str, str], phases: str = 'FB', equivalence_early_stopping: bool = True, forward_test_threshold: float = 0.05, backward_test_threshold: float = 0.05, equivalence_test_threshold: float = 0.05, equivalence_correlation_threshold: float = 0.05, equivalence_heuristic: str = 'parcorr', maximal_selected_size: float = np.inf, model_class: None | LearningModel = None, model_config: None | dict = None, association_class: None | Association = None, association_config: None | dict = None, partial_correlation_class: None | PartialCorrelation = None, partial_correlation_config: None | dict = None, start_with_univariate_autoregressive_model: bool | str = 'infer', model_test_method: None | str = None, target_type: str = 'continuous', default_k: int = 1, default_max_lag: int = 1, variable_types: None | dict = None, backward_removal_strategy: str = 'first', valid_obs_param_ratio: float = 0.0) -> None
Initialize ChronoEpilogi.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame containing the multivariate time series. The index should correspond to timesteps and the columns to covariates. The column index may have one or two levels. See Notes. | required |
| target | str \| tuple[str, str] | The forecasting/regression target, given as the name of a column in the provided DataFrame. A single string or a tuple of two strings, depending on the number of column levels in data. See Notes. | required |
| phases | str | Sequence of phases to run. Should be one of "F", "FB", "Fg", "FgV", "FBG", "FBGV", "FBE", "FBEV". F stands for the forward phase, B for the backward phase, E for the equivalence phase, and V for the verification phase. Alternatively, equivalences can be checked during the forward phase with "Fg" and "FgV". The recommended choice is "FB" for selecting a single set of time series (TS), and "FBEV" for computing equivalences. | 'FB' |
| equivalence_early_stopping | bool | Set to True to use the early-stopping heuristic during the equivalence phase. This heuristic tests candidate equivalences to a selected TS in decreasing order of correlation with the residuals, and stops testing after the first non-equivalent TS. The recommended choice is True. | True |
| forward_test_threshold | float | Threshold on the model difference metric (returned by models.LearningModel.stopping_metric). Used during the forward phase only. A lower threshold corresponds to stricter tests of performance increase and leads to a smaller selected set after the forward phase. | 0.05 |
| backward_test_threshold | float | Threshold of model equivalence used during the backward phase. A lower threshold leads to more removals from the selected set. | 0.05 |
| equivalence_test_threshold | float | Threshold of model equivalence / partial correlation used during the equivalence phase. A lower threshold leads to more detected equivalences. | 0.05 |
| equivalence_correlation_threshold | float | Threshold of the correlation test used during the equivalence phase when equivalence_heuristic is set to "parcorr". A lower threshold leads to fewer detected equivalences. | 0.05 |
| equivalence_heuristic | str | Version of the equivalence detection test to use. Should be one of "exact", "resid", "parcorr". "exact" uses full model tests to test X equiv(T) Y \| Z, leading to high computation times. "resid" (heuristic) replaces that test with X equiv(Residuals(T~Z)) Y, which removes the dependency on the size of the selected set, here represented by Z. "parcorr" (heuristic) is a composite test attending to a subset of lags individually, which removes the dependency on both the size of the selected set and the number of lags (or the size of the second column level). The recommended choice is "parcorr". | 'parcorr' |
| maximal_selected_size | float | Upper bound on the number of covariates included during the forward phase. The default is np.inf, so the forward phase ends only when model equivalence is reached. | inf |
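As an illustration of the early-stopping heuristic described under equivalence_early_stopping, here is a minimal sketch in plain Python. It is not the library's implementation: the function name, the `residual_correlation` mapping, the `is_equivalent` callback, and the use of the absolute correlation for ordering are all assumptions made for this example.

```python
from typing import Callable, Iterable, List, Mapping


def equivalents_with_early_stopping(
    candidates: Iterable[str],
    residual_correlation: Mapping[str, float],
    is_equivalent: Callable[[str], bool],
) -> List[str]:
    """Hypothetical helper: test candidates in decreasing order of
    (absolute) correlation with the residuals, stopping at the first failure."""
    ordered = sorted(
        candidates, key=lambda name: abs(residual_correlation[name]), reverse=True
    )
    equivalents = []
    for name in ordered:
        if not is_equivalent(name):
            break  # early stop: the remaining candidates are never tested
        equivalents.append(name)
    return equivalents


# Toy inputs (made up for illustration).
corr = {"a": 0.9, "b": -0.7, "c": 0.2, "d": 0.6}
passes = {"a": True, "b": True, "c": True, "d": False}
tested = []


def toy_test(name: str) -> bool:
    tested.append(name)  # record which candidates were actually tested
    return passes[name]


result = equivalents_with_early_stopping(corr, corr, toy_test)
```

In this toy run, "c" would pass the test but is never examined, because "d" ranks higher and fails first. That is the trade-off of the heuristic: fewer tests, at the risk of missing weakly correlated equivalents.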
Other Parameters:
| Name | Type | Description |
|---|---|---|
| model_class | None \| LearningModel | A custom model class inheriting from LearningModel, suited to the specific task and data. When left to None, a default model is inferred from the arguments target_type, start_with_univariate_autoregressive_model, and default_max_lag. Passing a model class and its arguments explicitly is recommended. |
| model_config | None \| dict | Configuration dictionary passed to the model class. If model_class is None, model_config is inferred similarly. |
| association_class | None \| Association | A custom association class inheriting from Association, suited to the specific task and data. When left to None, a default association is inferred. Passing an association class and its arguments explicitly is recommended. |
| association_config | None \| dict | Configuration dictionary passed to the association class. If association_class is None, association_config is inferred from default_max_lag. |
| partial_correlation_class | None \| PartialCorrelation | A custom partial correlation class inheriting from PartialCorrelation, suited to the specific task and data. When left to None, a default partial correlation is inferred. Passing a partial correlation class and its arguments explicitly is recommended. |
| partial_correlation_config | None \| dict | Configuration dictionary passed to the partial correlation class. If partial_correlation_class is None, partial_correlation_config is inferred from default_max_lag and default_k. |
| start_with_univariate_autoregressive_model | bool \| str | Whether the forecasting/regression task should include the past of the forecasted quantity. Use True for an autoregressive model, or False to exclude the past of the forecasted series. The value "infer" sets it to True for a single-level column index. For a two-level column index, it should be set to "infer" or False, never True. |
| model_test_method | None \| str | Parameter passed to the stopping_metric method of the LearningModel. |
| target_type | str | Whether the target is "continuous", "count" or "binary". Only used when model_class is None, to infer the type of model to use. |
| default_k | int | The number of individual lags (or second-level columns) compared by the "parcorr" equivalence heuristic. Only used when partial_correlation_class is None and equivalence_heuristic is "parcorr". |
| default_max_lag | int | The size of the lag window. Only used when the data has a single-level column index and at least one of model_class, association_class, or partial_correlation_class is None. |
| variable_types | None \| dict | For each first-level column, variable_types[column] is one of "numeric" or "categorical". When set to None, or when some columns of the provided data are missing from the dictionary, the type "numeric" is used for the missing columns. Only used when either association_class or partial_correlation_class is None. |
| backward_removal_strategy | str | Whether to remove the first redundant covariate found ("first") or the most redundant covariate ("max") during the backward phase. The recommended value is "first". |
| valid_obs_param_ratio | float | Parameter passed to the has_too_many_parameters method of the LearningModel. May be deprecated in the future. |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
When the input DataFrame has a single-level column index, the data is assumed to be in the preferred form for forecasting. The forecasting scenario is to predict the target at time t, given a window of covariates from time t-1 to t-lags. The windowing operation is left to the modules (model, association, partial correlation).
When the input DataFrame has a two-level column index, the data is assumed to be in the preferred form for tabular data regression. The regression scenario is to predict the target at time t, given the covariates at time t. Hence, each row corresponds to a pair (input, output), similarly to tabular data. In that situation, columns are grouped according to the first-level column index, and ChronoEpilogi selects groups of columns according to that first level. When a first-level column is selected, all corresponding second-level columns are included in the new model. The role of the second-level columns is similar to the role of the lag window for single-level data. The "parcorr" heuristic (see parameter equivalence_heuristic) attends to a subset of those second-level columns instead of lags.
In fact, a single-level DataFrame can be transformed into a two-level DataFrame. It suffices to create a column for each lag of each covariate, using the lag as the second-level label and the original column name as the first-level label.
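The transformation described above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the helper name `to_two_level` and the `lag1`, `lag2`, ... labels are hypothetical, and the sketch builds the lagged covariate columns only. How the target column at time t should be labelled in the two-level frame is not specified here, so it is left out.

```python
import pandas as pd


def to_two_level(data: pd.DataFrame, max_lag: int) -> pd.DataFrame:
    """Hypothetical helper: one (column, lag) pair per lag of each covariate."""
    lagged = {}
    for name in data.columns:
        for lag in range(1, max_lag + 1):
            # Row t of (name, "lag<k>") holds the value of `name` at time t-k.
            lagged[(name, f"lag{lag}")] = data[name].shift(lag)
    wide = pd.concat(list(lagged.values()), axis=1)
    # First level: original column name. Second level: lag label.
    wide.columns = pd.MultiIndex.from_tuples(list(lagged.keys()))
    # The first `max_lag` rows have incomplete windows and are dropped.
    return wide.dropna()


# Toy single-level data (made up for illustration).
data = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [10, 11, 12, 13, 14]})
wide = to_two_level(data, max_lag=2)
print(wide)
```

Each first-level label then groups the lags of one original series, which matches the grouping by first-level column described in the Notes.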
fit(data: None | pd.DataFrame = None, config: None | dict = None) -> None
Runs the ChronoEpilogi algorithm.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | None \| DataFrame | Replaces the 2D pandas DataFrame containing the data for ChronoEpilogi. If a DataFrame is provided, all learned structures are reset and ChronoEpilogi runs from scratch. | None |
| config | None \| dict | Updates to the parameters of ChronoEpilogi. Keys correspond to keyword arguments of the init method. The learned structures that depend on any key in the new parameters are reset. See Notes and Examples. | None |
Returns:
| Type | Description |
|---|---|
| None | |
Notes
Passing a configuration to the fit method makes hyperparameter search easier, because the entire algorithm does not have to be recomputed from scratch. For instance, changing equivalence_test_threshold may modify the learned equivalences, but does not affect model and association computations. Hence, running ChronoEpilogi again with a different equivalence threshold does not require recomputing models and associations.
The previously learned structures are reset minimally, depending on the keys in the new configuration:
- Parameters that affect models affect all structures, and hence require running ChronoEpilogi from scratch.
- Parameters that affect associations lead to the reset of the computed associations.
- Parameters that affect partial correlations lead to the reset of the computed partial correlations.
- Thresholds, phases and heuristics do not affect the learned structures.
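The reset rules above can be pictured with the following plain-Python sketch. It is not the library's code: the function name, the structure labels, and in particular the assignment of each keyword to a group are assumptions made for illustration. Any key the sketch does not classify is treated conservatively as requiring a full reset.

```python
from __future__ import annotations

# Hypothetical grouping of configuration keys; the real grouping may differ.
ASSOCIATION_KEYS = {"association_class", "association_config"}
PARTIAL_CORRELATION_KEYS = {"partial_correlation_class", "partial_correlation_config"}
NO_RESET_KEYS = {
    "phases",
    "equivalence_early_stopping",
    "equivalence_heuristic",
    "forward_test_threshold",
    "backward_test_threshold",
    "equivalence_test_threshold",
    "equivalence_correlation_threshold",
}
ALL_STRUCTURES = {"models", "associations", "partial_correlations"}


def structures_to_reset(config: dict) -> set[str]:
    """Return the (hypothetical) learned structures invalidated by `config`."""
    to_reset: set[str] = set()
    for key in config:
        if key in NO_RESET_KEYS:
            continue  # thresholds, phases, heuristics: nothing to recompute
        if key in ASSOCIATION_KEYS:
            to_reset.add("associations")
        elif key in PARTIAL_CORRELATION_KEYS:
            to_reset.add("partial_correlations")
        else:
            # Model-related or unclassified key: everything is recomputed.
            return set(ALL_STRUCTURES)
    return to_reset


print(structures_to_reset({"equivalence_test_threshold": 0.01}))
print(structures_to_reset({"association_config": {}}))
print(structures_to_reset({"model_config": {}}))
```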
Examples:
get_equivalence_classes() -> List[List[str]]
Returns the list of equivalence classes.
Returns:
| Name | Type | Description |
|---|---|---|
| eq_classes | List[List[str]] | The equivalence classes, each a list of TS names (for a single-level column index) or group names (for a two-level column index). |
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(10001, 100)), columns=list(map(str, range(100))))
>>> data.loc[1:,"0"] = data["1"].shift(1) + data["2"].shift(1)
>>> data["3"] = 0.4*data["1"]+0.3
>>> tss_instance = ChronoEpilogi(data, "0", phases="FBEV")
>>> tss_instance.fit()
>>> tss_instance.get_equivalence_classes()
[['0'], ['2'], ['1', '3']]
get_first_markov_boundary() -> List[str]
Returns the Markov Boundary computed during the forward-backward phases.
Returns:
| Name | Type | Description |
|---|---|---|
| markov_boundary | list[str] | The Markov Boundary as a list of TS names (for a single-level column index) or group names (for a two-level column index). |
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(10001, 100)), columns=list(map(str, range(100))))
>>> data.loc[1:,"0"] = data["1"].shift(1) + data["2"].shift(1)
>>> tss_instance = ChronoEpilogi(data, "0")
>>> tss_instance.fit()
>>> tss_instance.get_first_markov_boundary()
['0', '1', '2']
get_markov_boundary_from_index(index: int) -> List[str]
Returns the Markov Boundary corresponding to the provided index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | The index of the Markov Boundary. Must be between 0 and self.get_total_number_markov_boundaries() - 1, inclusive. Each index corresponds to a unique Markov Boundary. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| markov_boundary | list[str] | The Markov Boundary as a list of TS names (for a single-level column index) or group names (for a two-level column index). |
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(10001, 100)), columns=list(map(str, range(100))))
>>> data.loc[1:,"0"] = data["1"].shift(1) + data["2"].shift(1)
>>> data["3"] = 0.4*data["1"]+0.3
>>> tss_instance = ChronoEpilogi(data, "0", phases="FBEV")
>>> tss_instance.fit()
>>> tss_instance.get_markov_boundary_from_index(0), tss_instance.get_markov_boundary_from_index(1)
(['0', '2', '1'], ['0', '2', '3'])
get_total_number_markov_boundaries() -> int
Returns the total number of Markov Boundaries.
Returns:
| Name | Type | Description |
|---|---|---|
| total | int | The number of Markov Boundaries computed during the equivalence phase. |
Notes
When the Markov Boundaries are represented as a set of equivalence classes, the number of Markov Boundaries is the product of the sizes of the equivalence classes, since each boundary picks exactly one member from every class.
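The counting rule can be checked with a few lines of standard-library Python, using the equivalence classes from the get_equivalence_classes example. This is a sketch of the arithmetic only: it does not call ChronoEpilogi, and the enumeration order produced by itertools.product is not claimed to be the order used by get_markov_boundary_from_index.

```python
from itertools import product
from math import prod

# Equivalence classes as returned in the get_equivalence_classes example.
eq_classes = [["0"], ["2"], ["1", "3"]]

# Number of Markov Boundaries: product of the class sizes (1 * 1 * 2).
total = prod(len(eq_class) for eq_class in eq_classes)

# Each boundary picks exactly one member from every equivalence class.
boundaries = [list(choice) for choice in product(*eq_classes)]

print(total)       # 2
print(boundaries)  # [['0', '2', '1'], ['0', '2', '3']]
```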
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(10001, 100)), columns=list(map(str, range(100))))
>>> data.loc[1:,"0"] = data["1"].shift(1) + data["2"].shift(1)
>>> data["3"] = 0.4*data["1"]+0.3
>>> tss_instance = ChronoEpilogi(data, "0", phases="FBEV")
>>> tss_instance.fit()
>>> tss_instance.get_total_number_markov_boundaries()
2