Association
TemporalSlowAssociation
Bases: Association
Temporal data mixed-type association.
Notes
For continuous data, we use Pearson Correlation with mass implementation. For categorical data, we use an ANOVA test between the residuals and the tested series.
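As a rough illustration of the categorical branch described above, the sketch below splits the residuals by category level and runs a one-way test from `scipy.stats`. The helper name `categorical_pvalue` is hypothetical and this is not the library's internal code; it only assumes that the `categorical_method` names map to the `scipy.stats` functions of the same name.

```python
import numpy as np
import pandas as pd
from scipy import stats


def categorical_pvalue(residuals: pd.Series, categories: pd.Series,
                       method: str = "f_oneway") -> float:
    """Hypothetical helper: p-value of a one-way test of residuals across category levels."""
    # Assumption: the configured method name is looked up directly in scipy.stats.
    test = getattr(stats, method)
    # Build one sample of residuals per category level.
    samples = [group.to_numpy() for _, group in residuals.groupby(categories)]
    return float(test(*samples).pvalue)


rng = np.random.default_rng(0)
categories = pd.Series(rng.integers(0, 3, size=300))
# Residuals whose mean depends on the category, plus unit-variance noise.
residuals = 2.0 * categories + pd.Series(rng.normal(size=300))

p_dependent = categorical_pvalue(residuals, categories)
p_independent = categorical_pvalue(pd.Series(rng.normal(size=300)), categories)
```

A small p-value indicates that the mean residual differs between category levels, which is what the association score is meant to pick up.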
__init__(config: dict)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Must contain an entry for: - "lags": int, the number of lags to compute the correlation over - "categorical_method": str, any of 'f_oneway', 'kruskal', 'alexandergovern'. This specifies the kind of test used for categorical data. - "numerical_method": str, any of 'pearsonr', 'spearmanr'. - "variable_types": dict, for each variable name, whether it is "numerical" or "categorical". See examples. - "n_jobs": int, the number of processors used in parallel. Must be different from 0. See joblib.Parallel for more information. - "check_na": bool, if True, checks that there is no NaN in the variables and residuals DataFrames. | required |
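For reference, a configuration dictionary spelling out all six entries listed above might look like the following. The values are illustrative assumptions, not defaults taken from the library; only the key names and allowed values come from the description.

```python
# Illustrative variable names; each one is declared "numerical" or "categorical".
variable_types = {"target": "numerical", "1": "numerical", "2": "categorical"}

config = {
    "lags": 10,                        # number of lags to compute the correlation over
    "categorical_method": "f_oneway",  # or "kruskal", "alexandergovern"
    "numerical_method": "pearsonr",    # or "spearmanr"
    "variable_types": variable_types,
    "n_jobs": 1,                       # must not be 0; joblib treats 1 as sequential, -1 as all CPUs
    "check_na": True,                  # check the variables and residuals DataFrames for NaN
}
```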
Returns:
| Type | Description |
|---|---|
| None | |
Examples:
>>> data = pd.DataFrame(np.random.random(size=(1000,5)),columns=["target","1","2","3","4"])
>>> variable_types = dict([(column, "numerical") for column in data.columns])
>>> asso = TemporalSlowAssociation({"lags":10,"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso
Or with mixed types:
>>> numerical = pd.DataFrame(np.random.random(size=(1000,3)),columns=["target","1","2"])
>>> categorical = pd.DataFrame(np.random.randint(0,3,size=(1000,2)),columns=["3","4"])
>>> data = pd.concat([numerical,categorical], axis="columns")
>>> variable_types = {"target":"numerical","1":"numerical","2":"numerical","3":"categorical","4":"categorical"}
>>> asso = TemporalSlowAssociation({"lags":10,"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso
association(residuals_df: pd.DataFrame, variables_df: pd.DataFrame) -> np.array
Computes the association score between the residuals and candidate time series.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| residuals_df | DataFrame | DataFrame of shape (ntimesteps, 1) containing the model residuals of a learning model. The index must be aligned with variables_df. | required |
| variables_df | DataFrame | DataFrame of shape (ntimesteps, D) containing the D time series to test for association with the residuals. The index must be aligned with residuals_df. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| pvalues | array | A 1D numpy array containing, for each of the D time series tested, minus the minimal p-value across lags. The entries are in the same order as variables_df.columns. By convention the p-value is negated, so that the largest returned value corresponds to the strongest association. |
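The return convention can be sketched for a single numerical series as follows. The helper `neg_min_pvalue_over_lags` is hypothetical, and the lag range (0 to `lags - 1`) and shift direction are assumptions made for illustration; the library's own implementation may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats


def neg_min_pvalue_over_lags(residuals: pd.Series, candidate: pd.Series, lags: int) -> float:
    """Hypothetical helper: minus the smallest Pearson p-value over the tested lags."""
    pvalues = []
    for lag in range(lags):
        shifted = candidate.shift(lag)   # delay the candidate by `lag` steps
        mask = shifted.notna()           # drop the rows left empty by the shift
        _, pvalue = stats.pearsonr(residuals[mask], shifted[mask])
        pvalues.append(pvalue)
    return -min(pvalues)


rng = np.random.default_rng(0)
candidate = pd.Series(rng.normal(size=500))
# Residuals that follow the candidate with a delay of 3 steps, plus small noise.
residuals = candidate.shift(3).fillna(0.0) + 0.1 * pd.Series(rng.normal(size=500))

score = neg_min_pvalue_over_lags(residuals, candidate, lags=5)
```

Scores are never positive, and a score close to zero means that at least one lag shows a strong association.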
Examples:
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(1000,5)),columns=["target","1","2","3","4"])
>>> variable_types = dict([(column, "numerical") for column in data.columns])
>>> asso = TemporalSlowAssociation({"lags":10,"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso.association(data[["target"]], data[["1","2","3","4"]])
array([-0.03384917, -0.02838155, -0.0633841 , -0.15107386])
Or with mixed types:
>>> rng = np.random.default_rng(0)
>>> numerical = pd.DataFrame(rng.random(size=(1000,3)),columns=["target","1","2"])
>>> categorical = pd.DataFrame(rng.integers(0,3,size=(1000,2)),columns=["3","4"])
>>> data = pd.concat([numerical,categorical], axis="columns")
>>> variable_types = {"target":"numerical","1":"numerical","2":"numerical","3":"categorical","4":"categorical"}
>>> asso = TemporalSlowAssociation({"lags":10,"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso.association(data[["target"]], data[["1","2","3","4"]])
array([-0.03111284, -0.04568282, -0.03302831, -0.02551908])
CrossSectionalAssociation
Bases: Association
Cross-sectional, mixed-type, grouped data association.
Notes
This class is intended for use with DataFrames whose columns form a two-level index. The first level identifies groups of features, and the association is computed per group. See the documentation on data format for details.
For continuous data, we use Pearson Correlation. For categorical data, we use an ANOVA test between the residuals and the tested series.
__init__(config: dict)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Must contain an entry for: - "categorical_method": str, any of 'f_oneway', 'kruskal', 'alexandergovern'. This specifies the kind of test used for categorical data. - "numerical_method": str, any of 'pearsonr', 'spearmanr'. This specifies the kind of test used for numerical data. - "variable_types": dict, for each group name (first level of the column index), whether it is "numerical" or "categorical". This implies that all columns in a group must belong to the same type (numerical or categorical). See examples. - "n_jobs": int, the number of jobs for parallelism. See joblib.Parallel for details. | required |
Returns:
| Type | Description |
|---|---|
| None | |
Examples:
>>> data = pd.DataFrame(np.random.random(size=(1000,5)),columns=pd.MultiIndex.from_tuples([("target",""),("G1","a"),("G1","b"),("G2","a"),("G2","b")]))
>>> variable_types = dict([(group, "numerical") for group in data.columns.get_level_values(0).unique()])
>>> asso = CrossSectionalAssociation({"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso
Or with mixed types:
>>> numerical = pd.DataFrame(np.random.random(size=(1000,3)),columns=pd.MultiIndex.from_tuples([("target",None),("G1","a"),("G1","b")]))
>>> categorical = pd.DataFrame(np.random.randint(0,5,size=(1000,3)),columns=pd.MultiIndex.from_tuples([("G2","a"),("G2","b"),("G2","c")]))
>>> data = pd.concat([numerical,categorical], axis="columns")
>>> variable_types = {"target":"numerical","G1":"numerical","G2":"categorical"}
>>> asso = CrossSectionalAssociation({"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso
association(residuals_df: pd.DataFrame, variables_df: pd.DataFrame) -> np.ndarray
Computes the association score between the residuals and candidate feature groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| residuals_df | DataFrame | DataFrame of shape (nsamples, 1) containing the model residuals of a learning model. The index must be aligned with variables_df. | required |
| variables_df | DataFrame | DataFrame of shape (nsamples, D) containing the D features to test for association with the residuals. The index must be aligned with residuals_df. The columns must be a pd.MultiIndex instance with two levels. See the documentation on data format for details. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| pvalues | array | A 1D numpy array containing minus the minimal p-value for each group defined by the first level of the column index. The entries are in the same order as the first level of the column index. By convention the p-value is negated, so that the largest returned value corresponds to the strongest association. |
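The group-level score described above can be sketched for numerical groups as below. The helper `neg_min_pvalue_per_group` is hypothetical and assumes that each group's score is minus the smallest Pearson p-value among its columns; it illustrates the convention rather than reproducing the library's implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats


def neg_min_pvalue_per_group(residuals: pd.Series, variables_df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: minus the smallest Pearson p-value within each column group."""
    scores = []
    # unique() keeps the first-level labels in their order of appearance.
    for group in variables_df.columns.get_level_values(0).unique():
        block = variables_df[group]  # all second-level columns under this group
        pvalues = [stats.pearsonr(residuals, block[col])[1] for col in block.columns]
        scores.append(-min(pvalues))
    return np.array(scores)


rng = np.random.default_rng(0)
residuals = pd.Series(rng.normal(size=400))
columns = pd.MultiIndex.from_tuples([("G1", "a"), ("G1", "b"), ("G2", "a"), ("G2", "b")])
variables_df = pd.DataFrame(rng.normal(size=(400, 4)), columns=columns)
# Make one column of G1 track the residuals closely; G2 stays pure noise.
variables_df[("G1", "a")] = residuals + 0.1 * rng.normal(size=400)

scores = neg_min_pvalue_per_group(residuals, variables_df)
```

Here the first entry of `scores` corresponds to `G1` and the second to `G2`, mirroring the order of the first level of the column index.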
Examples:
>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(1000,5)),columns=pd.MultiIndex.from_tuples([("target",""),("G1","a"),("G1","b"),("G2","a"),("G2","b")]))
>>> variable_types = dict([(column, "numerical") for column in data.columns.get_level_values(0).unique()])
>>> asso = CrossSectionalAssociation({"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso.association(data[["target"]], data[["G1","G2"]])
array([-0.32736175, -0.11320393])
Or with mixed types:
>>> rng = np.random.default_rng(0)
>>> numerical = pd.DataFrame(rng.random(size=(1000,3)),columns=pd.MultiIndex.from_tuples([("target",None),("G1","a"),("G1","b")]))
>>> categorical = pd.DataFrame(rng.integers(0,5,size=(1000,3)),columns=pd.MultiIndex.from_tuples([("G2","a"),("G2","b"),("G2","c")]))
>>> data = pd.concat([numerical,categorical], axis="columns")
>>> variable_types = {"target":"numerical","G1":"numerical","G2":"categorical"}
>>> asso = CrossSectionalAssociation({"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso.association(data[["target"]],data[["G1","G2"]])
array([-0.05543262, -0.0992026 ])