
Partial Correlation

TemporalSlowHk

Bases: PartialCorrelation

Partial correlation for mixed-type data, used during the equivalence phase.

Notes

Given residuals denoted R, a candidate variable Ca, and a condition variable Co, and writing Ca_i for lag i of Ca and Co_j for lag j of Co, this method computes:
1) the p-value of R indep Ca_i, for all i
2) the p-value of R indep Co_j, for all j
3) the indices i1,...,ik of the lags Ca_i with the strongest association with R (smallest p-values)
4) the indices j1,...,jk of the lags Co_j with the strongest association with R (smallest p-values)
5) the p-value of R indep Ca_iu | Co_jv, for iu in {i1,...,ik} and jv in {j1,...,jk}
6) the p-value of R indep Co_jv | Ca_iu, for iu in {i1,...,ik} and jv in {j1,...,jk}
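
The six steps can be sketched for purely numerical data as follows. This is a minimal illustration, not the class's implementation: it assumes Pearson correlation for the marginal tests, a regression-based Pearson partial correlation for the conditional tests, lags numbered 1 to `lags`, and retained lags kept in lag order. The helper names `lag_matrix`, `partial_pvalue` and `sketch_temporal_hk` are hypothetical, and categorical variables are not handled.

```python
import numpy as np
from scipy import stats


def lag_matrix(x: np.ndarray, lags: int) -> np.ndarray:
    """Column i-1 holds lag i of x; rows without a full history are dropped."""
    n = len(x)
    return np.column_stack([x[lags - i : n - i] for i in range(1, lags + 1)])


def partial_pvalue(r: np.ndarray, x: np.ndarray, z: np.ndarray) -> float:
    """Two-sided p-value of the Pearson partial correlation of r and x given z."""
    design = np.column_stack([np.ones_like(z), z])
    # Regress z out of both variables, then correlate what is left.
    res_r = r - design @ np.linalg.lstsq(design, r, rcond=None)[0]
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    rho = np.corrcoef(res_r, res_x)[0, 1]
    dof = len(r) - 3  # n - 2 - (one conditioning variable)
    t_stat = rho * np.sqrt(dof / (1.0 - rho**2))
    return float(2.0 * stats.t.sf(abs(t_stat), dof))


def sketch_temporal_hk(residuals, candidate, condition, lags: int, k: int):
    r = np.asarray(residuals, dtype=float)[lags:]
    ca = lag_matrix(np.asarray(candidate, dtype=float), lags)
    co = lag_matrix(np.asarray(condition, dtype=float), lags)
    # Steps 1 and 2: marginal p-values for every lag.
    p_ca = np.array([stats.pearsonr(r, ca[:, i])[1] for i in range(lags)])
    p_co = np.array([stats.pearsonr(r, co[:, j])[1] for j in range(lags)])
    # Steps 3 and 4: the k lags with the smallest p-values (lag order is an assumption).
    top_ca = np.sort(np.argsort(p_ca)[:k])
    top_co = np.sort(np.argsort(p_co)[:k])
    # Step 5: R indep Ca_iu | Co_jv, rows indexed by Ca lags.
    p_rca_co = np.array(
        [[partial_pvalue(r, ca[:, i], co[:, j]) for j in top_co] for i in top_ca]
    )
    # Step 6: R indep Co_jv | Ca_iu, rows indexed by Co lags.
    p_rco_ca = np.array(
        [[partial_pvalue(r, co[:, j], ca[:, i]) for i in top_ca] for j in top_co]
    )
    return p_rca_co, p_rco_ca, p_ca[top_ca], p_co[top_co]


rng = np.random.default_rng(0)
r, a, c = rng.random((3, 300))
p_RCa_Co, p_RCo_Ca, p_RCa, p_RCo = sketch_temporal_hk(r, a, c, lags=5, k=2)
print(p_RCa_Co.shape, p_RCa.shape)  # (2, 2) (2,)
```

The returned tuple follows the same order and shapes as documented for partial_corr; the p-values themselves will differ from the class's output wherever the assumptions above do not match its actual tests.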

__init__(config: dict)

Initialize the partial correlation object.

Parameters:

Name Type Description Default
config dict

Must contain an entry for:
- "lags": int, the number of lags to compute the correlation over.
- "categorical_method": str, any of 'f_oneway', 'kruskal', 'alexandergovern'. This specifies the kind of test used for categorical data.
- "variable_types": dict, for each variable name, whether it is "numerical" or "categorical". See examples.
- "k": int, the number of lags to consider for equivalence computation. "k" must be a positive integer, less than or equal to "lags".

required
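
The constraints on config can be checked before constructing the object. The function below is a hypothetical pre-flight check written from the description above; whether the class performs the same validation, and which exceptions it raises, is not stated here.

```python
def check_temporal_config(config: dict) -> None:
    """Hypothetical check mirroring the documented constraints on `config`."""
    required = {"lags", "categorical_method", "variable_types", "k"}
    missing = required - config.keys()
    if missing:
        raise KeyError(f"missing config entries: {sorted(missing)}")

    methods = {"f_oneway", "kruskal", "alexandergovern"}
    if config["categorical_method"] not in methods:
        raise ValueError(f"categorical_method must be one of {sorted(methods)}")

    kinds = {"numerical", "categorical"}
    bad = {name: kind for name, kind in config["variable_types"].items() if kind not in kinds}
    if bad:
        raise ValueError(f"unknown variable types: {bad}")

    k, lags = config["k"], config["lags"]
    if not isinstance(k, int) or not 0 < k <= lags:
        raise ValueError("k must be a positive integer, at most equal to lags")


config = {
    "lags": 10,
    "categorical_method": "f_oneway",
    "variable_types": {"target": "numerical", "1": "categorical"},
    "k": 2,
}
check_temporal_config(config)  # passes silently
```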

Returns:

Type Description
None

Examples:

>>> data = pd.DataFrame(np.random.random(size=(1000,5)),columns=["target","1","2","3","4"])
>>> variable_types = dict([(column, "numerical") for column in data.columns])
>>> asso = TemporalSlowHk({"lags":10,"categorical_method":"f_oneway","variable_types":variable_types,"k":2})

Or with mixed types:

>>> variable_types = {"target":"numerical","1":"numerical","2":"numerical","3":"categorical","4":"categorical"}
>>> asso = TemporalSlowHk({"lags":10,"categorical_method":"f_oneway","variable_types":variable_types,"k":2})

partial_corr(residuals_df: pd.DataFrame, candidate_df: pd.DataFrame, condition_df: pd.DataFrame) -> tuple[np.array, np.array, np.array, np.array]

Computes the partial correlations between lags.

Parameters:

Name Type Description Default
residuals_df DataFrame

DataFrame of shape (ntimesteps, 1) containing the model residuals of a learning model.

required
candidate_df DataFrame

DataFrame of shape (ntimesteps, 1) containing the candidate time series, one of the two univariate time series to test for equivalence. The index must be aligned with residuals_df.

required
condition_df DataFrame

DataFrame of shape (ntimesteps, 1) containing the condition time series, the other of the two univariate time series to test for equivalence. The index must be aligned with residuals_df.

required
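
The shape and alignment requirements on the three inputs can be verified up front. The sketch below assumes that "aligned" means the indexes are identical, which is an interpretation rather than something stated here; `check_temporal_inputs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def check_temporal_inputs(residuals_df, candidate_df, condition_df) -> None:
    """Hypothetical check: one column each, same index as residuals_df."""
    frames = {
        "residuals_df": residuals_df,
        "candidate_df": candidate_df,
        "condition_df": condition_df,
    }
    for name, frame in frames.items():
        if frame.shape[1] != 1:
            raise ValueError(f"{name} must have exactly one column, got {frame.shape[1]}")
        if not frame.index.equals(residuals_df.index):
            raise ValueError(f"{name} index is not aligned with residuals_df")


rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random(size=(50, 3)), columns=["target", "1", "2"])
check_temporal_inputs(data[["target"]], data[["1"]], data[["2"]])  # passes silently
```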

Returns:

Name Type Description
p_RCa_Co array

A 2D numpy array of shape (k,k). It contains the p-values of the tests (R indep Ca_i | Co_j), with R the residuals, Ca the candidate time series, and Co the condition time series. The first dimension corresponds to a retained lag of Ca, the second dimension to a retained lag of Co.

p_RCo_Ca array

A 2D numpy array of shape (k,k). It contains the p-values of the tests (R indep Co_j | Ca_i). The first dimension corresponds to a retained lag of Co, the second dimension to a retained lag of Ca.

p_RCa array

A 1D numpy array of shape (k,). It contains the p-values of the tests (R indep Ca_i) for the retained lags of Ca.

p_RCo array

A 1D numpy array of shape (k,). It contains the p-values of the tests (R indep Co_j) for the retained lags of Co.

Examples:

>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(1000,5)),columns=["target","1","2","3","4"])
>>> variable_types = dict([(column, "numerical") for column in data.columns])
>>> asso = TemporalSlowHk({"lags":10,"k":2,"categorical_method":"f_oneway","variable_types":variable_types})
>>> asso.partial_corr(data[["target"]],data[["1"]],data[["2"]])
(array([[0.07208953, 0.05627934],
        [0.03686298, 0.04137501]]),
array([[0.09649547, 0.10936624],
        [0.02173326, 0.03464236]]),
array([0.07455153, 0.03384917]),
array([0.09990165, 0.02838155]))
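
To make the indexing convention of the returned tuple concrete, the snippet below copies the four arrays printed above into numpy and reads individual entries. It does not call the class; it only illustrates which axis refers to which series.

```python
import numpy as np

# Values copied from the example output above.
p_RCa_Co = np.array([[0.07208953, 0.05627934],
                     [0.03686298, 0.04137501]])
p_RCo_Ca = np.array([[0.09649547, 0.10936624],
                     [0.02173326, 0.03464236]])
p_RCa = np.array([0.07455153, 0.03384917])
p_RCo = np.array([0.09990165, 0.02838155])

# p_RCa_Co[u, v] is the test R indep Ca_iu | Co_jv (rows: Ca, columns: Co).
u, v = (int(i) for i in np.unravel_index(np.argmin(p_RCa_Co), p_RCa_Co.shape))
print(u, v)  # 1 0

# The mirrored test R indep Co_jv | Ca_iu lives in p_RCo_Ca[v, u] (rows: Co, columns: Ca).
print(p_RCo_Ca[v, u])  # 0.10936624

# Marginal p-value of the same retained Ca lag, next to its conditional p-values.
print(p_RCa[u], p_RCa_Co[u])
```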

CrossSectionalHk

Bases: PartialCorrelation

Partial correlation for non-temporal, mixed type, grouped data, used during the equivalence phase.

Notes

This class is intended for use with DataFrames that have a two-level column index. The first level corresponds to groups of features, over which the association is computed. See the documentation on data format for details.
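
As an illustration of the expected layout, the snippet below builds a small DataFrame with a two-level column index and selects one group while keeping both levels. The group and feature names are placeholders chosen for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
columns = pd.MultiIndex.from_tuples(
    [("target", ""), ("G1", "a"), ("G1", "b"), ("G2", "a"), ("G2", "b"), ("G2", "c")]
)
data = pd.DataFrame(rng.random(size=(100, 6)), columns=columns)

# Level 0 holds the group names, level 1 the features inside each group.
groups = list(data.columns.get_level_values(0).unique())
print(groups)  # ['target', 'G1', 'G2']

# Selecting with a list keeps both column levels, with a single group at level 0.
g1 = data[["G1"]]
print(g1.shape, g1.columns.nlevels)  # (100, 2) 2
```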

Given residuals denoted R, a candidate group Ca, and a condition group Co, and writing Ca_i for feature i of group Ca and Co_j for feature j of group Co, this method computes:
1) the p-value of R indep Ca_i, for all i
2) the p-value of R indep Co_j, for all j
3) the indices i1,...,ik of the features Ca_i with the strongest association with R (smallest p-values)
4) the indices j1,...,jk of the features Co_j with the strongest association with R (smallest p-values)
5) the p-value of R indep Ca_iu | Co_jv, for iu in {i1,...,ik} and jv in {j1,...,jk}
6) the p-value of R indep Co_jv | Ca_iu, for iu in {i1,...,ik} and jv in {j1,...,jk}

__init__(config: dict)

Initialize the partial correlation object.

Parameters:

Name Type Description Default
config dict

Must contain an entry for:
- "categorical_method": str, any of 'f_oneway', 'kruskal', 'alexandergovern'. This specifies the kind of test used for categorical data.
- "variable_types": dict, for each group name (first level of the column index), whether it is "numerical" or "categorical". See examples.
- "k": int, the number of features to consider for equivalence computation. If a group has fewer than k features, all features are considered. "k" must be a positive integer.

required
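
The rule for "k" when a group has fewer than k features can be sketched as below. `retained_indices` is a hypothetical helper; keeping the retained indices in their original feature order is an assumption of this sketch, not a documented property of the class.

```python
import numpy as np


def retained_indices(pvalues: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k smallest p-values, or of all of them if fewer than k exist."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slicing past the end of the array simply returns every index.
    return np.sort(np.argsort(pvalues)[:k])


pvalues = np.array([0.38, 0.0013, 0.2])
print(retained_indices(pvalues, k=2))  # [1 2]
print(retained_indices(pvalues, k=5))  # [0 1 2]
```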

Returns:

Type Description
None

Examples:

>>> data = pd.DataFrame(np.random.random(size=(1000,7)),columns=pd.MultiIndex.from_tuples([("target",""),("G1","a"),("G1","b"),("G1","c"),("G2","a"),("G2","b"),("G2","c")]))
>>> variable_types = dict([(group, "numerical") for group in data.columns.get_level_values(0).unique()])
>>> parcorr = CrossSectionalHk({"categorical_method":"f_oneway","variable_types":variable_types,"k":2})
>>> parcorr

Or with mixed types:

>>> numerical = pd.DataFrame(np.random.random(size=(1000,4)),columns=pd.MultiIndex.from_tuples([("target",None),("G1","a"),("G1","b"),("G1","c")]))
>>> categorical = pd.DataFrame(np.random.randint(0,5,size=(1000,3)),columns=pd.MultiIndex.from_tuples([("G2","a"),("G2","b"),("G2","c")]))
>>> data = pd.concat([numerical,categorical], axis="columns")
>>> variable_types = {"target":"numerical","G1":"numerical","G2":"categorical"}
>>> parcorr =  CrossSectionalHk({"categorical_method":"f_oneway","variable_types":variable_types,"k":2})
>>> parcorr

partial_corr(residuals_df: pd.DataFrame, candidate_df: pd.DataFrame, condition_df: pd.DataFrame) -> tuple[np.array, np.array, np.array, np.array]

Computes the partial correlations between features of two different groups.

Parameters:

Name Type Description Default
residuals_df DataFrame

DataFrame of shape (nsamples, 1) containing the model residuals of a learning model.

required
candidate_df DataFrame

DataFrame of shape (nsamples, groupsize1) containing the first (candidate) group. The index must be aligned with residuals_df. The column index must have two levels, with a single group at level 0.

required
condition_df DataFrame

DataFrame of shape (nsamples, groupsize2) containing the second (condition) group. The index must be aligned with residuals_df. The column index must have two levels, with a single group at level 0.

required

Returns:

Name Type Description
p_RCa_Co array

A 2D numpy array of shape (k,k). It contains the p-values of the tests (R indep Ca_i | Co_j), with R the residuals, Ca the candidate group, and Co the condition group. The first dimension corresponds to a retained feature of Ca, the second dimension to a retained feature of Co.

p_RCo_Ca array

A 2D numpy array of shape (k,k). It contains the p-values of the tests (R indep Co_j | Ca_i). The first dimension corresponds to a retained feature of Co, the second dimension to a retained feature of Ca.

p_RCa array

A 1D numpy array of shape (k,). It contains the p-values of the tests (R indep Ca_i) for the retained features of Ca.

p_RCo array

A 1D numpy array of shape (k,). It contains the p-values of the tests (R indep Co_j) for the retained features of Co.

Examples:

>>> rng = np.random.default_rng(0)
>>> data = pd.DataFrame(rng.random(size=(1000,7)),columns=pd.MultiIndex.from_tuples([("target",""),("G1","a"),("G1","b"),("G1","c"),("G2","a"),("G2","b"),("G2","c")]))
>>> variable_types = dict([(group, "numerical") for group in data.columns.get_level_values(0).unique()])
>>> parcorr = CrossSectionalHk({"categorical_method":"f_oneway","variable_types":variable_types,"k":2})
>>> parcorr.partial_corr(data[["target"]], data[["G1"]], data[["G2"]])
(array([[0.36888558, 0.33352269],
        [0.0014652 , 0.0013898 ]]),
array([[0.2173927 , 0.25930918],
        [0.07060479, 0.08380666]]),
array([0.3808198 , 0.00130095]),
array([0.22332839, 0.07809404]))

Or with mixed types:

>>> rng = np.random.default_rng(0)
>>> numerical = pd.DataFrame(rng.random(size=(1000,4)),columns=pd.MultiIndex.from_tuples([("target",None),("G1","a"),("G1","b"),("G1","c")]))
>>> categorical = pd.DataFrame(rng.integers(0,5,size=(1000,3)),columns=pd.MultiIndex.from_tuples([("G2","a"),("G2","b"),("G2","c")]))
>>> data = pd.concat([numerical,categorical], axis="columns")
>>> variable_types = {"target":"numerical","G1":"numerical","G2":"categorical"}
>>> parcorr =  CrossSectionalHk({"categorical_method":"f_oneway","variable_types":variable_types,"k":2})
>>> parcorr.partial_corr(data[["target"]], data[["G1"]], data[["G2"]])
(array([[0.59614113, 0.59506099],
        [0.03118282, 0.03196968]]),
array([[0.73737096, 0.70131846],
        [0.60607526, 0.57785402]]),
array([0.57745635, 0.03535778]),
array([0.5959402 , 0.45854079]))