Correlation Measures

`c1`(X, y[, normalize])	Calculates the maximum feature correlation to the output (C1) metric.
`c2`(X, y[, normalize])	Calculates the average feature correlation to the output (C2) metric.
`c3`(X, y[, is_optimized, normalize])	Calculates the individual feature efficiency (C3) metric.
`c4`(X, y[, normalize])	Calculates the collective feature efficiency (C4) metric.

problexity.regression.c1(X, y, normalize=True)

Calculates the maximum feature correlation to the output (C1) metric.

The measure returns the maximum absolute Spearman correlation between any single feature and the output. Higher values indicate simpler problems.

\[C1=\max_{j=1,\ldots,d}|\rho(x^j, y)|\]

Parameters:

X (array-like, shape (n_samples, n_features)) – Dataset
y (array-like, shape (n_samples)) – Labels for regression task

Return type:

float

Returns:

C1 score
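
The C1 definition above can be illustrated with a minimal NumPy sketch. This is not problexity's own implementation: the helper names `average_ranks`, `abs_spearman_per_feature` and `c1_sketch` are hypothetical, the `normalize` option is not modeled, and Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import numpy as np

def average_ranks(v):
    # Rank values from 1..n; tied values share the mean of their positions.
    v = np.asarray(v, dtype=float).ravel()
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]

def abs_spearman_per_feature(X, y):
    # |rho(x^j, y)| for every feature j (Pearson correlation of the ranks).
    X = np.asarray(X, dtype=float)
    ry = average_ranks(y)
    return np.array([
        abs(np.corrcoef(average_ranks(X[:, j]), ry)[0, 1])
        for j in range(X.shape[1])
    ])

def c1_sketch(X, y):
    # Illustrative C1: the largest absolute feature-output correlation.
    return float(abs_spearman_per_feature(X, y).max())

# Toy data: feature 0 is monotonically related to y, feature 1 only partly.
X = np.array([[1, 3], [2, 1], [3, 4], [4, 2], [5, 5]], dtype=float)
y = X[:, 0] ** 3
print(c1_sketch(X, y))
```

Because Spearman correlation depends only on ranks, the cubic relation between feature 0 and `y` still counts as a perfect correlation in this toy example.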

problexity.regression.c2(X, y, normalize=True)

Calculates the average feature correlation to the output (C2) metric.

The measure returns the mean absolute Spearman correlation between each feature and the output. Higher values indicate simpler problems.

\[C2=\sum^{d}_{j=1}\frac{|\rho(x^j, y)|}{d}\]

Parameters:

X (array-like, shape (n_samples, n_features)) – Dataset
y (array-like, shape (n_samples)) – Labels for regression task

Return type:

float

Returns:

C2 score
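
As a rough illustration of the C2 formula above, the following self-contained sketch averages the absolute rank correlations over all features. The names `average_ranks` and `c2_sketch` are hypothetical, the `normalize` option is not modeled, and the code is not taken from problexity.

```python
import numpy as np

def average_ranks(v):
    # Rank values from 1..n; tied values share the mean of their positions.
    v = np.asarray(v, dtype=float).ravel()
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]

def c2_sketch(X, y):
    # Illustrative C2: mean over features of |Spearman(x^j, y)|.
    X = np.asarray(X, dtype=float)
    ry = average_ranks(y)
    d = X.shape[1]
    total = 0.0
    for j in range(d):
        rho = np.corrcoef(average_ranks(X[:, j]), ry)[0, 1]
        total += abs(rho)
    return total / d

# Toy data: one feature perfectly rank-correlated with y, one only partly.
X = np.array([[1, 3], [2, 1], [3, 4], [4, 2], [5, 5]], dtype=float)
y = X[:, 0] ** 3
print(c2_sketch(X, y))
```

Unlike the maximum used by C1, the mean is pulled down by every weakly correlated feature, so a single informative feature is not enough to give a high C2.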

problexity.regression.c3(X, y, is_optimized=True, normalize=True)

Calculates the individual feature efficiency (C3) metric.

The measure is based on the number of examples that must be removed to obtain a high correlation between a feature and the output. Samples are removed according to their residuals under a linear regression model. Setting the is_optimized flag enables an optimized algorithm based on a divide-and-conquer strategy.

\[C3=\min_{j=1,\ldots,d}\frac{n^j}{n}\]

Parameters:

X (array-like, shape (n_samples, n_features)) – Dataset
y (array-like, shape (n_samples)) – Labels for regression task

Return type:

float

Returns:

C3 score
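
The removal procedure behind C3 can be sketched naively as follows. Several points here are assumptions rather than facts about problexity: the correlation threshold of 0.9 (the description only says "high"), the one-sample-at-a-time removal of the largest absolute residual, and the hypothetical names `average_ranks`, `abs_spearman` and `c3_sketch`. The optimized divide-and-conquer variant and the `normalize` option are not modeled.

```python
import numpy as np

def average_ranks(v):
    # Rank values from 1..n; tied values share the mean of their positions.
    v = np.asarray(v, dtype=float).ravel()
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]

def abs_spearman(a, b):
    # Absolute Spearman correlation; undefined values (NaN) count as 0.
    rho = np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]
    return abs(np.nan_to_num(rho))

def c3_sketch(X, y, threshold=0.9):
    # threshold=0.9 is an assumed stand-in for "high correlation".
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    fractions = []
    for j in range(d):
        x, t = X[:, j].copy(), y.copy()
        removed = 0
        # Drop the worst-fitting sample until the correlation is high enough.
        while x.size > 2 and abs_spearman(x, t) <= threshold:
            slope, intercept = np.polyfit(x, t, 1)
            residuals = np.abs(t - (slope * x + intercept))
            worst = int(np.argmax(residuals))
            x, t = np.delete(x, worst), np.delete(t, worst)
            removed += 1
        fractions.append(removed / n)  # n^j / n for feature j
    return min(fractions)

# Toy data: y follows the single feature except for one outlier.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = X[:, 0].copy()
y[-1] = -100.0
print(c3_sketch(X, y))
```

In this toy case a single removal (the outlier) restores a monotone relation, so one sample out of ten is counted for the only feature.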

problexity.regression.c4(X, y, normalize=True)

Calculates the collective feature efficiency (C4) metric.

The measure sequentially analyzes the features with the greatest correlation to the output until all features have been used or all instances have been removed. At each step, samples with a low residual value are removed. The metric is computed from the number of samples remaining after the removal procedure. By default, 0-1 interval normalization is used. The procedure is limited to 1000 iterations.

\[C4=\frac{\#\{x_i \mid |\epsilon_i|>0.1\}_{T_l}}{n}\]

Parameters:

X (array-like, shape (n_samples, n_features)) – Dataset
y (array-like, shape (n_samples)) – Labels for regression task

Return type:

float

Returns:

C4 score
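
The sequential procedure described for C4 might look roughly like the sketch below. It is an illustration under stated assumptions, not problexity's code: inputs are assumed to be already scaled to the 0-1 interval, a simple one-feature linear fit is used at each step, the loop stops early when fewer than three samples remain, and the 1000-iteration limit is not modeled. The names `average_ranks`, `abs_spearman` and `c4_sketch` are hypothetical.

```python
import numpy as np

def average_ranks(v):
    # Rank values from 1..n; tied values share the mean of their positions.
    v = np.asarray(v, dtype=float).ravel()
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]

def abs_spearman(a, b):
    # Absolute Spearman correlation; undefined values (NaN) count as 0.
    rho = np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]
    return abs(np.nan_to_num(rho))

def c4_sketch(X, y, eps=0.1):
    # eps matches the 0.1 residual threshold in the formula above.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    remaining = np.arange(n)      # indices of samples not yet removed
    unused = list(range(d))       # features not yet analyzed
    while unused and remaining.size >= 3:
        Xr, yr = X[remaining], y[remaining]
        # Pick the unused feature most correlated with the output.
        best = max(unused, key=lambda j: abs_spearman(Xr[:, j], yr))
        unused.remove(best)
        slope, intercept = np.polyfit(Xr[:, best], yr, 1)
        residuals = np.abs(yr - (slope * Xr[:, best] + intercept))
        # Keep only the samples the linear fit does not explain well.
        remaining = remaining[residuals > eps]
    return remaining.size / n

# Toy data: the output is an exact linear function of the only feature.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 0.5 * X[:, 0] + 0.2
print(c4_sketch(X, y))
```

With an exactly linear relation every residual falls below the threshold after the first fit, so no samples remain and the sketch returns zero.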