| Type: | Package | 
| Title: | Predictive Power Score | 
| Version: | 0.0.5 | 
| Description: | The Predictive Power Score (PPS) is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). PPS can be useful for data exploration purposes, in the same way correlation analysis is. For more information on PPS, see https://github.com/paulvanderlaken/ppsr. | 
| License: | GPL (≥ 3) | 
| Encoding: | UTF-8 | 
| Suggests: | testthat (≥ 2.0.0) | 
| Config/testthat/edition: | 3 | 
| Config/testthat/parallel: | true | 
| RoxygenNote: | 7.2.3 | 
| Imports: | ggplot2 (≥ 3.3.3), parsnip (≥ 0.1.5), rpart (≥ 4.1.15), withr (≥ 2.4.1), gridExtra (≥ 2.3) | 
| NeedsCompilation: | no | 
| Packaged: | 2024-02-18 11:57:33 UTC; pvdl | 
| Author: | Paul van der Laken [aut, cre, cph] | 
| Maintainer: | Paul van der Laken <paulvanderlaken@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2024-02-18 12:30:02 UTC | 
ppsr: An R implementation of the Predictive Power Score (PPS)
Description
The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).
Lists all algorithms currently supported
Description
Lists all algorithms currently supported
Usage
available_algorithms()
Value
a list of all available parsnip engines
Examples
available_algorithms()
Lists all evaluation metrics currently supported
Description
Lists all evaluation metrics currently supported
Usage
available_evaluation_metrics()
Value
a list of all available evaluation metrics and their implementation in functional form
Examples
available_evaluation_metrics()
Normalizes the original score compared to a naive baseline score. The calculation that's being performed depends on the type of model.
Description
Normalizes the original score compared to a naive baseline score. The calculation that's being performed depends on the type of model.
Usage
normalize_score(baseline_score, model_score, type)
Arguments
baseline_score | 
 float, the evaluation metric score for a naive baseline (model)  | 
model_score | 
 float, the evaluation metric score for a statistical model  | 
type | 
 character, type of model  | 
Value
numeric vector of length one, normalized score
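The idea of normalizing against a naive baseline can be illustrated with a minimal base R sketch. The helper name `normalize_sketch` and both formulas are assumptions made for illustration (an error metric such as MAE for regression, a metric bounded by 1 such as F1 for classification); the package's own calculation may differ.

```r
# Hypothetical helper for illustration; not the package's implementation.
normalize_sketch <- function(baseline_score, model_score, type) {
  if (type == "regression") {
    # Assumes an error metric (lower is better), e.g. MAE:
    # relative error reduction compared to the naive baseline.
    normalized <- 1 - model_score / baseline_score
  } else if (type == "classification") {
    # Assumes a metric bounded by 1 (higher is better), e.g. F1:
    # share of the headroom above the baseline that the model captures.
    normalized <- (model_score - baseline_score) / (1 - baseline_score)
  } else {
    stop("type must be 'regression' or 'classification'")
  }
  # A model that does worse than the baseline is floored at 0.
  max(normalized, 0)
}

normalize_sketch(baseline_score = 10, model_score = 5, type = "regression")
normalize_sketch(baseline_score = 0.5, model_score = 0.75, type = "classification")
```

Under these assumptions, a model that halves the baseline error and a model that closes half the gap to a perfect score both end up at 0.5, which is what keeps scores comparable across model types.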
Calculate predictive power score for x on y
Description
Calculate predictive power score for x on y
Usage
score(
  df,
  x,
  y,
  algorithm = "tree",
  metrics = list(regression = "MAE", classification = "F1_weighted"),
  cv_folds = 5,
  seed = 1,
  verbose = TRUE
)
Arguments
df | 
 data.frame containing columns for x and y  | 
x | 
 string, column name of predictor variable  | 
y | 
 string, column name of target variable  | 
algorithm | 
 string, name of the algorithm to use, see available_algorithms()  | 
metrics | 
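 named list of evaluation metrics to use, one for regression and one for classification, see available_evaluation_metrics()  | 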
cv_folds | 
 float, number of cross-validation folds  | 
seed | 
 float, seed to ensure reproducibility/stability  | 
verbose | 
 boolean, whether to print notifications  | 
Value
a named list, potentially containing
- x
 the name of the predictor variable
- y
 the name of the target variable
- result_type
 text showing how to interpret the resulting score
- pps
 the predictive power score
- metric
 the evaluation metric used to compute the PPS
- baseline_score
 the score of a naive model on the evaluation metric
- model_score
 the score of the predictive model on the evaluation metric
- cv_folds
 how many cross-validation folds were used
- seed
 the seed that was set
- algorithm
 text showing what algorithm was used
- model_type
 text showing whether classification or regression was used
Examples
score(iris, x = 'Petal.Length', y = 'Species')
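Because the score is asymmetric, it can help to compute both directions for a pair of columns. The following is a small sketch; the helper `score_both_directions` is hypothetical and only wraps two calls to score(), reading the `pps` element documented above.

```r
# Hypothetical helper: score a pair of columns in both directions.
score_both_directions <- function(df, a, b) {
  forward <- ppsr::score(df, x = a, y = b, verbose = FALSE)
  backward <- ppsr::score(df, x = b, y = a, verbose = FALSE)
  # Each result is a named list; pps holds the score itself.
  c(forward = forward$pps, backward = backward$pps)
}

# Only run the example when the package is installed.
if (requireNamespace("ppsr", quietly = TRUE)) {
  print(score_both_directions(iris, "Petal.Length", "Species"))
}
```

The two values need not be equal: predicting a categorical target from a numeric column is a classification problem, while the reverse is a regression problem.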
Calculate correlation coefficients for whole dataframe
Description
Calculate correlation coefficients for whole dataframe
Usage
score_correlations(df, ...)
Arguments
df | 
 data.frame containing columns for x and y  | 
... | 
 additional arguments passed on to the underlying correlation calculation  | 
Value
a data.frame with x-y correlation coefficients
Examples
score_correlations(iris)
Calculate predictive power scores for whole dataframe
Iterates through the columns of the dataframe, calculating the predictive power
score for every possible combination of x and y.
Description
Calculate predictive power scores for whole dataframe
Iterates through the columns of the dataframe, calculating the predictive power
score for every possible combination of x and y.
Usage
score_df(df, ..., do_parallel = FALSE, n_cores = -1)
Arguments
df | 
 data.frame containing columns for x and y  | 
... | 
 any arguments passed to score()  | 
do_parallel | 
 bool, whether to perform the score calculations in parallel  | 
n_cores | 
 numeric, number of cores to use, defaults to maximum minus 1  | 
Value
a data.frame containing
- x
 the name of the predictor variable
- y
 the name of the target variable
- result_type
 text showing how to interpret the resulting score
- pps
 the predictive power score
- metric
 the evaluation metric used to compute the PPS
- baseline_score
 the score of a naive model on the evaluation metric
- model_score
 the score of the predictive model on the evaluation metric
- cv_folds
 how many cross-validation folds were used
- seed
 the seed that was set
- algorithm
 text showing what algorithm was used
- model_type
 text showing whether classification or regression was used
Examples
score_df(iris)
score_df(mtcars, do_parallel = TRUE, n_cores = 2)
Calculate predictive power score matrix
Iterates through the columns of the dataset, calculating the predictive power
score for every possible combination of x and y.
Description
Note that the targets are on the rows, and the features on the columns.
Usage
score_matrix(df, ...)
Arguments
df | 
 data.frame containing columns for x and y  | 
... | 
 any arguments passed to score_df(), e.g. do_parallel and n_cores  | 
Value
a matrix of numeric values, representing predictive power scores
Examples
score_matrix(iris)
score_matrix(mtcars, do_parallel = TRUE, n_cores = 2)
Calculates out-of-sample model performance of a statistical model
Description
Calculates out-of-sample model performance of a statistical model
Usage
score_model(train, test, model, x, y, metric)
Arguments
train | 
 df, training data, containing variable y  | 
test | 
 df, test data, containing variable y  | 
model | 
 parsnip model object, with mode preset  | 
x | 
 character, column name of predictor variable  | 
y | 
 character, column name of target variable  | 
metric | 
 character, name of evaluation metric being used, see available_evaluation_metrics()  | 
Value
numeric vector of length one, evaluation score for predictions using the statistical model
Calculate out-of-sample model performance of naive baseline model. The calculation that's being performed depends on the type of model. For regression models, the mean is used as prediction. For classification, a model predicting random values and a model predicting modal values are used, and the best model is taken as baseline score.
Description
Calculate out-of-sample model performance of naive baseline model. The calculation that's being performed depends on the type of model. For regression models, the mean is used as prediction. For classification, a model predicting random values and a model predicting modal values are used, and the best model is taken as baseline score.
Usage
score_naive(train, test, x, y, type, metric)
Arguments
train | 
 df, training data, containing variable y  | 
test | 
 df, test data, containing variable y  | 
x | 
 character, column name of predictor variable  | 
y | 
 character, column name of target variable  | 
type | 
 character, type of model  | 
metric | 
 character, evaluation metric being used  | 
Value
numeric vector of length one, evaluation score for predictions using naive model
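The naive baselines described above can be sketched in base R. This is a simplified illustration under stated assumptions: the helper `naive_baseline_sketch` is hypothetical, it drops the x and metric arguments, and it uses MAE for regression and plain accuracy for classification rather than the package's configurable metrics.

```r
# Hypothetical helper illustrating the idea; not the package's code.
naive_baseline_sketch <- function(train, test, y, type) {
  if (type == "regression") {
    # Predict the training mean for every test row and report the MAE.
    prediction <- mean(train[[y]])
    return(mean(abs(test[[y]] - prediction)))
  }
  # Classification: compare a modal-class model with a random-guess model.
  classes <- as.character(train[[y]])
  modal_class <- names(which.max(table(classes)))
  random_guess <- sample(unique(classes), size = nrow(test), replace = TRUE)
  truth <- as.character(test[[y]])
  accuracy_modal <- mean(truth == modal_class)
  accuracy_random <- mean(truth == random_guess)
  # The stronger of the two naive models sets the baseline.
  max(accuracy_modal, accuracy_random)
}

set.seed(1)
test_rows <- sample(nrow(iris), size = 30)
train <- iris[-test_rows, ]
test <- iris[test_rows, ]

naive_baseline_sketch(train, test, y = "Sepal.Length", type = "regression")
naive_baseline_sketch(train, test, y = "Species", type = "classification")
```

Taking the better of the two classification baselines makes the bar harder to clear, so a positive normalized score reflects a real gain over guessing.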
Calculate predictive power scores for y
Calculates the predictive power scores for the specified y variable
using every column in the dataset as x, including itself.
Description
Calculate predictive power scores for y
Calculates the predictive power scores for the specified y variable
using every column in the dataset as x, including itself.
Usage
score_predictors(df, y, ..., do_parallel = FALSE, n_cores = -1)
Arguments
df | 
 data.frame containing columns for x and y  | 
y | 
 string, column name of target variable  | 
... | 
 any arguments passed to score()  | 
do_parallel | 
 bool, whether to perform the score calculations in parallel  | 
n_cores | 
 numeric, number of cores to use, defaults to maximum minus 1  | 
Value
a data.frame containing
- x
 the name of the predictor variable
- y
 the name of the target variable
- result_type
 text showing how to interpret the resulting score
- pps
 the predictive power score
- metric
 the evaluation metric used to compute the PPS
- baseline_score
 the score of a naive model on the evaluation metric
- model_score
 the score of the predictive model on the evaluation metric
- cv_folds
 how many cross-validation folds were used
- seed
 the seed that was set
- algorithm
 text showing what algorithm was used
- model_type
 text showing whether classification or regression was used
Examples
score_predictors(df = iris, y = 'Species')
score_predictors(df = mtcars, y = 'mpg', do_parallel = TRUE, n_cores = 2)
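Since the returned data.frame includes the target itself as a predictor, a common follow-up is to drop that row and rank the rest. A minimal sketch, in which the helper `rank_predictors` is hypothetical and relies only on the `x` and `pps` columns documented above:

```r
# Hypothetical helper: rank predictors of y by their predictive power score.
rank_predictors <- function(df, y) {
  scores <- ppsr::score_predictors(df = df, y = y)
  # Drop the target's own row, then sort from strongest to weakest predictor.
  scores <- scores[scores$x != y, ]
  scores[order(scores$pps, decreasing = TRUE), c("x", "pps")]
}

# Only run the example when the package is installed.
if (requireNamespace("ppsr", quietly = TRUE)) {
  print(rank_predictors(iris, y = "Species"))
}
```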
Visualize the PPS & correlation matrices
Description
Visualize the PPS & correlation matrices
Usage
visualize_both(
  df,
  color_value_positive = "#08306B",
  color_value_negative = "#8b0000",
  color_text = "#FFFFFF",
  include_missings = TRUE,
  nrow = 1,
  ...
)
Arguments
df | 
 data.frame containing columns for x and y  | 
color_value_positive | 
 color used for upper limit of gradient (high positive correlation)  | 
color_value_negative | 
 color used for lower limit of gradient (high negative correlation)  | 
color_text | 
 string, hex value or color name used for text, best to pick high contrast with the gradient colors  | 
include_missings | 
 bool, whether to include the variables without correlation values in the plot  | 
nrow | 
 numeric, number of rows, either 1 or 2  | 
... | 
 any arguments passed on to the scoring functions, e.g. do_parallel and n_cores  | 
Value
a grob object, a grid with two ggplot2 heatmap visualizations
Examples
visualize_both(iris)
visualize_both(mtcars, do_parallel = TRUE, n_cores = 2)
Visualize the correlation matrix
Description
Visualize the correlation matrix
Usage
visualize_correlations(
  df,
  color_value_positive = "#08306B",
  color_value_negative = "#8b0000",
  color_text = "#FFFFFF",
  include_missings = FALSE,
  ...
)
Arguments
df | 
 data.frame containing columns for x and y  | 
color_value_positive | 
 color used for upper limit of gradient (high positive correlation)  | 
color_value_negative | 
 color used for lower limit of gradient (high negative correlation)  | 
color_text | 
 color used for text, best to pick high contrast with the gradient colors  | 
include_missings | 
 bool, whether to include the variables without correlation values in the plot  | 
... | 
 additional arguments passed on to the underlying correlation calculation  | 
Value
a ggplot object, a heatmap visualization
Examples
visualize_correlations(iris)
Visualize the Predictive Power scores of the entire dataframe, or given a target
Description
If y is specified, visualize_pps returns a barplot of the PPS of
every predictor on the specified target variable.
If y is not specified, visualize_pps returns a heatmap visualization
of the PPS for all X-Y combinations in a dataframe.
Usage
visualize_pps(
  df,
  y = NULL,
  color_value_high = "#08306B",
  color_value_low = "#FFFFFF",
  color_text = "#FFFFFF",
  include_target = TRUE,
  ...
)
Arguments
df | 
 data.frame containing columns for x and y  | 
y | 
 string, column name of target variable, can be left NULL to visualize all X-Y combinations  | 
color_value_high | 
 string, hex value or color name used for upper limit of PPS gradient (high PPS)  | 
color_value_low | 
 string, hex value or color name used for lower limit of PPS gradient (low PPS)  | 
color_text | 
 string, hex value or color name used for text, best to pick high contrast with color_value_high  | 
include_target | 
 boolean, whether to include the target variable in the barplot  | 
... | 
 any arguments passed on to the scoring functions, e.g. do_parallel and n_cores  | 
Value
a ggplot object, a vertical barplot or heatmap visualization
Examples
visualize_pps(iris, y = 'Species')
visualize_pps(iris)
visualize_pps(mtcars, do_parallel = TRUE, n_cores = 2)
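Because visualize_pps returns a ggplot object, it should be possible to extend the plot with regular ggplot2 layers. A brief sketch, assuming the returned object accepts additional layers in the usual way; the title text is only an example:

```r
# Only run the example when the package is installed.
if (requireNamespace("ppsr", quietly = TRUE)) {
  # Barplot of every predictor's PPS for Species, leaving out the target itself.
  p <- ppsr::visualize_pps(iris, y = "Species", include_target = FALSE)
  # Add a title using a standard ggplot2 layer.
  p <- p + ggplot2::labs(title = "PPS of each predictor for Species")
}
```

Printing `p` (or calling `ggplot2::ggsave()` on it) then renders the customized plot.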