Title: | Machine Learning Model Evaluation for 'h2o' Package |
---|---|
Description: | In model selection, the common practice is to select the model with the highest performance. However, fine-tuning may produce multiple models with negligible performance differences. This software provides a statistical procedure for comparing the performance of machine learning models, using a bootstrapping technique to assess whether the differences between models' performances are significant. Additionally, it offers extra performance metrics such as the F-Measure and additional functionalities for working with the H2O AI software package. For more information about the 'h2o' package, visit https://h2o.ai/. |
Authors: | E. F. Haghish [aut, cre, cph] |
Maintainer: | E. F. Haghish <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4 |
Built: | 2024-11-22 04:36:39 UTC |
Source: | https://github.com/haghish/h2otools |
Extracts models' parameters from AutoML grid
automlModelParam(model)
model |
an h2o AutoML object |
a data frame of models' parameters
E. F. Haghish
## Not run:
if (requireNamespace("h2o")) {
  library(h2o)
  h2o.init(ignore_config = TRUE, nthreads = 2,
           bind_to_localhost = FALSE, insecure = TRUE)
  prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
  prostate <- h2o.importFile(path = prostate_path, header = TRUE)
  y <- "CAPSULE"
  prostate[, y] <- as.factor(prostate[, y])  # convert to factor for classification
  aml <- h2o.automl(y = y, training_frame = prostate, include_algos = "GLM",
                    max_models = 1, max_runtime_secs = 60)

  # extract the model parameters
  model.param <- automlModelParam(aml@leader)
}
## End(Not run)
Evaluates variable importance as well as bootstrapped variable importance for a single model or a grid of models
bootImportance(model, df, metric, n = 100)
model |
a model or a grid of models trained by h2o machine learning software |
df |
dataset for testing the model. if "n" is bigger than 1, this dataset will be used for drawing bootstrap samples. otherwise (default), the entire dataset will be used for evaluating the model |
metric |
character. model evaluation metric to be passed to boot R package. this could be, for example "AUC", "AUCPR", "RMSE", etc., depending on the model you have trained. all evaluation metrics provided for your H2O models can be specified here. |
n |
number of bootstraps |
list of mean performance of the specified metric and other bootstrap results
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
df <- read.csv(prostate_path)

# prepare the dataset for analysis before converting it to h2o frame.
df$CAPSULE <- as.factor(df$CAPSULE)

# convert the dataframe to H2OFrame and run the analysis
prostate.hex <- as.h2o(df)
aml <- h2o.automl(y = "CAPSULE", training_frame = prostate.hex,
                  max_runtime_secs = 30)

# evaluate the model performance
perf <- h2o.performance(aml@leader, xval = TRUE)

# evaluate bootstrap performance for the training dataset
# NOTE that the raw data is given not the 'H2OFrame'
perf <- bootPerformance(model = aml@leader, df = df, metric = "RMSE", n = 500)
## End(Not run)
Evaluate model performance by bootstrapping from training dataset
bootPerformance(model, df, metric, n = 100)
model |
a model trained by h2o machine learning software |
df |
training, validation, or testing dataset to bootstrap from |
metric |
character. model evaluation metric to be passed to boot R package. this could be, for example "AUC", "AUCPR", "RMSE", etc., depending on the model you have trained. all evaluation metrics provided for your H2O models can be specified here. |
n |
number of bootstraps |
list of mean performance of the specified metric and other bootstrap results
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
df <- read.csv(prostate_path)

# prepare the dataset for analysis before converting it to h2o frame.
df$CAPSULE <- as.factor(df$CAPSULE)

# convert the dataframe to H2OFrame and run the analysis
prostate.hex <- as.h2o(df)
aml <- h2o.automl(y = "CAPSULE", training_frame = prostate.hex,
                  max_runtime_secs = 30)

# evaluate the model performance
perf <- h2o.performance(aml@leader, xval = TRUE)

# evaluate bootstrap performance for the training dataset
# NOTE that the raw data is given not the 'H2OFrame'
perf <- bootPerformance(model = aml@leader, df = df, metric = "RMSE", n = 500)
## End(Not run)
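The idea of bootstrapping a performance metric can be illustrated without an H2O cluster. The following is only a minimal base R sketch under stated assumptions: it resamples stored predictions rather than re-scoring a model, and `boot_rmse` and the toy vectors are hypothetical names introduced for illustration, not part of the package and not how bootPerformance is implemented.

```r
# Hypothetical helper (not part of the package): bootstrap the RMSE of
# stored predictions by resampling observations with replacement.
boot_rmse <- function(actual, predicted, n = 100, seed = 1) {
  stopifnot(length(actual) == length(predicted), n >= 1)
  set.seed(seed)
  rmse <- function(a, p) sqrt(mean((a - p)^2))
  replicates <- vapply(seq_len(n), function(i) {
    idx <- sample(length(actual), replace = TRUE)  # one bootstrap sample
    rmse(actual[idx], predicted[idx])
  }, numeric(1))
  list(mean = mean(replicates),
       ci = quantile(replicates, c(0.025, 0.975)),  # percentile interval
       replicates = replicates)
}

# toy data standing in for observed outcomes and model predictions
actual    <- c(3, 5, 2, 8, 7, 4, 6, 9)
predicted <- c(2.5, 5.5, 2, 7, 8, 4.5, 5, 9.5)

res <- boot_rmse(actual, predicted, n = 200)
res$mean
res$ci
```

The spread of `res$replicates` gives a rough sense of how much the metric varies across resamples, which is the kind of information needed before treating a small difference between two models as meaningful.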
Checks that the specified 'df' is indeed a data.frame and, moreover, that it has no columns of class 'character' or 'ordered'. This function helps you ensure that your data is compatible with the h2o R package.
checkFrame(df, ignore = NULL, is.df = TRUE, no.char = TRUE, no.ordered = TRUE)
df |
data.frame object to evaluate |
ignore |
character vector of column names that should be ignored, if any. |
is.df |
logical. if TRUE, it examines if the 'df' is 'data.frame' |
no.char |
logical. if TRUE, it examines if the 'df' has any columns of class 'character' |
no.ordered |
logical. if TRUE, it examines if the 'df' has any columns of class 'ordered' factors |
nothing
E. F. Haghish
data(cars)

# no error is expected because 'cars' dataset does not
# have 'ordered' or 'character' columns
checkFrame(cars)
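To make the checks described above concrete, here is a rough base R sketch of the same kind of validation. `check_frame_sketch` is a hypothetical stand-in written for illustration; it is not the package's checkFrame implementation, and its error messages are assumptions.

```r
# Hypothetical stand-in for illustration, not the package's checkFrame().
check_frame_sketch <- function(df, ignore = NULL) {
  if (!is.data.frame(df)) stop("'df' is not a data.frame")
  cols <- setdiff(names(df), ignore)          # columns that are checked
  is_char <- vapply(df[cols], is.character, logical(1))
  is_ord  <- vapply(df[cols], is.ordered, logical(1))
  if (any(is_char)) stop("character columns: ", paste(cols[is_char], collapse = ", "))
  if (any(is_ord))  stop("ordered columns: ", paste(cols[is_ord], collapse = ", "))
  invisible(NULL)
}

bad <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

check_frame_sketch(cars)               # passes silently
check_frame_sketch(bad, ignore = "y")  # passes because 'y' is ignored

# capture the error message raised for the character column
result <- tryCatch(check_frame_sketch(bad),
                   error = function(e) conditionMessage(e))
result
```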
Calculates F-Measure for any given value of Beta
Fmeasure(perf, beta = 1, max = FALSE)
perf |
a h2o object of class |
beta |
numeric, specifying beta value, which must be higher than zero |
max |
logical. default is FALSE. if TRUE, instead of providing the F-Measure for all the thresholds, the highest F-Measure is reported. |
a matrix of F-Measures for different thresholds or the highest F-Measure value
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])  # convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 30)

# evaluate the model performance
perf <- h2o.performance(aml@leader, xval = TRUE)

# evaluate F-Measure for a Beta = 3
Fmeasure(perf, beta = 3, max = TRUE)

# evaluate F-Measure for a Beta = 1.5
Fmeasure(perf, beta = 1.5, max = TRUE)

# evaluate F-Measure for a Beta = 4
Fmeasure(perf, beta = 4, max = TRUE)
## End(Not run)
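For reference, the general F-Measure is F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The sketch below evaluates it in base R from confusion counts at several thresholds. `f_beta` and the toy counts are hypothetical and only illustrate the formula; they do not reproduce how Fmeasure reads an H2O performance object.

```r
# Hypothetical helper: F-beta from true positives, false positives and
# false negatives (vectorised over thresholds).
f_beta <- function(tp, fp, fn, beta = 1) {
  stopifnot(beta > 0)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# toy confusion counts at three decision thresholds
tp <- c(40, 30, 20)
fp <- c(20, 10, 2)
fn <- c(10, 20, 30)

f1 <- f_beta(tp, fp, fn, beta = 1)  # balances precision and recall
f3 <- f_beta(tp, fp, fn, beta = 3)  # weights recall more heavily
max(f3)                             # highest F-Measure across thresholds
```

A beta above 1 favours recall and a beta below 1 favours precision, so the threshold that maximises the F-Measure can change with beta.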
retrieve performance matrix for all thresholds
getPerfMatrix(perf)
perf |
a h2o object of class |
a matrix of performance metrics for all thresholds
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])  # convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 30)

# evaluate the model performance
perf <- h2o.performance(aml@leader, xval = TRUE)

# get the performance matrix for all thresholds
getPerfMatrix(perf)
## End(Not run)
extracts the model IDs from H2O AutoML object or H2O grid
h2o.get_ids(automl)
automl |
a h2o |
a character vector of trained models' names (IDs)
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])  # convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 30)

# get the model IDs
ids <- h2o.get_ids(aml)
## End(Not run)
Calculates kappa for all thresholds
kappa(perf, max = FALSE)
perf |
a h2o object of class |
max |
logical. default is FALSE. if TRUE, instead of providing kappa for all the thresholds, the highest kappa is reported. |
a matrix of kappa values for different thresholds or the highest kappa value
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])  # convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 30)

# evaluate the model performance
perf <- h2o.performance(aml@leader, xval = TRUE)

# report the highest kappa across all thresholds
kappa(perf, max = TRUE)
## End(Not run)
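Assuming the statistic meant here is Cohen's kappa, it compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The base R sketch below computes it from a binary confusion matrix. `cohen_kappa` and the counts are hypothetical and are not taken from the package's source.

```r
# Hypothetical helper: Cohen's kappa from binary confusion counts
# (vectorised, so each element can correspond to one threshold).
cohen_kappa <- function(tp, fp, fn, tn) {
  n  <- tp + fp + fn + tn
  po <- (tp + tn) / n                                            # observed agreement
  pe <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2    # chance agreement
  (po - pe) / (1 - pe)
}

# toy counts at two thresholds; the second one classifies perfectly
k <- cohen_kappa(tp = c(40, 50), fp = c(20, 0), fn = c(10, 0), tn = c(30, 50))
k
max(k)  # analogous to asking for the highest kappa across thresholds
```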
takes h2o performance object of class "H2OBinomialMetrics" alongside caret confusion matrix and provides different model performance measures supported by h2o and caret
performance(perf)
perf |
h2o performance object of class "H2OBinomialMetrics" |
numeric vector
E. F. Haghish
## Not run:
library(h2o)
h2o.init(ignore_config = TRUE, nthreads = 2,
         bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])  # convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 30)

# evaluate the model performance
perf <- h2o.performance(aml@leader, xval = TRUE)

# compute more performance measures
performance(perf)
## End(Not run)