R packages by haghish

DOT - Render and Export DOT Graphs in R

Renders the DOT diagram markup language in R and can export the graphs in PostScript and SVG (Scalable Vector Graphics) formats. It also supports literate programming packages such as 'knitr' and 'rmarkdown'.

Last updated 1 month ago

5.28 score 2 stars 1 dependent 63 scripts 361 downloads

shapley - Weighted Mean SHAP and CI for Robust Feature Assessment in ML Grid

This R package introduces Weighted Mean SHapley Additive exPlanations (WMSHAP), a method for calculating SHAP values for a grid of fine-tuned base-learner machine learning models as well as stacked ensembles, which was not previously available because of the common reliance on a single best-performing model. By integrating the weighted mean SHAP values from the individual base-learners that make up an ensemble, or from the individual base-learners in a tuning grid search, the package weights SHAP contributions according to each model's performance, assessed by R squared (for both regression and classification models). Alternatively, SHAP values can be weighted by the area under the precision-recall curve (AUCPR), the area under the curve (AUC), or the F2 measure for binary classifiers. The package further extends this framework with weighted confidence intervals for the weighted mean SHAP values, offering a more comprehensive and robust evaluation of feature importance over a grid of machine learning models instead of computing SHAP values for the best model alone. This methodology is particularly beneficial under severe class imbalance (class rarity), where there is no universal criterion for identifying the absolute best model: it provides a transparent, generalized measure of feature importance that mitigates the risk of reporting SHAP values for an overfitted or biased model. Furthermore, the package implements hypothesis testing to ascertain the statistical significance of SHAP values for individual features, as well as comparative significance testing of SHAP contributions between features.
Additionally, it addresses a critical gap in the feature selection literature by presenting criteria for automatically selecting the most important features across a grid of models or stacked ensembles, eliminating the need to choose an arbitrary number of top features. This utility is valuable for researchers analyzing feature significance, particularly for severely imbalanced outcomes where conventional methods fall short. Moreover, it is expected to report a democratic feature importance across a grid of models, resulting in a more comprehensive and generalizable feature selection. The package further implements a novel method for visualizing SHAP values at both the subject level and the feature level, as well as a plot for feature selection based on the weighted mean SHAP ratios.
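The weighting idea described above can be illustrated with a minimal, library-free sketch. It is written in Python for illustration only and is not the package's R API: the function name `weighted_mean_shap`, the input layout (one mean absolute SHAP value per feature per model), and the example numbers are all assumptions. The weighted standard deviation is shown only as a simple spread measure; the package's own confidence-interval formula is not reproduced here.

```python
from math import sqrt


def weighted_mean_shap(shap_by_model, weights):
    """Combine per-model SHAP summaries into one weighted summary per feature.

    shap_by_model: list of dicts, one per model, mapping feature -> mean |SHAP|.
    weights: one performance score per model (e.g. AUCPR); higher means more weight.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    result = {}
    for feature in shap_by_model[0]:
        values = [m[feature] for m in shap_by_model]
        # Performance-weighted mean of the per-model SHAP values.
        mean = sum(w * v for w, v in zip(weights, values)) / total
        # Weighted variance around that mean (a spread measure, not a CI).
        var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
        result[feature] = {"mean": mean, "sd": sqrt(var)}
    return result


# Hypothetical grid of three models with made-up SHAP summaries and AUCPR scores.
models = [
    {"age": 0.30, "income": 0.10},
    {"age": 0.20, "income": 0.20},
    {"age": 0.10, "income": 0.30},
]
aucpr = [0.5, 0.3, 0.2]

summary = weighted_mean_shap(models, aucpr)
for name, stats in summary.items():
    print(name, round(stats["mean"], 3), round(stats["sd"], 3))
```

Because the best-performing model carries the largest weight, "age" ends up with a higher weighted mean than "income" even though the unweighted means of the two features are equal.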

Last updated 14 days ago

class-imbalance, class-imbalance-problem, feature-extraction, feature-importance, feature-selection, machine-learning, machine-learning-algorithms, shap, shap-analysis, shap-values, shapely, shapley-additive-explanations, shapley-decomposition, shapley-value, shapley-values, shapleyvalue, weighted-shap, weighted-shap-confidence-interval, weighted-shapley, weighted-shapley-ci

5.25 score 15 stars 17 scripts 602 downloads

mlim - Single and Multiple Imputation with Automated Machine Learning

Machine learning algorithms have been used for single missing data imputation and, more recently, for multiple imputation. However, this is the first attempt to use automated machine learning algorithms for both single and multiple imputation. Automated machine learning is a procedure for fine-tuning the model automatically, performing a random search for a model that yields lower error without overfitting the data. The main idea is to let the model set its own parameters for imputing each variable separately, instead of using fixed, predefined parameters to impute all variables of the dataset. Using automated machine learning, the package fine-tunes an Elastic Net (default), Gradient Boosting, Random Forest, Deep Learning, Extreme Gradient Boosting, or Stacked Ensemble machine learning model (from one or a combination of the other supported algorithms) for imputing the missing observations. This procedure has been implemented for the first time by this package and is expected to outperform imputation packages that do not fine-tune their models. Multiple imputation is implemented via bootstrapping, without letting duplicated observations harm the cross-validation procedure that is used to evaluate the imputed variables. Most notably, the package implements an automated procedure for imputing imbalanced data (the class rarity problem), which occurs when one level of a factor variable is far more prevalent than the other(s). Such imbalance is known to result in biased predictions and hence biased imputation of missing data. The autobalancing procedure ensures that, instead of focusing on maximizing accuracy (minimizing classification error) when imputing factor variables, a fairer imputation method is practiced.

Last updated 8 months ago

automatic-machine-learning, automl, classimbalance, data-science, elastic-net, extreme-gradient-boosting, gbm, glm, gradient-boosting, gradient-boosting-machine, imputation, imputation-algorithm, imputation-methods, machine-learning, missing-data, multipleimputation, stack-ensemble

4.49 score 31 stars 7 scripts 231 downloads

autoEnsemble - Automated Stacked Ensemble Classifier for Severe Class Imbalance

A stacking solution for modeling imbalanced and severely skewed data. It automates the process of building homogeneous or heterogeneous stacked ensemble models by selecting the "best" models according to different criteria. In doing so, it strategically searches for and selects diverse, high-performing base-learners to construct ensemble models optimized for skewed data. This package is particularly useful for addressing class imbalance in datasets, using ensemble strategies that aim to stabilize the model, reduce overfitting, and improve generalizability.

Last updated 9 days ago

ai, algorithm, automated-machine-learning, automl, automl-algorithms, ensemble, ensemble-learning, h2o, h2oai, machine-learning, machinelearning, metalearning, stack-ensemble, stacked-ensembles, stacking

4.42 score 5 stars 21 scripts 272 downloads

md.log - Produces Markdown Log File with a Built-in Function Call

Produces a clean and neat Markdown log file and provides an argument to include the function call inside the Markdown log.

Last updated 3 years ago

4.13 score 1 star 1 dependent 1 script 196 downloads

h2otools - Machine Learning Model Evaluation for 'h2o' Package

Enhances the H2O platform by providing tools for detailed evaluation of machine learning models. It includes functions for bootstrapped performance evaluation, extended F-score calculations, and various other metrics, aimed at improving model assessment.
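As a point of reference for the F-score family mentioned above, the standard F-beta definition can be sketched as follows. This is a plain Python illustration, not the package's R interface: the function name `f_beta` and the confusion-matrix counts are assumptions made for the example. A beta above 1 (such as the F2 measure) weights recall more heavily than precision, and a beta below 1 does the opposite.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta score computed from confusion-matrix counts.

    tp, fp, fn: true positives, false positives, false negatives.
    beta: values above 1 favor recall, values below 1 favor precision.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all, to avoid division by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts for a binary classifier.
tp, fp, fn = 40, 10, 20
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(tp, fp, fn, beta):.4f}")
```

With these counts precision (0.8) exceeds recall (about 0.667), so the score falls as beta grows and recall is given more weight.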

Last updated 16 days ago

4.10 score 2 stars 1 dependent 14 scripts 224 downloads

HMDA - Holistic Multimodel Domain Analysis for Exploratory Machine Learning

Holistic Multimodel Domain Analysis (HMDA) is a robust and transparent framework designed for exploratory machine learning research, aiming to enhance the process of feature assessment and selection. HMDA addresses key limitations of traditional machine learning methods by evaluating the consistency across multiple high-performing models within a fine-tuned modeling grid, thereby improving the interpretability and reliability of feature importance assessments. Specifically, it computes Weighted Mean SHapley Additive exPlanations (WMSHAP), which aggregate feature contributions from multiple models based on weighted performance metrics. HMDA also provides confidence intervals to demonstrate the stability of these feature importance estimates. This framework is particularly beneficial for analyzing complex, multidimensional datasets common in health research, supporting reliable exploration of mental health outcomes such as suicidal ideation, suicide attempts, and other psychological conditions. Additionally, HMDA includes automated procedures for feature selection based on WMSHAP ratios and performs dimension reduction analyses to identify underlying structures among features. For more details see Haghish (2025) <doi:10.13140/RG.2.2.32473.63846>.

Last updated 2 days ago

ensemble-feature-importance, explainable-ai, explainable-artificial-intelligence, explainable-machine-learning, explainable-ml, exploratory-machine-learning, exploratory-modelling, feature-importance, feature-selection-methods, holistic-modeling, holistic-multimodel-domain-analysis, multimodel-ensemble, reproducible-ai, reproducible-research, robust-feature-selection, shapley-additive-explanations, shapley-values, transparent-ai, weighted-mean-shap, wmshap

3.54 score 1 star 59 downloads

adjROC - Computing Sensitivity at a Fix Value of Specificity and Vice Versa as Well as Bootstrap Metrics for ROC Curves

This software assesses the receiver operating characteristic (ROC) curve at adjusted thresholds, enabling the comparison of sensitivity and specificity across multiple binary classification models. Instead of comparing different models with varied cutoff values in their risk thresholds, all models can be compared at a fixed threshold of sensitivity, a fixed threshold of specificity, or the crossing point between sensitivity and specificity. If a threshold for specificity is given (e.g., specificity = 0.9), sensitivity and its confidence interval are computed, and vice versa. If the threshold for either sensitivity or specificity is not provided, the crossing point between the sensitivity and specificity curves is returned, along with their confidence intervals. For bootstrap procedures, the software evaluates the mean and CI bootstrap values for sensitivity, specificity, and the crossing point between specificity and sensitivity. This allows users to discern whether the performance of a model (based on adjusted sensitivity or adjusted specificity) is significantly different from other models. This software addresses the issue of comparing different classification models with varying predefined cutoff thresholds, which often leads to inconclusive results due to the fluctuating values of both sensitivity and specificity.
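The idea of reading sensitivity at a fixed specificity can be sketched in a few lines. The Python below is an illustrative sketch under stated assumptions, not the package's R implementation: the function name `sensitivity_at_specificity`, the rule "predict positive when score >= threshold", and the example data are all hypothetical, and the bootstrap confidence intervals described above are omitted.

```python
def sensitivity_at_specificity(scores, labels, target_spec):
    """Find the threshold with the highest sensitivity whose specificity
    is at least target_spec. A case is predicted positive when its score
    is greater than or equal to the threshold.

    Returns a tuple (threshold, sensitivity, specificity).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best = None
    # Candidate thresholds: every observed score, plus infinity
    # (which predicts nothing positive, so specificity is 1).
    for t in sorted(set(scores)) + [float("inf")]:
        spec = sum(s < t for s in neg) / len(neg)
        sens = sum(s >= t for s in pos) / len(pos)
        if spec >= target_spec and (best is None or sens > best[1]):
            best = (t, sens, spec)
    return best


# Hypothetical predicted risks and true classes for ten cases.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

threshold, sens, spec = sensitivity_at_specificity(scores, labels, 0.8)
print(threshold, sens, spec)
```

Applying the same target specificity to every model places them all at a comparable operating point, so their sensitivities can be compared directly.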

Last updated 9 months ago

2.70 score 349 downloads