Title: | Fit Log-Ratio Lasso Regression for Compositional Data |
---|---|
Description: | Log-ratio Lasso regression for continuous, binary, and survival outcomes with (longitudinal) compositional features. See Fei and others (2024) <doi:10.1016/j.crmeth.2024.100899>. |
Authors: | Teng Fei [aut, cre, cph] |
Maintainer: | Teng Fei <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.4.0 |
Built: | 2025-02-17 03:38:45 UTC |
Source: | https://github.com/vdblab/floral |
Summarizing FLORAL outputs from various choices of a
a.FLORAL(
  a = c(0.1, 0.5, 1), ncore = 1, seed = NULL, x, y, ncov = 0,
  family = "gaussian", longitudinal = FALSE, id = NULL, tobs = NULL,
  failcode = NULL, corstr = "exchangeable", scalefix = FALSE,
  scalevalue = 1, pseudo = 1, length.lambda = 100,
  lambda.min.ratio = NULL, ncov.lambda.weight = 0, mu = 1, pfilter = 0,
  maxiter = 100, ncv = 5, intercept = FALSE, step2 = FALSE,
  progress = TRUE
)
a |
Vector of scalars between 0 and 1 to be compared. |
ncore |
Number of cores used for parallel computation. Default is to use only 1 core. |
seed |
A random seed for reproducibility of the results. By default the seed is the numeric form of |
x |
Feature matrix, where rows specify subjects and columns specify features. The first |
y |
Outcome. For a continuous or binary outcome, |
ncov |
An integer indicating the number of first |
family |
Available options are |
longitudinal |
|
id |
If |
tobs |
If |
failcode |
If |
corstr |
If a GEE model is specified, then |
scalefix |
|
scalevalue |
Specify the scale parameter if |
pseudo |
Pseudo count to be added to |
length.lambda |
Number of penalty parameters used in the path |
lambda.min.ratio |
Ratio between the minimum and maximum choice of lambda. Default is |
ncov.lambda.weight |
Weight of the penalty lambda applied to the first |
mu |
Value of penalty for the augmented Lagrangian |
pfilter |
A pre-specified threshold for the GEE model: coefficients whose absolute values are less than pfilter times the maximum absolute coefficient are set to zero. Default is zero, such that all coefficients will be reported. |
maxiter |
Number of iterations needed for the outer loop of the augmented Lagrangian algorithm. |
ncv |
Number of cross-validation folds. Use NULL if cross-validation is not wanted. |
intercept |
|
step2 |
|
progress |
|
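The `length.lambda` and `lambda.min.ratio` arguments together describe a grid of penalty values. The sketch below shows one common way such a grid can be built, assuming log-spaced values running from a maximum penalty down to `lambda.min.ratio` times that maximum. The helper `lambda_path()`, the value of `lambda_max`, and the ratio `0.01` are hypothetical and chosen for illustration only; the grid FLORAL constructs internally may differ.

```r
# Hypothetical helper: a log-spaced, decreasing penalty grid.
# lambda_max and the ratio 0.01 are illustrative assumptions, not FLORAL defaults.
lambda_path <- function(lambda_max, length.lambda = 100, lambda.min.ratio = 0.01) {
  exp(seq(log(lambda_max),
          log(lambda_max * lambda.min.ratio),
          length.out = length.lambda))
}

path <- lambda_path(lambda_max = 2, length.lambda = 5, lambda.min.ratio = 0.01)
print(path)  # starts at lambda_max and decreases to lambda_max * lambda.min.ratio
```

A log-spaced grid places more candidate values near the small-penalty end, where the set of selected features tends to change most quickly.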
A ggplot2 object of cross-validated prediction metric versus lambda, stratified by a. Detailed data can be retrieved from the ggplot2 object itself.
Teng Fei. Email: [email protected]
Fei T, Funnell T, Waters N, Raj SS et al. Scalable Log-ratio Lasso Regression Enhances Microbiome Feature Selection for Predictive Models. bioRxiv 2023.05.02.538599.
set.seed(23420)
dat <- simu(n=50,p=30,model="linear")
pmetric <- a.FLORAL(a=c(0.1,1),ncore=1,x=dat$xcount,y=dat$y,family="gaussian",ncv=2,progress=FALSE)
Conduct log-ratio lasso regression for continuous, binary and survival outcomes.
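To illustrate why a log-ratio model suits compositional features, the sketch below assumes the zero-sum constraint on coefficients that is commonly used in log-ratio lasso (the constraint is an assumption here, not a statement about FLORAL's internals). Under that constraint, rescaling every count in a sample by the same factor, such as sequencing depth, leaves the linear predictor unchanged. The toy counts, coefficients, and depth factors are hypothetical.

```r
# Hypothetical toy data: 4 samples (rows) by 3 taxa (columns), all counts positive.
counts <- matrix(c(10, 20, 30,
                    5, 15, 25,
                    8,  2, 40,
                   12,  7,  9), nrow = 4, byrow = TRUE)

# Illustrative coefficients satisfying the assumed zero-sum constraint.
beta <- c(0.5, -0.2, -0.3)

# Linear predictor computed from log counts.
eta_counts <- drop(log(counts) %*% beta)

# Multiply each sample (row) by its own arbitrary depth factor.
depth <- c(1, 10, 0.5, 3)
eta_scaled <- drop(log(counts * depth) %*% beta)

# The two predictors agree up to floating-point error.
print(all.equal(eta_counts, eta_scaled))
```

The agreement follows because log(depth) is added to every taxon in a sample and is then multiplied by coefficients that sum to zero.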
FLORAL(
  x, y, ncov = 0, family = "gaussian", longitudinal = FALSE, id = NULL,
  tobs = NULL, failcode = NULL, corstr = "exchangeable",
  scalefix = FALSE, scalevalue = 1, pseudo = 1, length.lambda = 100,
  lambda.min.ratio = NULL, ncov.lambda.weight = 0, a = 1, mu = 1,
  pfilter = 0, maxiter = 100, ncv = 5, ncore = 1, intercept = FALSE,
  foldid = NULL, step2 = TRUE, progress = TRUE, plot = TRUE
)
x |
Feature matrix, where rows specify subjects and columns specify features. The first |
y |
Outcome. For a continuous or binary outcome, |
ncov |
An integer indicating the number of first |
family |
Available options are |
longitudinal |
|
id |
If |
tobs |
If |
failcode |
If |
corstr |
If a GEE model is specified, then |
scalefix |
|
scalevalue |
Specify the scale parameter if |
pseudo |
Pseudo count to be added to |
length.lambda |
Number of penalty parameters used in the path |
lambda.min.ratio |
Ratio between the minimum and maximum choice of lambda. Default is |
ncov.lambda.weight |
Weight of the penalty lambda applied to the first |
a |
A scalar between 0 and 1: |
mu |
Value of penalty for the augmented Lagrangian |
pfilter |
A pre-specified threshold for the GEE model: coefficients whose absolute values are less than pfilter times the maximum absolute coefficient are set to zero. Default is zero, such that all coefficients will be reported. |
maxiter |
Number of iterations needed for the outer loop of the augmented Lagrangian algorithm. |
ncv |
Number of cross-validation folds. Use NULL if cross-validation is not wanted. |
ncore |
Number of cores for parallel computing for cross-validation. Default is 1. |
intercept |
|
foldid |
A vector of fold indicators. Default is NULL. |
step2 |
|
progress |
|
plot |
|
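The `pfilter` rule described in the argument list can be sketched in a few lines of base R. The helper `apply_pfilter()` and the coefficient vector are hypothetical and only mirror the wording of the argument description; this is not the package's own code.

```r
# Hypothetical helper mirroring the pfilter description:
# coefficients with |beta| < pfilter * max(|beta|) are set to zero.
apply_pfilter <- function(beta, pfilter = 0) {
  cutoff <- pfilter * max(abs(beta))
  beta[abs(beta) < cutoff] <- 0
  beta
}

beta <- c(0.80, -0.02, 0.10, -0.40, 0.005)

kept_all <- apply_pfilter(beta, pfilter = 0)     # cutoff is 0, nothing is removed
filtered <- apply_pfilter(beta, pfilter = 0.05)  # cutoff is 0.05 * 0.80 = 0.04

print(filtered)  # the entries -0.02 and 0.005 become 0
```

With the default `pfilter = 0` the cutoff is zero, so every coefficient is reported, which matches the documented default behaviour.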
A list with path-specific estimates (beta), path (lambda), and others. Details can be found in README.md.
Teng Fei. Email: [email protected]
Fei T, Funnell T, Waters N, Raj SS et al. Enhanced Feature Selection for Microbiome Data using FLORAL: Scalable Log-ratio Lasso Regression. bioRxiv 2023.05.02.538599.
set.seed(23420)

# Continuous outcome
dat <- simu(n=50,p=30,model="linear")
fit <- FLORAL(dat$xcount,dat$y,family="gaussian",ncv=2,progress=FALSE,step2=TRUE)

# Binary outcome
# dat <- simu(n=50,p=30,model="binomial")
# fit <- FLORAL(dat$xcount,dat$y,family="binomial",progress=FALSE,step2=TRUE)

# Survival outcome
# dat <- simu(n=50,p=30,model="cox")
# fit <- FLORAL(dat$xcount,survival::Surv(dat$t,dat$d),family="cox",progress=FALSE,step2=TRUE)

# Competing risks outcome
# dat <- simu(n=50,p=30,model="finegray")
# fit <- FLORAL(dat$xcount,survival::Surv(dat$t,dat$d,type="mstate"),failcode=1,
#               family="finegray",progress=FALSE,step2=FALSE)

# Longitudinal continuous outcome
# dat <- simu(n=50,p=30,model="gee",geetype="gaussian",m=3,corstr="exchangeable",sdvec=rep(1,3))
# fit <- FLORAL(x=cbind(dat$tvec, dat$xcount),y=dat$y,id=dat$id,family="gaussian",
#               ncov=1,longitudinal = TRUE,corstr = "exchangeable",lambda.min.ratio=1e-3,
#               progress=FALSE,step2=FALSE)
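When the same cross-validation split should be reused across fits, a fold indicator can be built once and supplied through `foldid`. The sketch below is a minimal base R illustration; the helper `make_foldid()` is hypothetical, and the coding of folds as integers from 1 to `ncv` is an assumption borrowed from common practice rather than something stated in this manual.

```r
# Hypothetical helper: assign n subjects to ncv folds of near-equal size.
# Coding folds as integers 1..ncv is an assumption, not documented behaviour.
make_foldid <- function(n, ncv = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  sample(rep(seq_len(ncv), length.out = n))
}

foldid <- make_foldid(n = 50, ncv = 5, seed = 23420)
print(table(foldid))  # 10 subjects per fold

# The vector could then be passed to the fit (not run here):
# fit <- FLORAL(dat$xcount, dat$y, family = "gaussian", foldid = foldid, progress = FALSE)
```

For longitudinal data it may be preferable to assign folds per subject rather than per row, so that all rows sharing an `id` stay in the same fold.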
Summarizing FLORAL outputs from multiple random k-fold cross validations
mcv.FLORAL(
  mcv = 10, ncore = 1, seed = NULL, x, y, ncov = 0, family = "gaussian",
  longitudinal = FALSE, id = NULL, tobs = NULL, failcode = NULL,
  corstr = "exchangeable", scalefix = FALSE, scalevalue = 1, pseudo = 1,
  length.lambda = 100, lambda.min.ratio = NULL, ncov.lambda.weight = 0,
  a = 1, mu = 1, pfilter = 0, maxiter = 100, ncv = 5, intercept = FALSE,
  step2 = TRUE, progress = TRUE, plot = TRUE
)
mcv |
Number of random ncv-fold cross-validations to be performed. |
ncore |
Number of cores used for parallel computation. Default is to use only 1 core. |
seed |
A random seed for reproducibility of the results. By default the seed is the numeric form of |
x |
Feature matrix, where rows specify subjects and columns specify features. The first |
y |
Outcome. For a continuous or binary outcome, |
ncov |
An integer indicating the number of first |
family |
Available options are |
longitudinal |
|
id |
If |
tobs |
If |
failcode |
If |
corstr |
If a GEE model is specified, then |
scalefix |
|
scalevalue |
Specify the scale parameter if |
pseudo |
Pseudo count to be added to |
length.lambda |
Number of penalty parameters used in the path |
lambda.min.ratio |
Ratio between the minimum and maximum choice of lambda. Default is |
ncov.lambda.weight |
Weight of the penalty lambda applied to the first |
a |
A scalar between 0 and 1: |
mu |
Value of penalty for the augmented Lagrangian |
pfilter |
A pre-specified threshold for the GEE model: coefficients whose absolute values are less than pfilter times the maximum absolute coefficient are set to zero. Default is zero, such that all coefficients will be reported. |
maxiter |
Number of iterations needed for the outer loop of the augmented Lagrangian algorithm. |
ncv |
Number of cross-validation folds. Use NULL if cross-validation is not wanted. |
intercept |
|
step2 |
|
progress |
|
plot |
|
A list with relative frequencies of a certain feature being selected over mcv ncv-fold cross-validations.
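The idea of a relative selection frequency can be illustrated without running any model. In the sketch below, `selected` is a hypothetical list holding the features chosen in each of four repeated cross-validation runs; the feature names and selections are made up, and the structure of the list actually returned by `mcv.FLORAL` may differ.

```r
# Hypothetical selections from 4 repeated cross-validation runs.
taxa <- c("taxon1", "taxon2", "taxon3", "taxon4")
selected <- list(
  c("taxon1", "taxon3"),
  c("taxon1"),
  c("taxon1", "taxon2", "taxon3"),
  c("taxon3")
)

# Count how often each taxon was selected, keeping never-selected taxa at 0,
# then divide by the number of runs to obtain relative frequencies.
counts <- table(factor(unlist(selected), levels = taxa))
freq <- as.numeric(counts) / length(selected)
names(freq) <- taxa

print(sort(freq, decreasing = TRUE))
```

Features with a frequency close to 1 are selected consistently across random splits, which is a rough indication of selection stability.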
Teng Fei. Email: [email protected]
Fei T, Funnell T, Waters N, Raj SS et al. Scalable Log-ratio Lasso Regression Enhances Microbiome Feature Selection for Predictive Models. bioRxiv 2023.05.02.538599.
set.seed(23420)
dat <- simu(n=50,p=30,model="linear")
fit <- mcv.FLORAL(mcv=2,ncore=1,x=dat$xcount,y=dat$y,ncv=2,progress=FALSE,step2=TRUE,plot=FALSE)
Create data input list from phyloseq object
phy_to_floral_data(phy, y = NULL, covariates = NULL)
phy |
Phyloseq object |
y |
Outcome column of interest from phy's sample_data |
covariates |
Covariate column names from phy's sample_data |
list
library(phyloseq)
data(GlobalPatterns)
# add a covariate
sample_data(GlobalPatterns)$test <- rep(c(1, 0), nsamples(GlobalPatterns)/2)
# GlobalPatterns <- tax_glom(GlobalPatterns, "Phylum")
dat <- phy_to_floral_data(GlobalPatterns, y = "test", covariates = c("SampleType"))
# res <- FLORAL(x = dat$xcount, y=dat$y, ncov=dat$ncov, family = "binomial", ncv=NULL)
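For data that are not stored in a phyloseq object, an input with the same layout (covariate columns first, taxa counts after, plus the number of covariate columns) can be assembled by hand. The sketch below uses base R only; the covariates, taxa names, and counts are hypothetical, and it does not claim to reproduce what `phy_to_floral_data` does internally.

```r
set.seed(1)
n <- 6

# Hypothetical sample-level covariates: one numeric, one two-level factor.
covariates <- data.frame(
  age   = c(34, 51, 47, 29, 62, 40),
  group = factor(c("a", "b", "a", "b", "a", "b"))
)

# Hypothetical count matrix: n samples by 4 taxa.
xcount <- matrix(rpois(n * 4, lambda = 20), nrow = n,
                 dimnames = list(NULL, paste0("taxon", 1:4)))

# Expand the covariates into numeric columns and drop the intercept column.
cov_mat <- model.matrix(~ age + group, data = covariates)[, -1, drop = FALSE]

# Covariates go first, counts after; ncov records how many leading columns are covariates.
x <- cbind(cov_mat, xcount)
ncov <- ncol(cov_mat)

print(colnames(x))
print(ncov)
```

The resulting `x` and `ncov` follow the same pattern as the longitudinal example for `FLORAL`, where `cbind(dat$tvec, dat$xcount)` is passed with `ncov=1`.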
Simulate a dataset from log-ratio model.
simu(
  n = 100, p = 200, model = "linear", weak = 4, strong = 6,
  weaksize = 0.125, strongsize = 0.25, pct.sparsity = 0.5, rho = 0,
  timedep_slope = NULL, timedep_cor = NULL, geetype = "gaussian", m = 4,
  corstr = "exchangeable", sdvec = NULL, rhogee = 0.8, geeslope = 2.5,
  longitudinal_stability = TRUE, ncov = 0, betacov = 0, intercept = FALSE
)
n |
An integer specifying the sample size. |
p |
An integer specifying the number of features (taxa). |
model |
Type of models associated with outcome variable, can be "linear", "binomial", "cox", "finegray", "gee" (scalar outcome with time-dependent features), or "timedep" (survival endpoint with time-dependent features). |
weak |
Number of features with |
strong |
Number of features with |
weaksize |
Actual effect size for |
strongsize |
Actual effect size for |
pct.sparsity |
Percentage of zero counts for each sample. |
rho |
Parameter controlling the correlation structure between taxa. Ranges between 0 and 1. |
timedep_slope |
If |
timedep_cor |
If |
geetype |
If |
m |
If |
corstr |
If |
sdvec |
If |
rhogee |
If |
geeslope |
If |
longitudinal_stability |
If |
ncov |
Number of covariates that are not compositional features. |
betacov |
Coefficients corresponding to the covariates that are not compositional features. |
intercept |
Boolean. If TRUE, then a random intercept will be generated in the model. Only works for |
A list with simulated count matrix xcount, log1p-transformed count matrix x, outcome (continuous y, continuous centered y0, binary y, or survival t, d), true coefficient vector beta, list of non-zero features idx, value of intercept intercept (if applicable).
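The relationship between a count matrix and its log1p-transformed version can be checked directly in base R. The small matrix below is hypothetical; the sketch only shows that a log transformation with a pseudo count of 1 is the same as `log1p`, and that zero counts are mapped to zero.

```r
# Hypothetical counts: 2 samples by 3 taxa, including zeros.
xcount <- matrix(c(0, 3, 12,
                   7, 0,  1), nrow = 2, byrow = TRUE)

pseudo <- 1
x_log <- log(xcount + pseudo)  # log transformation after adding the pseudo count

# With pseudo = 1 this is identical to log1p(), and zero counts map to 0.
print(all.equal(x_log, log1p(xcount)))
print(x_log[xcount == 0])
```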
Teng Fei. Email: [email protected]
Fei T, Funnell T, Waters N, Raj SS et al. Enhanced Feature Selection for Microbiome Data using FLORAL: Scalable Log-ratio Lasso Regression. bioRxiv 2023.05.02.538599.
set.seed(23420)
dat <- simu(n=50,p=30,model="linear")