Purpose

kernInt uses the kernel framework to unify supervised and unsupervised microbiome analyses, while paying special attention to the integration of spatially and temporally related samples.

Installation and loading

In R console:

if (!requireNamespace("devtools")) install.packages("devtools")
devtools::install_github("elies-ramon/kernInt")

If the metagenomeSeq package is not already installed:

if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install("metagenomeSeq")

Once the package is installed, it can be loaded at any time by typing:

library(kernInt)
#> Loading required package: kernlab
#> Warning: replacing previous import 'ggplot2::alpha' by 'kernlab::alpha' when
#> loading 'kernInt'
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> sROC 0.1-2 loaded

Package Overview

Main features

  • Implementation of kernels for compositional data.
  • Kernels derived from classical ecology distances, such as Jaccard and Jensen-Shannon, are also available.
  • Implementation of kernels specific for functional data.
  • A previously unpublished longitudinal pig gut microbiome dataset.
  • Automatic training/test splitting of the input data, k-Cross Validation, and SVM classification and regression.
  • Microbial signatures of the classification and regression models.
  • Outliers / Novelty detection via SVMs.
  • Integration of data from different sources via Multiple Kernel Learning (MKL).

Available kernels

  • Widely used kernels for real vectors: the linear (lin) and RBF (rbf) kernels.
  • Kernels for compositional data: the compositional linear clin and Aitchison-RBF crbf kernels. Both kernels can be applied directly to raw count tables.
  • Kernels derived from ecological distances: the Jaccard jac and Jensen-Shannon jsk kernels. Metagenomic data should be normalized before applying these kernels, for example by using CSS (cumulative sum scaling); see the sketch after this list.
  • Kernels specific for functional data: the functional linear flin and functional RBF frbf kernels. They need functions (for instance, representing time series) as an input.
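
As a quick sketch of the normalization step mentioned above (using the CSSnorm() and kernPCA() calls that appear later in this vignette), raw counts can be CSS-normalized before applying an ecological kernel:

soilCSS <- CSSnorm(data=soil$abund)   ## CSS normalization of the raw counts
kernPCA(data=soilCSS, kernel="jac")   ## Jaccard kernel on the normalized data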

Example data

We offer three metagenomic datasets with the package: a single-time-point soil dataset, a human health dataset with a spatial component, and a novel longitudinal dataset concerning pig production. To better illustrate the longitudinal treatment of data, we also include the classical Berkeley Growth Dataset.

Soil data: Bacterial abundances (raw counts) in 88 soils from across North and South America. Metadata such as soil pH, annual season precipitation and temperature, country, elevation, etc. are available. The dataset can be accessed by typing soil.

Smokers: Microorganism abundances in the oropharynx and nasopharynx of 29 smokers and 33 nonsmokers. The dataset can be accessed by typing smoker.

Pig data: A previously unpublished longitudinal gut microbiome dataset of 153 piglets during their first week of life. The dataset can be accessed by typing pig.

Growth: Berkeley longitudinal height data of 54 girls and 39 boys (93 individuals in total) from ages 11 to 18. The dataset can be accessed by typing growth.
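
The structure of these objects can be inspected with standard R commands, for instance (assuming the $abund and $metadata components used in the examples below):

dim(soil$abund)       ## abundance table: 88 soil samples x taxa
head(soil$metadata)   ## associated metadata (pH, temperature, etc.)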

Usage - Standard case

We refer to data consisting of a sample of individuals (for example: humans, animals, soil sites…) with no repeated measures as “the standard case”. The input table must have sample names as rows and taxa/OTUs as columns; thus, the input of the ‘kernInt’ functions has class data.frame or matrix. We will use the soil data as the example for this case.

kernel PCA

Like standard PCA, kernel PCA can be used to summarize, visualize and/or extract features from a dataset. Data can be projected in a linear or nonlinear way, depending on the kernel used. When the kernel is the standard linear one (kernel="lin"), kernel PCA is equivalent to standard PCA.

The kernPCA() function has two mandatory arguments - data and kernel:

kernPCA(data=soil$abund,kernel="clin")

The rest of the arguments customize the plot. For example, the dots can be coloured according to a desired variable, which can be continuous or discrete. Here we show a kernel PCA of soil samples in which each sample is coloured according to its pH:

kernPCA(data=soil$abund,kernel="clin",  y=soil$metadata$ph, 
        col=c("aquamarine2","orange"),title = "Soil kernel PCA",legend = TRUE)

The projected data can be retrieved by setting plot=FALSE.
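
For instance, a minimal sketch of retrieving the projection for later use (the head() call assumes the returned object is matrix-like):

soilPCA <- kernPCA(data=soil$abund, kernel="clin", plot=FALSE)
head(soilPCA)   ## projected coordinates of the samples (assumed output format)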

Clustering

A dendrogram plot presenting a hierarchical clustering of the data can be obtained with:

soilData <- CSSnorm(data=soil$abund)  ### CSS normalization
hklust(data=soilData,kernel="jac",labels=1:nrow(soilData))

Additional arguments allow changing the agglomeration method, outlining clusters or customizing the plot:

hklust(data=soilData, kernel="jac", title="Soil data cluster dendrogram",
       cut=3, colors=c("coral3","orchid3","darkolivegreen3"))

The dendrogram object can be retrieved by setting plot=FALSE.
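
For instance, a minimal sketch of retrieving the clustering for later use:

soilDend <- hklust(data=soilData, kernel="jac", plot=FALSE)
plot(soilDend)   ## assuming the returned dendrogram supports the base plot() method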

SVM regression

SVM regression is performed with the regress() function. regress() performs automatic training/test splitting of the input data, k-Cross Validation if requested, and regression with optimal hyperparameters. To perform MKL, see the MKL section.

The most basic call only needs three arguments: data (predictor variables; e.g. taxonomic abundances), y (target variable; e.g. a phenotype) and kernel.

For example, if we want to predict the pH of soil (y) from the abundances (data) using the compositional linear kernel:

modelreg <- regress(data=soil$abund, kernel="clin", y=soil$metadata$ph)

If the user has a pre-computed kernel matrix at hand, it can be passed as input to data; kernel should then be set to kernel="matrix".
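
A sketch of this usage (assuming that qJacc(), shown later in the MKL section, returns a sample-by-sample Jaccard kernel matrix when given a single abundance table):

soilK <- qJacc(CSSnorm(data=soil$abund))                   ## pre-computed kernel matrix (assumed single-table usage of qJacc)
regress(data=soilK, y=soil$metadata$ph, kernel="matrix")   ## regression on the pre-computed matrix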

The SVM hyperparameters Cost (C) and Epsilon (E) can be specified, thus tuning how closely the model adjusts to the data. Roughly speaking, C is the penalty for committing errors. In SVM regression models, differences between predicted and actual values smaller than E are not considered errors. Both increasing C and decreasing E increase the complexity of the model and the risk of overfitting.

regress(data=soil$abund, kernel="clin", y=soil$metadata$ph, C=5, E=0.001)

In addition, a generic kernel hyperparameter (H) can be specified. For example, if the chosen kernel is RBF, H will be interpreted as gamma: RBF(x,y) = exp(-gamma·||x-y||^2).

regress(data=soil$abund, y=soil$metadata$ph, kernel="crbf", C=5, H=0.1)

k-Cross Validation can be performed to obtain the optimal hyperparameters. This is done by providing an argument to k. The best hyperparameter value (or the best combination of hyperparameter values) will then be selected among the values provided by the user:

regress(data=soil$abund, y=soil$metadata$ph, kernel="clin", C=c(1,5,10), E=c(0.001,0.1), k=5)

Training/test splitting is controlled by the p argument and the rownames of data. If given a numeric value between 0 and 1, regress() will interpret it as the proportion of data instances for the test set and will do a random split. The default is p=0.2. Alternatively, a vector containing the indexes or the names of the rows of the test set is also accepted.

If the input data has repeated rownames, regress() will consider the rows sharing an id to be repeated measures coming from the same individual. The function will then ensure that all repeated measures are used either to train or to test the model, but not both, thus preserving the independence of the training and test sets. Alternatively, users can supply the test partition themselves by setting p to a numeric (row indexes) or character (rownames) vector. The remaining data will be used for training.
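
For instance, a random 30% test split, or a user-defined test set given by row indexes (rows 1 to 18 are an arbitrary choice for illustration):

regress(data=soil$abund, y=soil$metadata$ph, kernel="clin", p=0.3)   ## random 30% of samples used as test set
regress(data=soil$abund, y=soil$metadata$ph, kernel="clin", p=1:18)  ## rows 1-18 used as test set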

Output

A list containing:

  • $nmse: Normalized mean squared error over the test data. This permits evaluating how good the model is at predicting.

  • $hyperparam: Hyperparameters’ values used to build the model and their cross-validation error (if applicable).

  • $prediction: Predicted and true values (test set). Rownames correspond to the indexes in the original data.

  • $var.imp: The variable importance (e.g. microbial signature) if a linear or linear-like kernel is used. To obtain relative importances, see the Variable importances section below.
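
For example, the components of the model fitted above can be inspected with:

modelreg$nmse               ## normalized mean squared error on the test set
modelreg$hyperparam         ## hyperparameters used (and CV error, if applicable)
head(modelreg$prediction)   ## predicted vs true values for the test samples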

SVM classification

SVM classification is performed via the classify() function. Both binary and multiclass classification (one-vs-one) are supported. One-class classification is available through the outliers() function.

The usage of classify() is for the most part identical to that of regress(). For example, to predict whether a given soil comes from a forest, tropical, shrubland or grassland environment:

modelclas <- classify(data=soil$abund, y=soil$metadata[ ,"env_feature"], kernel="clin")

Probabilistic classification is available setting prob=TRUE:

classify(data=soil$abund, y=soil$metadata[ ,"env_feature"], kernel="clin", prob=TRUE)

The available hyperparameters are H and C (where C is the cost of misclassification). In addition, classify() supports several methods to deal with imbalanced data:

-Class weighting:

classify(data=soil$abund, y=soil$metadata[ ,"env_feature"], kernel="clin", classimb="weights", C=c(0.001,0.01), k=10)

-Undersampling:

classify(data=soil$abund, y=soil$metadata[ ,"env_feature"], kernel="clin", classimb="ubUnder", C=c(0.001,0.01), k=5)

-Oversampling:

classify(data=soil$abund, y=soil$metadata[ ,"env_feature"], kernel="clin", classimb="ubOver", C=c(0.001,0.01), k=5)

Output

A list containing:

  • $conf.matrix: Confusion matrix (true versus predicted) for the test data. This permits evaluating how good the model is at predicting.

  • $hyperparam: Hyperparameters’ values used to build the model and their cross-validation error (if applicable).

  • $prediction: Predicted and true values (test set). Rownames correspond to the indexes in the original data. If prob=TRUE, the probability of each observation to belong to a given class.

  • $var.imp: The variable importance (e.g. microbial signature) if a linear or linear-like kernel is used.
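
As with regress(), the components of the fitted model can be inspected directly:

modelclas$conf.matrix        ## confusion matrix on the test set
head(modelclas$prediction)   ## predicted vs true classes for the test samples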

Variable importances

Following Guyon et al. (2002), we can obtain the importance of a variable in an SVM model as:

imp <- modelreg$var.imp^2

Then, to generate a plot of the 10 most important features:

imp <- imp/sum(imp) ## To give relative importances
imp10 <- sort(imp,decreasing = TRUE)[1:10]
par(mar=c(4,5,4,2))
barplot(sort(imp10),horiz = TRUE,las=2,
main="Soil data top ten important taxa",xlab = "Relative Importance")

## To know the taxonomic classification of the ten most important OTUs, do:
soil$taxonomy[names(imp10),]

Outlier detection

The outliers() function can be used either in a supervised or in an unsupervised way.

In the latter approach, the most basic call to this function needs two arguments: data (predictor variables) and kernel (the kernel function used). Then, the function will return the data outliers:

outliers(data=soil$abund ,kernel="clin")

The nu hyperparameter (nu) and the kernel hyperparameter H (interpreted as gamma for RBF-like kernels) can be specified:

outliers(data=soil$abund,kernel="crbf",nu=0.2,H=0.05)

If an argument for y is provided, outliers() functions as a one-class SVM. In that case, cross-validation will be performed if k is given. Also, p stands for the proportion (or indexes, or rownames) of data instances reserved for the test set.

outliers(data=soil$abund,y=soil$metadata[ ,"env_feature"],kernel="clin",nu=c(0.1,0.2),k=5)

Usage - Multiple data sources

The availability of multiple types of data for the same sample of individuals is becoming increasingly common. For instance, for a set of patients we may have data as diverse as metagenomic abundances, metabolomics and blood analyses. Another example is spatially related samples, as in the smoker dataset, where each individual is sampled at four body sites: left and right nasopharynx, and left and right oropharynx. Combining these different data types can be tricky for most methods, but the kernel framework offers a straightforward solution: use a specific kernel for each data source and then directly combine the kernel matrices. The process of obtaining an optimal convex combination of kernels is known as MKL (Multiple Kernel Learning).

MKL

MKL is available in classify(), regress() and outliers(). All features of these functions remain available when performing MKL. To do MKL, the data argument must be a list of length > 1. Each element of the list should be a data.frame or matrix, and their rows should correspond to the same individuals. If kernel="matrix", data may be a three-dimensional array. The kernel argument may contain a single kernel name (implying that the same kernel is used for all datasets) or a vector of kernel names, so that a different kernel is applied to each data type. For example, if we have a list of metagenomic abundances, as in smoker$abund:

css_data <- lapply(smoker$abund,CSSnorm) ## CSS normalization
smoking <-  smoker$metadata$smoker[seq(from=1,to=62*4,by=4)] ## Target variable
classify(data=css_data, y=smoking, kernel="jac")

## This is equivalent to:

jacc_kern <- sapply(smoker$abund,qJacc,simplify = "array") ## 3D Jaccard array 
classify(data=jacc_kern, y=smoking, kernel="matrix")

The coeff argument gives the weight of each data type in the kernel combination. When absent, the unweighted mean across all kernel matrices is used.

classify(data=jacc_kern, y=smoking,coeff = c(0.1,0.2,0.4,0.3) , kernel="matrix")

The use of additional arguments such as C, E, p… remains the same. In the MKL case, the generic kernel hyperparameter H must be either NULL or a list in which each element is the hyperparameter applied to the corresponding kernel function:

h <- list(nasL=0.001,nasR=0.1,oroL=0.01,oroR=0.01)
classify(data=smoker$abund, y=smoking,coeff = c(0.1,0.2,0.4,0.3), C=10, H=h, kernel="crbf")

In the case of k-Cross-Validation:

h <- list(nasL=c(0.01,0.001),nasR=c(0.01,0.001),oroL=0.0001,oroR=0.0001)
classify(data=smoker$abund, y=smoking,coeff = c(0.1,0.2,0.4,0.3), C=c(1,10), H=h, kernel="crbf",k=5)

Side functions for data fusion

KInt() and fuseData() return a fused kernel matrix, but the former uses kernel matrices as input while the latter needs a list with the different data sources.

KInt(jacc_kern)

We can use different kernel functions for each type of data; for example, the Jaccard kernel for the first data type and the compositional linear kernel for the second:

fuseData(DATA=css_data,kernel=c("jac","clin"))

The previous command considers the two sources equally important. If they should not be, we can specify the weights:

fuseData(DATA=css_data,kernel=c("jac","clin"),coeff=c(0.9,0.1))

Usage - Longitudinal data

In this case, we have repeated samples for the same individuals, indexed by time. Take as an example the growth dataset, which follows the height of several girls and boys over time until they turn 18 years old:

library(ggplot2)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:kernlab':
#> 
#>     alpha
target <- rep("Girl",nrow(growth))
target[ grep("boy",rownames(growth))] <- "Boy"
target <- as.factor(target)
growplot <- data.frame(rownames(growth),growth,target=target)
ggplot(growplot, aes(x = age, y = height, group=rownames.growth.,color = target)) + 
  geom_line()+ ggtitle("Growth spurt") + theme_bw()+ theme(legend.title = element_blank())

This kind of data is called longitudinal and is typically represented as functions. The coefficients of a simple polynomial fit can be obtained by least squares with lsq(). For the growth dataset, we fit the growth curves with a polynomial of degree 2:

growth2 <- growth
colnames(growth2) <-  c("time", "height")
growth_coeff <- lsq(data=growth2,degree=2)

The kernel framework allows direct handling of complex data types, such as functions, by using specific kernels. ‘kernInt’ implements the functional linear (kernel="flin") and functional RBF (kernel="frbf") kernels. When they are used, a domain argument giving the interval over which the functions are evaluated (e.g. a time interval) has to be provided. For instance, we use classify() to predict from the growth curve whether an individual is a girl or a boy:

target <- rep("Girl",93)
target[ grep("boy",rownames(growth_coeff$coef))] <- "Boy"
target <- as.factor(target)
cm <- array(0,dim=c(2,2,40))
for(i in 1:40) {
  model <- classify(data=growth_coeff,kernel="frbf",C=c(1,5,10,50), H=0.0001,domain=c(1,18), y=target,k=5)
  cm[,,i] <- model$conf.matrix
}

We may compare this approach to a non-longitudinal prediction that uses only the last time point (age 18):

cm2 <- array(0,dim=c(2,2,40))
for(i in 1:40) {
  model <- classify(data=matrix(growth[which(growth[,1]==18),2]),kernel="rbf",C=c(1,5,10,50), H=0.001, y=target,k=5)
  cm2[,,i] <- model$conf.matrix
}

acc <- matrix(0,nrow=40,ncol=4)
colnames(acc) <- c("Acc.Long","Acc.18y","F1.Long","F1.18y")
for(i in 1:40)  {
  acc[i,1] <- Acc(cm[,,i])
  acc[i,2] <- Acc(cm2[,,i])
  acc[i,3] <- F1(cm[,,i])
  acc[i,4] <- F1(cm2[,,i])
}
boxplot(acc,main="Accuracy and F1")

It can be observed that taking into account all time points delivers much better prediction performance than keeping a single time point, even if that point is the one with maximum separation between the two groups.
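
The average of each metric over the 40 replicates can be checked with:

colMeans(acc)   ## mean accuracy and F1 for the longitudinal and single-point models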

Additional help

A thorough, argument-by-argument documentation is available for each function with:

help(regress) ## or the specific name of the function
?regress

The documentation of the example datasets is available in an analogous way, by typing:

help(soil) ## or the specific name of the example dataset
?soil