On Microbiome Simulators

1 minute read

Published:

Published tools

CAMISIM: simulating metagenomes and microbial communities is a recent 2019 paper that simulates microbial communities and metagenomes. It was originally created to generate simulated datasets for the first CAMI challenge (Critical Assessment of Metagenome Interpretation).

MetaSim is a 2008 paper that describes a method to simulated individual read datasets for planning out sequencing projects and benchmarking metagenomic analysis software.

Other tools to try out

From the Huttenhower Lab, SparseDOSSA is a R based tool that can simulate OTU tables that have correlation structure if you want to do some sort of association testing. A nice feature is that you can “calibrate” your simulations using real data which sets the parameters of the log normal distribution based on your dataset. The examples online show how to use the tool on the command line, but you can also install it to use in RStudio :

install.packages(pkgs="~/Documents/sparsedossa"
                 , repos = NULL, type="source")

The output is in the form of an object where the the ‘OTU_count’ value is a matrix where the first few rows contain the metadata, and the “truth” value contains the spiked taxa that should be associated with your metadata

my_metadata <-matrix(c(rep(1,90),rep(2,90)),nrow=1,ncol=n.samples0)
df_sim = sparseDOSSA::sparseDOSSA( number_features = n.microbes, 
                                     number_samples = n.samples, 
                                     number_metadata = n.metadata_real, 
                                     datasetCount = n.dataset, 
                                     percent_spiked =spike.perc,
                                     seed=3,
                                     association_type = 'linear',
                                     read_depth = 10000,
                                     spikeCount = "1",
                                     spikeStrength = "1.0",
                                     UserMetadata = my_metadata)

otu_table = as.matrix(df_sim$OTU_count)[[1]]  
associations = df_sim$truth                                

A few things to be aware of: the spikeCount and spikeStrength needs to be written in string form.