mushi.composition.ancom

ancom(table, grouping, alpha=0.05, tau=0.02, theta=0.1, multiple_comparisons_correction=None, significance_test=None)[source]

Performs a differential abundance test using ANCOM.

This is done by calculating pairwise log ratios between all features and performing a significance test to determine if there is a significant difference in feature ratios with respect to the variable of interest.
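The pairwise log-ratio idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the names `pairwise_log_ratio_w` and `toy_separation_test` and the toy data are hypothetical, and the stand-in test is not a real statistical test (in practice a function such as scipy.stats.f_oneway would be passed instead). The sketch also leaves out multiple-comparison correction and the tau/theta cutoff logic.

```python
import math
from itertools import combinations


def toy_separation_test(*samples):
    """Hypothetical stand-in, NOT a real statistical test.

    Returns (statistic, p_value) in the same shape as the SciPy tests, but
    the "p-value" is simply 0.01 when the two groups do not overlap at all
    and 1.0 otherwise.
    """
    a, b = samples
    statistic = abs(sum(a) / len(a) - sum(b) / len(b))
    separated = max(a) < min(b) or max(b) < min(a)
    return statistic, (0.01 if separated else 1.0)


def pairwise_log_ratio_w(table, groups, significance_test, alpha=0.05):
    """For each feature, count the other features whose log ratio with it
    differs significantly between groups (an illustrative W-like count)."""
    n_features = len(table[0])
    labels = sorted(set(groups))
    w = [0] * n_features
    for i, j in combinations(range(n_features), 2):
        # Log ratio of feature i to feature j in every sample.
        log_ratios = [math.log(row[i] / row[j]) for row in table]
        # Split the log ratios by group label.
        per_group = [[r for r, g in zip(log_ratios, groups) if g == label]
                     for label in labels]
        _, p_value = significance_test(*per_group)
        if p_value < alpha:
            w[i] += 1
            w[j] += 1
    return w


# Toy data: rows are samples, columns are features; feature 0 roughly
# doubles in group 1 while the other two features stay put.
toy_table = [[10, 10, 10], [11, 10, 11], [20, 10, 10], [22, 10, 11]]
toy_groups = [0, 0, 1, 1]
w_counts = pairwise_log_ratio_w(toy_table, toy_groups, toy_separation_test)
print(w_counts)  # prints [2, 1, 1]
```

Feature 0 takes part in both log ratios that differ between groups, so it receives the highest count; the ratio between features 1 and 2 is unchanged and contributes nothing.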

In an experiment with only two treatments, this tests the following null hypothesis for feature \(i\):

\[H_{0i}: \mathbb{E}[\ln(u_i^{(1)})] = \mathbb{E}[\ln(u_i^{(2)})]\]

where \(u_i^{(1)}\) is the mean abundance for feature \(i\) in the first group and \(u_i^{(2)}\) is the mean abundance for feature \(i\) in the second group.

Parameters:

table (pd.DataFrame) – A 2D matrix of strictly positive values (i.e. counts or proportions) where the rows correspond to samples and the columns correspond to features.
grouping (pd.Series) – Vector indicating the assignment of samples to groups. For example, these could be strings or integers denoting which group a sample belongs to. It must be the same length as the samples in table. The index must be the same on table and grouping but need not be in the same order.
alpha (float, optional) – Significance level for each of the statistical tests. This can be anywhere between 0 and 1 exclusive.
tau (float, optional) – A constant used to determine an appropriate cutoff. A value close to zero indicates a conservative cutoff. This can be anywhere between 0 and 1 exclusive.
theta (float, optional) – Lower bound for the proportion for the W-statistic. If all W-statistics are lower than theta, then no features will be detected as significantly different. This can be anywhere between 0 and 1 exclusive.
multiple_comparisons_correction ({None, 'holm-bonferroni'}, optional) – The multiple comparison correction procedure to run. If None, then no multiple comparison correction procedure will be run. If 'holm-bonferroni' is specified, then the Holm-Bonferroni procedure [1] will be run.
significance_test (function, optional) – A statistical significance function to test for significance between classes. This function must be able to accept at least two 1D array_like arguments of floats and returns a test statistic and a p-value. By default scipy.stats.f_oneway is used.
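For readers unfamiliar with the Holm-Bonferroni procedure named under multiple_comparisons_correction, the following is a minimal sketch of the textbook step-down rule. The function name `holm_bonferroni_sketch` and the example p-values are hypothetical, and how ancom applies the correction internally is not described here, so treat this as an illustration of the general technique only.

```python
def holm_bonferroni_sketch(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected
    under the textbook Holm-Bonferroni step-down rule."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda k: p_values[k])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down: once one test fails, all larger p-values fail too.
            break
    return reject


example_p = [0.01, 0.04, 0.03, 0.005]
decisions = holm_bonferroni_sketch(example_p)
print(decisions)  # prints [True, False, False, True]
```

Here 0.005 and 0.01 pass their thresholds (0.0125 and about 0.0167), while 0.03 exceeds 0.025, so it and the larger 0.04 are not rejected.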

Returns:

A table of features, their W-statistics and whether the null hypothesis is rejected.

"W" is the W-statistic, or the number of features that a single feature is tested to be significantly different against.

"reject" indicates whether the feature is significantly different or not.

Return type:

pd.DataFrame

Notes

The developers of this method recommend the following significance tests ([2], Supplementary File 1, top of page 11): the standard parametric t-test (scipy.stats.ttest_ind) for two groups, or one-way ANOVA (scipy.stats.f_oneway) if the number of groups is greater than 2; or non-parametric variants such as the Wilcoxon rank-sum test (scipy.stats.ranksums; note that scipy.stats.wilcoxon is the signed-rank test for paired samples) for two groups, or Kruskal-Wallis (scipy.stats.kruskal) if the number of groups is greater than 2. Because one-way ANOVA is equivalent to the standard t-test when the number of groups is two, we default to scipy.stats.f_oneway here, which can be used when there are two or more groups. Users should refer to the documentation of these tests in SciPy to understand the assumptions made by each test.

This method cannot handle any zero counts as input, since the logarithm of zero cannot be computed. While this is an unsolved problem, many studies have shown promising results by replacing the zeros with pseudo counts. This can also be done via the multiplicative_replacement method.
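To make the zero-replacement idea concrete, here is a minimal sketch of a multiplicative replacement on a single sample. It is an assumption-laden illustration and not the library's multiplicative_replacement: the function name `replace_zeros_sketch`, its `delta` argument, and the default value chosen for `delta` are all hypothetical.

```python
def replace_zeros_sketch(row, delta=None):
    """Replace zeros in one sample with a small value `delta` and shrink the
    non-zero proportions so that the result still sums to 1."""
    total = sum(row)
    proportions = [x / total for x in row]
    if delta is None:
        # Assumed default for illustration only: (1 / number of features)^2.
        delta = (1.0 / len(proportions)) ** 2
    n_zeros = sum(1 for p in proportions if p == 0)
    # Non-zero entries are scaled down to make room for the inserted deltas.
    scale = 1.0 - n_zeros * delta
    return [delta if p == 0 else p * scale for p in proportions]


filled = replace_zeros_sketch([0, 5, 5], delta=0.01)
print(filled)  # approximately [0.01, 0.495, 0.495]
```

The output is a vector of strictly positive proportions, which is the kind of input the table parameter expects.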

References

Examples

First import all of the necessary modules:

>>> import mushi.composition as cmp
>>> import pandas as pd

Now let’s load in a pd.DataFrame with 6 samples and 7 unknown bacteria:

>>> table = pd.DataFrame([[12, 11, 10, 10, 10, 10, 10],
...                       [9,  11, 12, 10, 10, 10, 10],
...                       [1,  11, 10, 11, 10, 5,  9],
...                       [22, 21, 9,  10, 10, 10, 10],
...                       [20, 22, 10, 10, 13, 10, 10],
...                       [23, 21, 14, 10, 10, 10, 10]],
...                      index=['s1','s2','s3','s4','s5','s6'],
...                      columns=['b1','b2','b3','b4','b5','b6','b7'])

Then create a grouping vector. In this scenario, there are only two classes; suppose these classes correspond to a control and a drug treatment. The first three samples are controls and the last three samples are treatments.

>>> grouping = pd.Series([0, 0, 0, 1, 1, 1],
...                      index=['s1','s2','s3','s4','s5','s6'])

Now run ancom and see if there are any features that have any significant differences between the treatment and the control.

>>> results = cmp.ancom(table, grouping) 
>>> results['W'] 
b1    0
b2    4
b3    1
b4    1
b5    1
b6    0
b7    1
Name: W, dtype: int64

The W-statistic is the number of features that a single feature is tested to be significantly different against. In this scenario, b2 was detected to have significantly different abundances compared to four of the other species. To summarize the results from the W-statistic, let’s take a look at the results from the hypothesis test:

>>> results['reject'] 
b1    False
b2     True
b3    False
b4    False
b5    False
b6    False
b7    False
Name: reject, dtype: bool

From this we can conclude that only b2 was significantly different between the treatment and the control.