mushi.kSFS

class kSFS(X=None, mutation_types=None, file=None, n=None)[source]

Bases: object

Primary class for working with SFS data to infer demography \(\eta(t)\equiv 2N(t)\), or with \(k\)-SFS data to infer demography and mutation spectrum history \(\boldsymbol\mu(t)\).

X

\(k\)-SFS matrix (or 1D SFS vector)

Type:: numpy.ndarray

eta

demographic history

Type:: mushi.eta

mu

mutation spectrum history

Type:: mushi.mu

mutation_types

mutation spectrum history

Type:: List[str]

n

number of sampled haplotypes

Type:: int

Parameters:

ksfs_file – path to \(k\)-SFS file, as ouput by mutyper ksfs
X (Array) – \(k\)-SFS matrix
mutation_types (List[str]) – list of names of X columns
n (int) – number of sampled haplotypes

Examples

>>> import mushi
>>> import numpy as np

Three constructors:

ksfs_file: path to k-SFS file, as ouput by mutyper ksfs

>>> ksfs = mushi.kSFS(file='ksfs.tsv') 

X and mutation_types (the latter may be ommitted if X is a 1D SFS array)

>>> sfs = mushi.kSFS(X=np.array([10, 5, 3, 1]))

>>> ksfs = mushi.kSFS(X=np.ones((10, 4)),
...                   mutation_types=['AAA>ACA', 'ACA>AAA',
...                                   'TCC>TTC', 'GAA>GGA'])

n: number of haplotypes to initialize for simulation

>>> ksfs = mushi.kSFS(n=100)

Methods

`as_df`	Return a pandas DataFrame representation.
`clear_eta`	Clear demographic history attribute η.
`clear_mu`	Clear μ attribute.
`clustermap`	Clustermap of compositionally centralized k-SFS.
`infer_eta`	Infer demographic history \(\eta(t)\)
`infer_misid`	Infer ancestral misidentification rate with \(\eta(t)\) fixed.
`infer_mush`	Infer mutation spectrum history \(\mu(t)\)
`loss`	Loss under current history.
`plot`	Plot the \(k\)-SFS.
`plot_total`	Plot the SFS using matplotlib.
`set_eta`	Set pre-specified demographic history \(\eta(t)\)
`simulate`	Simulate a \(k\)-SFS under the Poisson random field model (no linkage), assign to `X` attribute.
`tmrca_cdf`	The CDF of the TMRCA at each change point in `eta`

Attributes

`eta`	Read-only alias to η attribute.
`mu`	Read-only alias to μ attribute.

property eta: eta: Read-only alias to η attribute.

property mu: mu: Read-only alias to μ attribute.

as_df()[source]

Return a pandas DataFrame representation.

Return type:: DataFrame

clear_eta()[source]

Clear demographic history attribute η.

Return type:: None

clear_mu()[source]

Clear μ attribute.

Return type:: None

tmrca_cdf(eta=None)[source]

The CDF of the TMRCA at each change point in eta

Parameters:: eta (eta) – demographic history (if None, use self.eta attribute)
Return type:: ndarray

simulate(eta, mu, r=0, seed=None)[source]

Simulate a \(k\)-SFS under the Poisson random field model (no linkage), assign to X attribute.

Parameters:

eta (eta) – demographic history
mu (Union[mu, float64]) – mutation spectrum history (or constant mutation rate)
r (float64) – ancestral state misidentification rate (default 0)
seed (int) – random seed

Return type:

None

Examples

Define sample size:

>>> ksfs = mushi.kSFS(n=10)

Define demographic history and MuSH:

>>> eta = mushi.eta(np.array([1, 100, 10000]), np.array([1e4, 1e4, 1e2, 1e4]))
>>> mush = mushi.mu(eta.change_points, np.ones((4, 4)),
...                 ['AAA>ACA', 'ACA>AAA', 'TCC>TTC', 'GAA>GGA'])

Define ancestral misidentification rate:

>>> r = 0.03

Set random seed:

>>> seed = 0

Run simulation and print simulated \(k\)-SFS

>>> ksfs.simulate(eta, mush, r, seed)

>>> ksfs.as_df() 
mutation type     AAA>ACA  ACA>AAA  TCC>TTC  GAA>GGA
sample frequency
                  1118     1123     1106     1108
                   147      128      120       98
                    65       55       66       60
                    49       52       64       46
                    44       43       34       36
                    35       28       36       33
                    23       32       24       35
                    34       32       24       24
                    52       41       57       56

set_eta(eta)[source]

Set pre-specified demographic history \(\eta(t)\)

Parameters:: eta (eta) – demographic history object

infer_misid(mu0, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, verbose=False)[source]

Infer ancestral misidentification rate with \(\eta(t)\) fixed. This function is used for fitting with a pre-specified demography, after using set_eta(), instead of inferring \(\eta(t)\) with infer_eta() (which jointly infers misidentification rate).

Parameters:

mu0 (float64) – total mutation rate (per genome per generation)
loss (str) – loss function name from loss_functions module
max_iter (int) – maximum number of optimization steps
tol (float64) – relative tolerance in objective function (if 0, not used)
line_search_kwargs (Dict) – line search keyword arguments, see mushi.optimization.LineSearcher()
verbose (bool) – print verbose messages if True

Return type:

None

infer_eta(mu0, *trend_penalty, ridge_penalty=0, folded=False, pts=100, ta=None, log_transform=True, eta=None, eta_ref=None, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, trend_kwargs={}, verbose=False)[source]

Infer demographic history \(\eta(t)\)

Parameters:

mu0 (float64) – total mutation rate (per genome per generation)
trend_penalty (Tuple[int, float64]) – tuple (k, λ) for \(k\)-th order trend penalty (can pass multiple for mixed trends)
ridge_penalty (float64) – ridge penalty
folded (bool) – if False, infer \(\eta(t)\) using unfolded SFS. If True, can only be used with infer_mu=False, and infer \(\eta(t)\) using folded SFS.
pts (float64) – number of points for time discretization
ta (float64) – time (in WF generations ago) of oldest change point in time discretization. If None, set automatically based on 10 * E[TMRCA] under MLE constant demography
log_transform (bool) – fit \(\log\eta(t)\)
eta (eta) – initial demographic history. By default, a constant MLE is computed
eta_ref (eta) – reference demographic history for ridge penalty. If None, the constant MLE is used
loss (str) – loss function name from loss_functions module
max_iter (int) – maximum number of optimization steps
tol (float64) – relative tolerance in objective function (if 0, not used)
line_search_kwargs (Dict) – line search keyword arguments, see mushi.optimization.LineSearcher()
trend_kwargs (Dict) – keyword arguments for trend filtering, see mushi.optimization.TrendFilter.run()
verbose (bool) – print verbose messages if True

Return type:

None

Examples

Suppose ksfs is a kSFS object. Then the following fits a demographic history with 0-th order (piecewise constant) trend penalization of strength 100, assuming a mutation rate of 10 mutations per genome per generation.

>>> mu0 = 1
>>> ksfs.infer_eta(mu0, (0, 1e2))

Alternatively, a mixed trend solution, with constant and cubic pieces, is fit with

>>> ksfs.infer_eta(mu0, (0, 1e2), (3, 1e1))

The attribute ksfs.eta is now set and accessable for plotting (see eta).

infer_mush(*trend_penalty, ridge_penalty=0, rank_penalty=0, hard=False, mu_ref=None, misid_penalty=0.0001, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, trend_kwargs={}, verbose=False)[source]

Infer mutation spectrum history \(\mu(t)\)

Parameters:

trend_penalty (Tuple[int, float64]) – tuple (k, λ) for \(k\)-th order trend penalty (can pass multiple for mixed trends)
ridge_penalty (float64) – ridge penalty
rank_penalty (float64) – rank penalty
hard (bool) – hard rank penalty (non-convex)
mu_ref (mu) – reference MuSH for ridge penalty. If None, the constant MLE is used
misid_penalty (float64) – ridge parameter to shrink misid rates to aggregate rate
loss (str) – loss function from loss_functions module
max_iter (int) – maximum number of optimization steps
tol (float64) – relative tolerance in objective function (if 0, not used)
line_search_kwargs (Dict) – line search keyword arguments, see mushi.optimization.LineSearcher
trend_kwargs (Dict) – keyword arguments for trend filtering, see mushi.optimization.TrendFilter.run()
verbose (bool) – print verbose messages if True

Return type:

None

Examples

Suppose ksfs is a kSFS object, and the demography has already been fit with infer_eta. The following fits a mutation spectrum history with 0-th order (piecewise constant) trend penalization of strength 100.

>>> ksfs.infer_mush((0, 1e2))

Alternatively, a mixed trend solution, with constant and cubic pieces, and with rank penalization 100, is fit with

>>> ksfs.infer_mush((0, 1e2), (3, 1e1), rank_penalty=1e2)

The attribute ksfs.mu is now set and accessable for plotting (see mu).

plot_total(kwargs={'ls': '', 'marker': '.'}, line_kwargs={}, fill_kwargs={}, folded=False)[source]

Plot the SFS using matplotlib.

Parameters:

kwargs (Dict) – keyword arguments for scatter plot
line_kwargs (Dict) – keyword arguments for expectation line
fill_kwargs (Dict) – keyword arguments for marginal fill
folded (bool) – if True, plot the folded SFS and fit

Return type:

None

plot(types=None, clr=False, kwargs={'ls': '', 'marker': '.', 'rasterized': True}, line_kwargs={})[source]

Plot the \(k\)-SFS.

Parameters:

types – iterable of mutation type names to restrict plotting to
clr (bool) – flag to normalize to total mutation intensity and display as centered log ratio transform
kwargs (Dict) – key word arguments passed to data scatter plot
line_kwargs (Dict) – key word arguments passed to expectation line plot

Return type:

None

clustermap(**kwargs)[source]

Clustermap of compositionally centralized k-SFS.

Parameters:: kwargs – additional keyword arguments passed to pandas.clustermap
Return type:: None

loss(func='prf')[source]

Loss under current history.

Parameters:: func – loss function name from loss_functions module
Return type:: float64

Example

After fitting demography and/or MuSH (with infer_eta and infer_mush), the loss (goodness of fit) may be evaluated as

>>> ksfs.loss()
np.float64(-31584.27...)