mushi.kSFS

class kSFS(X=None, mutation_types=None, file=None, n=None)[source]

Bases: object

Primary class for working with SFS data to infer demography \(\eta(t)\equiv 2N(t)\), or with \(k\)-SFS data to infer demography and mutation spectrum history \(\boldsymbol\mu(t)\).

X

\(k\)-SFS matrix (or 1D SFS vector)

Type

numpy.ndarray

eta

demographic history

Type

mushi.eta

mu

mutation spectrum history

Type

mushi.mu

mutation_types

mutation spectrum history

Type

List[str]

n

number of sampled haplotypes

Type

int

Parameters
  • ksfs_file – path to \(k\)-SFS file, as ouput by mutyper ksfs

  • X (Optional[ndarray]) – \(k\)-SFS matrix

  • mutation_types (Optional[List[str]]) – list of names of X columns

  • n (Optional[int]) – number of sampled haplotypes

Examples

>>> import mushi
>>> import numpy as np

Three constructors:

  1. ksfs_file: path to k-SFS file, as ouput by mutyper ksfs

>>> ksfs = mushi.kSFS(file='ksfs.tsv') 
  1. X and mutation_types (the latter may be ommitted if X is a 1D SFS array)

>>> sfs = mushi.kSFS(X=np.array([10, 5, 3, 1]))
>>> ksfs = mushi.kSFS(X=np.ones((10, 4)),
...                   mutation_types=['AAA>ACA', 'ACA>AAA',
...                                   'TCC>TTC', 'GAA>GGA'])
  1. n: number of haplotypes to initialize for simulation

>>> ksfs = mushi.kSFS(n=100)

Methods

as_df

Return a pandas DataFrame representation

clear_eta

Clear demographic history attribute η

clear_mu

Clear μ attribute

clustermap

Clustermap of compositionally centralized k-SFS

infer_eta

Infer demographic history \(\eta(t)\)

infer_misid

Infer ancestral misidentification rate with \(\eta(t)\) fixed.

infer_mush

Infer mutation spectrum history \(\mu(t)\)

loss

Loss under current history.

plot

Plot the \(k\)-SFS

plot_total

Plot the SFS using matplotlib

set_eta

Set pre-specified demographic history \(\eta(t)\)

simulate

Simulate a \(k\)-SFS under the Poisson random field model (no linkage), assign to X attribute

tmrca_cdf

The CDF of the TMRCA at each change point in eta

Attributes

eta

Read-only alias to η attribute

mu

Read-only alias to μ attribute

property eta: mushi.histories.eta

Read-only alias to η attribute

Return type

eta

property mu: mushi.histories.mu

Read-only alias to μ attribute

Return type

mu

as_df()[source]

Return a pandas DataFrame representation

Return type

DataFrame

clear_eta()[source]

Clear demographic history attribute η

Return type

None

clear_mu()[source]

Clear μ attribute

Return type

None

tmrca_cdf(eta=None)[source]

The CDF of the TMRCA at each change point in eta

Parameters

eta (Optional[eta]) – demographic history (if None, use self.eta attribute)

Return type

ndarray

simulate(eta, mu, r=0, seed=None)[source]

Simulate a \(k\)-SFS under the Poisson random field model (no linkage), assign to X attribute

Parameters
  • eta (eta) – demographic history

  • mu (Union[mu, float64]) – mutation spectrum history (or constant mutation rate)

  • r (float64) – ancestral state misidentification rate (default 0)

  • seed (Optional[int]) – random seed

Examples

Define sample size:

>>> ksfs = mushi.kSFS(n=10)

Define demographic history and MuSH:

>>> eta = mushi.eta(np.array([1, 100, 10000]), np.array([1e4, 1e4, 1e2, 1e4]))
>>> mush = mushi.mu(eta.change_points, np.ones((4, 4)),
...                 ['AAA>ACA', 'ACA>AAA', 'TCC>TTC', 'GAA>GGA'])

Define ancestral misidentification rate:

>>> r = 0.03

Set random seed:

>>> seed = 0

Run simulation and print simulated \(k\)-SFS

>>> ksfs.simulate(eta, mush, r, seed)
>>> ksfs.as_df() 
mutation type     AAA>ACA  ACA>AAA  TCC>TTC  GAA>GGA
sample frequency
1                    1118     1123     1106     1108
2                     147      128      120       98
3                      65       55       66       60
4                      49       52       64       46
5                      44       43       34       36
6                      35       28       36       33
7                      23       32       24       35
8                      34       32       24       24
9                      52       41       57       56
Return type

None

set_eta(eta)[source]

Set pre-specified demographic history \(\eta(t)\)

Parameters

eta (eta) – demographic history object

infer_misid(mu0, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, verbose=False)[source]

Infer ancestral misidentification rate with \(\eta(t)\) fixed. This function is used for fitting with a pre-specified demography, after using set_eta(), instead of inferring \(\eta(t)\) with infer_eta() (which jointly infers misidentification rate).

Parameters
  • mu0 (float64) – total mutation rate (per genome per generation)

  • loss (str) – loss function name from loss_functions module

  • max_iter (int) – maximum number of optimization steps

  • tol (float64) – relative tolerance in objective function (if 0, not used)

  • line_search_kwargs (Dict) – line search keyword arguments, see mushi.optimization.LineSearcher()

  • verbose (bool) – print verbose messages if True

Return type

None

infer_eta(mu0, *trend_penalty, ridge_penalty=0, folded=False, pts=100, ta=None, log_transform=True, eta=None, eta_ref=None, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, trend_kwargs={}, verbose=False)[source]

Infer demographic history \(\eta(t)\)

Parameters
  • mu0 (float64) – total mutation rate (per genome per generation)

  • trend_penalty (Tuple[int, float64]) – tuple (k, λ) for \(k\)-th order trend penalty (can pass multiple for mixed trends)

  • ridge_penalty (float64) – ridge penalty

  • folded (bool) – if False, infer \(\eta(t)\) using unfolded SFS. If True, can only be used with infer_mu=False, and infer \(\eta(t)\) using folded SFS.

  • pts (float64) – number of points for time discretization

  • ta (Optional[float64]) – time (in WF generations ago) of oldest change point in time discretization. If None, set automatically based on 10 * E[TMRCA] under MLE constant demography

  • log_transform (bool) – fit \(\log\eta(t)\)

  • eta (Optional[eta]) – initial demographic history. By default, a constant MLE is computed

  • eta_ref (Optional[eta]) – reference demographic history for ridge penalty. If None, the constant MLE is used

  • loss (str) – loss function name from loss_functions module

  • max_iter (int) – maximum number of optimization steps

  • tol (float64) – relative tolerance in objective function (if 0, not used)

  • line_search_kwargs (Dict) – line search keyword arguments, see mushi.optimization.LineSearcher()

  • trend_kwargs (Dict) – keyword arguments for trend filtering, see mushi.optimization.TrendFilter.run()

  • verbose (bool) – print verbose messages if True

Examples

Suppose ksfs is a kSFS object. Then the following fits a demographic history with 0-th order (piecewise constant) trend penalization of strength 100, assuming a mutation rate of 10 mutations per genome per generation.

>>> mu0 = 1
>>> ksfs.infer_eta(mu0, (0, 1e2))

Alternatively, a mixed trend solution, with constant and cubic pieces, is fit with

>>> ksfs.infer_eta(mu0, (0, 1e2), (3, 1e1))

The attribute ksfs.eta is now set and accessable for plotting (see eta).

Return type

None

infer_mush(*trend_penalty, ridge_penalty=0, rank_penalty=0, hard=False, mu_ref=None, misid_penalty=0.0001, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, trend_kwargs={}, verbose=False)[source]

Infer mutation spectrum history \(\mu(t)\)

Parameters
  • trend_penalty (Tuple[int, float64]) – tuple (k, λ) for \(k\)-th order trend penalty (can pass multiple for mixed trends)

  • ridge_penalty (float64) – ridge penalty

  • rank_penalty (float64) – rank penalty

  • hard (bool) – hard rank penalty (non-convex)

  • mu_ref (Optional[mu]) – reference MuSH for ridge penalty. If None, the constant MLE is used

  • misid_penalty (float64) – ridge parameter to shrink misid rates to aggregate rate

  • loss (str) – loss function from loss_functions module

  • max_iter (int) – maximum number of optimization steps

  • tol (float64) – relative tolerance in objective function (if 0, not used)

  • line_search_kwargs (Dict) – line search keyword arguments, see mushi.optimization.LineSearcher

  • trend_kwargs (Dict) – keyword arguments for trend filtering, see mushi.optimization.TrendFilter.run()

  • verbose (bool) – print verbose messages if True

Examples

Suppose ksfs is a kSFS object, and the demography has already been fit with infer_eta. The following fits a mutation spectrum history with 0-th order (piecewise constant) trend penalization of strength 100.

>>> ksfs.infer_mush((0, 1e2))

Alternatively, a mixed trend solution, with constant and cubic pieces, and with rank penalization 100, is fit with

>>> ksfs.infer_mush((0, 1e2), (3, 1e1), rank_penalty=1e2)

The attribute ksfs.mu is now set and accessable for plotting (see mu).

Return type

None

plot_total(kwargs={'ls': '', 'marker': '.'}, line_kwargs={}, fill_kwargs={}, folded=False)[source]

Plot the SFS using matplotlib

Parameters
  • kwargs (Dict) – keyword arguments for scatter plot

  • line_kwargs (Dict) – keyword arguments for expectation line

  • fill_kwargs (Dict) – keyword arguments for marginal fill

  • folded (bool) – if True, plot the folded SFS and fit

Return type

None

plot(types=None, clr=False, kwargs={'ls': '', 'marker': '.', 'rasterized': True}, line_kwargs={})[source]

Plot the \(k\)-SFS

Parameters
  • types – iterable of mutation type names to restrict plotting to

  • clr (bool) – flag to normalize to total mutation intensity and display as centered log ratio transform

  • kwargs (Dict) – key word arguments passed to data scatter plot

  • line_kwargs (Dict) – key word arguments passed to expectation line plot

Return type

None

clustermap(**kwargs)[source]

Clustermap of compositionally centralized k-SFS

Parameters

kwargs – additional keyword arguments passed to pandas.clustermap

Return type

None

loss(func='prf')[source]

Loss under current history.

Parameters

func – loss function name from loss_functions module

Example

After fitting demography and/or MuSH (with infer_eta and infer_mush), the loss (goodness of fit) may be evaluated as

>>> ksfs.loss()
-31584.27...
Return type

float64