mushi.kSFS
- class kSFS(X=None, mutation_types=None, file=None, n=None)[source]
Bases:
object
Primary class for working with SFS data to infer demography \(\eta(t)\equiv 2N(t)\), or with \(k\)-SFS data to infer demography and mutation spectrum history \(\boldsymbol\mu(t)\).
- X
\(k\)-SFS matrix (or 1D SFS vector)
- Type
- mutation_types
mutation spectrum history
- Type
List[str]
- Parameters
Examples
>>> import mushi >>> import numpy as np
Three constructors:
ksfs_file
: path to k-SFS file, as ouput by mutyper ksfs
>>> ksfs = mushi.kSFS(file='ksfs.tsv')
X
andmutation_types
(the latter may be ommitted ifX
is a 1D SFS array)
>>> sfs = mushi.kSFS(X=np.array([10, 5, 3, 1]))
>>> ksfs = mushi.kSFS(X=np.ones((10, 4)), ... mutation_types=['AAA>ACA', 'ACA>AAA', ... 'TCC>TTC', 'GAA>GGA'])
n
: number of haplotypes to initialize for simulation
>>> ksfs = mushi.kSFS(n=100)
Methods
Return a pandas DataFrame representation
Clear demographic history attribute η
Clear μ attribute
Clustermap of compositionally centralized k-SFS
Infer demographic history \(\eta(t)\)
Infer ancestral misidentification rate with \(\eta(t)\) fixed.
Infer mutation spectrum history \(\mu(t)\)
Loss under current history.
Plot the \(k\)-SFS
Plot the SFS using matplotlib
Set pre-specified demographic history \(\eta(t)\)
Simulate a \(k\)-SFS under the Poisson random field model (no linkage), assign to
X
attributeThe CDF of the TMRCA at each change point in
eta
Attributes
Read-only alias to η attribute
Read-only alias to μ attribute
- property eta: mushi.histories.eta
Read-only alias to η attribute
- Return type
- property mu: mushi.histories.mu
Read-only alias to μ attribute
- Return type
- simulate(eta, mu, r=0, seed=None)[source]
Simulate a \(k\)-SFS under the Poisson random field model (no linkage), assign to
X
attribute- Parameters
Examples
Define sample size:
>>> ksfs = mushi.kSFS(n=10)
Define demographic history and MuSH:
>>> eta = mushi.eta(np.array([1, 100, 10000]), np.array([1e4, 1e4, 1e2, 1e4])) >>> mush = mushi.mu(eta.change_points, np.ones((4, 4)), ... ['AAA>ACA', 'ACA>AAA', 'TCC>TTC', 'GAA>GGA'])
Define ancestral misidentification rate:
>>> r = 0.03
Set random seed:
>>> seed = 0
Run simulation and print simulated \(k\)-SFS
>>> ksfs.simulate(eta, mush, r, seed)
>>> ksfs.as_df() mutation type AAA>ACA ACA>AAA TCC>TTC GAA>GGA sample frequency 1 1118 1123 1106 1108 2 147 128 120 98 3 65 55 66 60 4 49 52 64 46 5 44 43 34 36 6 35 28 36 33 7 23 32 24 35 8 34 32 24 24 9 52 41 57 56
- Return type
- set_eta(eta)[source]
Set pre-specified demographic history \(\eta(t)\)
- Parameters
eta (
eta
) – demographic history object
- infer_misid(mu0, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, verbose=False)[source]
Infer ancestral misidentification rate with \(\eta(t)\) fixed. This function is used for fitting with a pre-specified demography, after using
set_eta()
, instead of inferring \(\eta(t)\) withinfer_eta()
(which jointly infers misidentification rate).- Parameters
mu0 (
float64
) – total mutation rate (per genome per generation)loss (
str
) – loss function name fromloss_functions
modulemax_iter (
int
) – maximum number of optimization stepstol (
float64
) – relative tolerance in objective function (if0
, not used)line_search_kwargs (
Dict
) – line search keyword arguments, seemushi.optimization.LineSearcher()
verbose (
bool
) – print verbose messages ifTrue
- Return type
- infer_eta(mu0, *trend_penalty, ridge_penalty=0, folded=False, pts=100, ta=None, log_transform=True, eta=None, eta_ref=None, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, trend_kwargs={}, verbose=False)[source]
Infer demographic history \(\eta(t)\)
- Parameters
mu0 (
float64
) – total mutation rate (per genome per generation)trend_penalty (
Tuple
[int
,float64
]) – tuple(k, λ)
for \(k\)-th order trend penalty (can pass multiple for mixed trends)ridge_penalty (
float64
) – ridge penaltyfolded (
bool
) – ifFalse
, infer \(\eta(t)\) using unfolded SFS. IfTrue
, can only be used withinfer_mu=False
, and infer \(\eta(t)\) using folded SFS.pts (
float64
) – number of points for time discretizationta (
Optional
[float64
]) – time (in WF generations ago) of oldest change point in time discretization. IfNone
, set automatically based on 10 * E[TMRCA] under MLE constant demographylog_transform (
bool
) – fit \(\log\eta(t)\)eta (
Optional
[eta
]) – initial demographic history. By default, a constant MLE is computedeta_ref (
Optional
[eta
]) – reference demographic history for ridge penalty. IfNone
, the constant MLE is usedloss (
str
) – loss function name fromloss_functions
modulemax_iter (
int
) – maximum number of optimization stepstol (
float64
) – relative tolerance in objective function (if0
, not used)line_search_kwargs (
Dict
) – line search keyword arguments, seemushi.optimization.LineSearcher()
trend_kwargs (
Dict
) – keyword arguments for trend filtering, seemushi.optimization.TrendFilter.run()
verbose (
bool
) – print verbose messages ifTrue
Examples
Suppose
ksfs
is akSFS
object. Then the following fits a demographic history with 0-th order (piecewise constant) trend penalization of strength 100, assuming a mutation rate of 10 mutations per genome per generation.>>> mu0 = 1 >>> ksfs.infer_eta(mu0, (0, 1e2))
Alternatively, a mixed trend solution, with constant and cubic pieces, is fit with
>>> ksfs.infer_eta(mu0, (0, 1e2), (3, 1e1))
The attribute
ksfs.eta
is now set and accessable for plotting (seeeta
).- Return type
- infer_mush(*trend_penalty, ridge_penalty=0, rank_penalty=0, hard=False, mu_ref=None, misid_penalty=0.0001, loss='prf', max_iter=100, tol=0, line_search_kwargs={}, trend_kwargs={}, verbose=False)[source]
Infer mutation spectrum history \(\mu(t)\)
- Parameters
trend_penalty (
Tuple
[int
,float64
]) – tuple(k, λ)
for \(k\)-th order trend penalty (can pass multiple for mixed trends)ridge_penalty (
float64
) – ridge penaltyrank_penalty (
float64
) – rank penaltyhard (
bool
) – hard rank penalty (non-convex)mu_ref (
Optional
[mu
]) – reference MuSH for ridge penalty. If None, the constant MLE is usedmisid_penalty (
float64
) – ridge parameter to shrink misid rates to aggregate rateloss (
str
) – loss function fromloss_functions
modulemax_iter (
int
) – maximum number of optimization stepstol (
float64
) – relative tolerance in objective function (if0
, not used)line_search_kwargs (
Dict
) – line search keyword arguments, seemushi.optimization.LineSearcher
trend_kwargs (
Dict
) – keyword arguments for trend filtering, seemushi.optimization.TrendFilter.run()
verbose (
bool
) – print verbose messages ifTrue
Examples
Suppose
ksfs
is akSFS
object, and the demography has already been fit withinfer_eta
. The following fits a mutation spectrum history with 0-th order (piecewise constant) trend penalization of strength 100.>>> ksfs.infer_mush((0, 1e2))
Alternatively, a mixed trend solution, with constant and cubic pieces, and with rank penalization 100, is fit with
>>> ksfs.infer_mush((0, 1e2), (3, 1e1), rank_penalty=1e2)
The attribute
ksfs.mu
is now set and accessable for plotting (seemu
).- Return type
- plot_total(kwargs={'ls': '', 'marker': '.'}, line_kwargs={}, fill_kwargs={}, folded=False)[source]
Plot the SFS using matplotlib
- plot(types=None, clr=False, kwargs={'ls': '', 'marker': '.', 'rasterized': True}, line_kwargs={})[source]
Plot the \(k\)-SFS
- Parameters
types – iterable of mutation type names to restrict plotting to
clr (
bool
) – flag to normalize to total mutation intensity and display as centered log ratio transformkwargs (
Dict
) – key word arguments passed to data scatter plotline_kwargs (
Dict
) – key word arguments passed to expectation line plot
- Return type
- clustermap(**kwargs)[source]
Clustermap of compositionally centralized k-SFS
- Parameters
kwargs – additional keyword arguments passed to pandas.clustermap
- Return type
- loss(func='prf')[source]
Loss under current history.
- Parameters
func – loss function name from
loss_functions
module
Example
After fitting demography and/or MuSH (with
infer_eta
andinfer_mush
), the loss (goodness of fit) may be evaluated as>>> ksfs.loss() -31584.27...
- Return type
float64