Quickstart
This notebook demonstrates how to use mutyper
for computing mutation
type data for VCF/BCF data
We will access some example data from a directory in the GitHub repo.
Python API demo
import mutyper
Path to an ancestral FASTA file for human chromosome 1
fasta = 'docs/example_data/ancestor.fa'
Print the first 3 lines
with open(fasta) as f:
for _ in range(3):
print(f.readline().rstrip())
>1
TTTGTGGGAgACTATTCCTCCCATCTGCAACAGCTGCCCCTGCTGACTGCCCTTCTCTCCTCCCTCTCATCCCAGAGAAACAgGTCAGCTGGGAGCTTCT
GCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGT
The mutyper
package has a single class Ancestor
. We instantiate
an Ancestor
object using our FASTA file - We’re interested in 3-mer
context, which we can indicate with the k
keyword argument (by
default it will be 3 though).
ancestor = mutyper.Ancestor(fasta, k=3)
We can inspect FASTA record names, showing that the FASTA contains two
records, chromosome '1'
and '2'
.
print(ancestor.keys())
odict_keys(['1', '2'])
and we can Pythonically slice sequences via fast random access without loading into memory (leaning on pyfaidx under the hood):
start = 100
end = 300
print(repr(ancestor['1'][start:end]))
>1:101-300
GCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGgAG
We can also access these FastaRecord
slices as string-like biopython
Seq
objects
print(ancestor['1'][start:end].seq)
GCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGgAG
The mutation_type
method allows us to specify a SNP by the usual
CHROM, POS, REF, ALT, and returns the correctly polarized mutation type
as a tuple of ancestral kmer and derived kmer.
First, consider the triplet context at site 100 on chromosome 1:
print(repr(ancestor['1'][98:101]))
>1:99-101
CTG
Suppose a biallelic SNP at this site segregates in a population with reference allele T and alternative allele A. The mutation type of this SNP is
print(ancestor.mutation_type('1', 99, 'T', 'A'))
('CAG', 'CTG')
Note that, by default, mutation types are collapsed by reverse complementation such that the ancestral state at the target site is A or C. So in the above, the ancestral state T caused the ancestral triplet CTG to be reversed and complemented to CAG.
If the reference and alternative states are both distinct from the ancestral state, the site is not biallelic (an infinite sites violation), and the mutation type is returned as ambiguous.
print(ancestor.mutation_type('1', 99, 'C', 'A'))
(None, None)
The region_contexts
method returns a generator of ancestral contexts
(by default collapsed as described above) over the positions in a region
specified as in BED file format CHROM, START, END
start = 50
end = 100
for context in ancestor.region_contexts('1', start, end):
print(context, end=' ')
CCC CCT AAG GAA TCT GAG TCT GAG TCC CCT GAG TCC CCC CCT GAG TCT GAG TCA CAT GAT TCC CCC CCA CAG TCT GAG TCT GAA AAA AAC ACA None None None GAC TCA CAG GCT GCT CAG CCA CCC TCC GAG GCT GCT AAG GAA TCT CAG
Note that FASTA sites that are not nucleotide characters have
non-identified ancestral state, so context is None
.
This generator method is used by the the targets
method to compute
the masked genomic target size for each \(k\)-mer context from a BED
mask file. This may be useful for normalizing spectra in different
genomic regions, or calibrating mutation rates.
bed = 'docs/example_data/mask.bed'
with open(bed) as f:
for _ in range(3):
print(f.readline().rstrip())
1 25 50
1 150 175
2 50 100
Under the hood it loops over BED file entries and updates a Counter
object for the different triplets
print(ancestor.targets(bed))
Counter({'CAG': 10, 'GCA': 9, 'GCT': 7, 'TCT': 7, 'AAA': 7, 'CAT': 5, 'GAT': 5, 'GCC': 4, 'GAG': 4, 'TAG': 4, 'TAA': 4, 'CCA': 4, 'AAC': 3, 'CCC': 3, 'CCT': 3, 'TCC': 3, 'AAT': 3, 'CAA': 2, 'ACA': 2, 'TCA': 2, 'GAC': 2, 'ACG': 2, 'ACT': 1, 'CAC': 1, 'GAA': 1, 'AAG': 1, 'ACC': 1})
We can also call it without the BED file mask to compute the unmasked target sizes of the ancestral FASTA
print(ancestor.targets())
Counter({'AAA': 6197, 'CAG': 4423, 'TCT': 4319, 'AAT': 4240, 'ACA': 3920, 'GAA': 3861, 'AAG': 3860, 'CCT': 3811, 'CCA': 3764, 'TCA': 3730, 'GAG': 3601, 'TAA': 3527, 'CAT': 3456, 'CAA': 3408, 'TCC': 3317, 'TAT': 3279, 'CAC': 3260, 'CCC': 3234, 'GCA': 3182, 'ACT': 2998, 'GCT': 2972, 'GCC': 2694, 'AAC': 2688, 'ACC': 2378, 'GAT': 2370, 'TAG': 2210, 'GAC': 1998, 'TAC': 1927, 'CCG': 717, 'GCG': 592, 'ACG': 554, 'TCG': 450})
Command line interface demo
Display usage information for the mutyper command
$ mutyper -h
usage: mutyper [-h] {ancestor,variants,targets,spectra,ksfs} ...
mutyper: ancestral kmer mutation types for variant data
options:
-h, --help show this help message and exit
subcommands:
specify one of these
{ancestor,variants,targets,spectra,ksfs}
additional help available for each subcommand
variants
subcommand
Usage:
$ mutyper variants -h
usage: mutyper variants [-h] [--verbose] [--k K] [--target TARGET] [--sep SEP]
[--chrom_pos CHROM_POS] [--strand_file STRAND_FILE]
[--strict]
fasta vcf
adds mutation_type to VCF/BCF INFO, polarizes REF/ALT/AC according to
ancestral/derived states, and stream to stdout
positional arguments:
fasta path to ancestral FASTA
vcf VCF/BCF file, usually for a single chromosome ("-" for
stdin)
options:
-h, --help show this help message and exit
--verbose increase logging verbosity
--k K k-mer context size (default 3)
--target TARGET 0-based mutation target position in kmer (default
middle)
--sep SEP field delimiter in FASTA headers (default ":")
--chrom_pos CHROM_POS
0-based chromosome field position in FASTA headers
(default 0)
--strand_file STRAND_FILE
path to bed file with regions where reverse strand
defines mutation context, e.g. direction of
replication or transcription (default collapse reverse
complements)
--strict only uppercase nucleotides in FASTA considered
ancestrally identified
Path to FASTA and a truncated VCF file for chromosome 1 from the 1000 Genomes Project.
$ fasta = 'docs/example_data/ancestral.fa'
$ vcf = 'docs/example_data/snps.vcf'
We’ll use bcftools to display the first few variants, omitting header and genotype information.
$ bcftools view -HG $vcf | head -5
1 10177 . A AC 100 PASS AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE)
1 10235 . T TA 100 PASS AC=6;AF=0.00119808;AN=5008;NS=2504;DP=78015;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0051;AA=|||unknown(NO_COVERAGE)
1 10352 rs145072688 T TA 100 PASS AC=2191;AF=0.4375;AN=5008;NS=2504;DP=88915;EAS_AF=0.4306;AMR_AF=0.4107;AFR_AF=0.4788;EUR_AF=0.4264;SAS_AF=0.4192;AA=|||unknown(NO_COVERAGE)
1 10505 . A T 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||
1 10506 . C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||
Now we use the variants
subcommand, and pipe to bcftools
for
display. Notice there is an additional INFO field mutation_type
, and
only biallelic SNPs are output. Also note that REF/ALT state and INFO/AC
fields are polarized according to ancestral/derived allele.
$ mutyper variants $fasta $vcf | bcftools view -HG | head -5
1 10505 . A T 100 PASS AC=5007;AF=0.9998;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=GAG>GTG
1 10506 . C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=TCT>TGT
1 10542 . A G 100 PASS AC=5007;AF=0.9998;AN=5008;NS=2504;DP=9007;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=CAG>CGG
1 10642 . A G 100 PASS AC=4987;AF=0.995807;AN=5008;NS=2504;DP=1360;EAS_AF=0.003;AMR_AF=0.0014;AFR_AF=0.0129;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=CAT>CGT
1 11008 . C G 100 PASS AC=441;AF=0.0880591;AN=5008;NS=2504;DP=2232;EAS_AF=0.0367;AMR_AF=0.0965;AFR_AF=0.1346;EUR_AF=0.0885;SAS_AF=0.0716;AA=.|||;mutation_type=CCT>CGT
targets
subcommand
Usage:
$ mutyper targets -h
usage: mutyper targets [-h] [--verbose] [--k K] [--target TARGET] [--sep SEP]
[--chrom_pos CHROM_POS] [--strand_file STRAND_FILE]
[--strict] [--bed BED]
fasta
compute 𝑘-mer target sizes and stream to stdout
positional arguments:
fasta path to ancestral FASTA
options:
-h, --help show this help message and exit
--verbose increase logging verbosity
--k K k-mer context size (default 3)
--target TARGET 0-based mutation target position in kmer (default
middle)
--sep SEP field delimiter in FASTA headers (default ":")
--chrom_pos CHROM_POS
0-based chromosome field position in FASTA headers
(default 0)
--strand_file STRAND_FILE
path to bed file with regions where reverse strand
defines mutation context, e.g. direction of
replication or transcription (default collapse reverse
complements)
--strict only uppercase nucleotides in FASTA considered
ancestrally identified
--bed BED path to BED file mask ("-" for stdin)
$ mutyper targets $fasta | head
AAA 6331
AAC 2775
AAG 3947
AAT 4335
ACA 4066
ACC 2440
ACG 632
ACT 3082
CAA 3506
CAC 3391
The optional argument --bed
resticts target size computation based
on a BED mask file, and is useful of normalizing mutation spectra
according to genomic region, or calibrating mutaiton rates.
$ mutyper targets $fasta --bed $bed | head
AAA 7
AAC 3
AAG 1
AAT 3
ACA 2
ACC 1
ACG 2
ACT 1
CAA 2
CAC 1
spectra
subcommand
Usage:
$ mutyper spectra -h
usage: mutyper spectra [-h] [--verbose] [--population] [--randomize] vcf
compute mutation spectra for each sample in VCF/BCF with mutation_type data
and stream to stdout
positional arguments:
vcf VCF/BCF file, usually for a single chromosome ("-" for stdin)
options:
-h, --help show this help message and exit
--verbose increase logging verbosity
--population population-wise spectrum, instead of individual
--randomize randomly assign mutation to a single haplotype
We first add mutation type information with the variants
subcommand,
then pipe into the spectra
subcommand to generate mutation spectra
for all samples, and finally crop the output with head
and cut
for display.
$ mutyper variants $fasta $vcf | mutyper spectra - | head | cut -f-5
sample AAA>AGA AAT>AGT ACA>AAA ACA>AGA
HG00096 2 2 2 2
HG00097 2 2 2 2
HG00099 2 2 2 2
HG00100 2 2 2 2
HG00101 2 2 2 2
HG00102 2 2 2 2
HG00103 2 2 2 2
HG00105 2 2 2 2
HG00106 2 2 2 2
Use the --population
to get the spectrum for whole population,
rather than each individual
$ mutyper variants $fasta $vcf | mutyper spectra - --population | head | cut -f-5
AAA>AGA AAT>AGT ACA>AAA ACA>AGA ACA>ATA
2 1 1 2 3
ksfs
subcommand
Usage:
$ mutyper ksfs -h
usage: mutyper ksfs [-h] [--verbose] vcf
compute sample frequency spectrum for each mutation type from a VCF/BCF file
with mutation_type data (i.e. output from variants subcommand ) and stream to
stdout
positional arguments:
vcf VCF/BCF file, usually for a single chromosome ("-" for stdin)
options:
-h, --help show this help message and exit
--verbose increase logging verbosity
Similar to the previous subcommand, but now we are generating a sample frequency spectrum (SFS) for each mutation type.
$ mutyper variants $fasta $vcf | mutyper ksfs - | head | cut -f-5
sample_frequency AAA>AGA AAT>AGT ACA>AAA ACA>AGA
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 0 0 0 0