Quickstart ========== This notebook demonstrates how to use ``mutyper`` for computing mutation type data for VCF/BCF data We will access some `example data `_ from a directory in the GitHub repo. Python API demo --------------- .. testsetup:: # we move up to the root of the repo, so that example data paths in rendered docs # can be given wrt the root import os os.chdir('..') .. testcode:: import mutyper Path to an ancestral FASTA file for human chromosome 1 .. testcode:: fasta = 'docs/example_data/ancestor.fa' Print the first 3 lines .. testcode:: with open(fasta) as f: for _ in range(3): print(f.readline().rstrip()) .. testoutput:: >1 TTTGTGGGAgACTATTCCTCCCATCTGCAACAGCTGCCCCTGCTGACTGCCCTTCTCTCCTCCCTCTCATCCCAGAGAAACAgGTCAGCTGGGAGCTTCT GCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGT The ``mutyper`` package has a single class ``Ancestor``. We instantiate an ``Ancestor`` object using our FASTA file - We’re interested in 3-mer context, which we can indicate with the ``k`` keyword argument (by default it will be 3 though). .. testcode:: ancestor = mutyper.Ancestor(fasta, k=3) We can inspect FASTA record names, showing that the FASTA contains two records, chromosome ``'1'`` and ``'2'``. .. testcode:: print(ancestor.keys()) .. testoutput:: odict_keys(['1', '2']) and we can Pythonically slice sequences via fast random access without loading into memory (leaning on `pyfaidx `__ under the hood): .. testcode:: start = 100 end = 300 print(repr(ancestor['1'][start:end])) .. testoutput:: >1:101-300 GCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGgAG We can also access these ``FastaRecord`` slices as string-like biopython ``Seq`` objects .. testcode:: print(ancestor['1'][start:end].seq) .. testoutput:: GCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGgAG The ``mutation_type`` method allows us to specify a SNP by the usual CHROM, POS, REF, ALT, and returns the correctly polarized mutation type as a tuple of ancestral kmer and derived kmer. First, consider the triplet context at site 100 on chromosome 1: .. testcode:: print(repr(ancestor['1'][98:101])) .. testoutput:: >1:99-101 CTG Suppose a biallelic SNP at this site segregates in a population with reference allele T and alternative allele A. The mutation type of this SNP is .. testcode:: print(ancestor.mutation_type('1', 99, 'T', 'A')) .. testoutput:: ('CAG', 'CTG') Note that, by default, mutation types are collapsed by reverse complementation such that the ancestral state at the target site is A or C. So in the above, the ancestral state T caused the ancestral triplet CTG to be reversed and complemented to CAG. If the reference and alternative states are both distinct from the ancestral state, the site is not biallelic (an infinite sites violation), and the mutation type is returned as ambiguous. .. testcode:: print(ancestor.mutation_type('1', 99, 'C', 'A')) .. testoutput:: (None, None) The ``region_contexts`` method returns a generator of ancestral contexts (by default collapsed as described above) over the positions in a region specified as in BED file format CHROM, START, END .. testcode:: start = 50 end = 100 for context in ancestor.region_contexts('1', start, end): print(context, end=' ') .. testoutput:: :options: +NORMALIZE_WHITESPACE CCC CCT AAG GAA TCT GAG TCT GAG TCC CCT GAG TCC CCC CCT GAG TCT GAG TCA CAT GAT TCC CCC CCA CAG TCT GAG TCT GAA AAA AAC ACA None None None GAC TCA CAG GCT GCT CAG CCA CCC TCC GAG GCT GCT AAG GAA TCT CAG Note that FASTA sites that are not nucleotide characters have non-identified ancestral state, so context is ``None``. This generator method is used by the the ``targets`` method to compute the masked genomic target size for each :math:`k`-mer context from a BED mask file. This may be useful for normalizing spectra in different genomic regions, or calibrating mutation rates. .. testcode:: bed = 'docs/example_data/mask.bed' .. testcode:: with open(bed) as f: for _ in range(3): print(f.readline().rstrip()) .. testoutput:: :options: +NORMALIZE_WHITESPACE 1 25 50 1 150 175 2 50 100 Under the hood it loops over BED file entries and updates a ``Counter`` object for the different triplets .. testcode:: print(ancestor.targets(bed)) .. testoutput:: Counter({'CAG': 10, 'GCA': 9, 'GCT': 7, 'TCT': 7, 'AAA': 7, 'CAT': 5, 'GAT': 5, 'GCC': 4, 'GAG': 4, 'TAG': 4, 'TAA': 4, 'CCA': 4, 'AAC': 3, 'CCC': 3, 'CCT': 3, 'TCC': 3, 'AAT': 3, 'CAA': 2, 'ACA': 2, 'TCA': 2, 'GAC': 2, 'ACG': 2, 'ACT': 1, 'CAC': 1, 'GAA': 1, 'AAG': 1, 'ACC': 1}) We can also call it without the BED file mask to compute the unmasked target sizes of the ancestral FASTA .. testcode:: print(ancestor.targets()) .. testoutput:: Counter({'AAA': 6197, 'CAG': 4423, 'TCT': 4319, 'AAT': 4240, 'ACA': 3920, 'GAA': 3861, 'AAG': 3860, 'CCT': 3811, 'CCA': 3764, 'TCA': 3730, 'GAG': 3601, 'TAA': 3527, 'CAT': 3456, 'CAA': 3408, 'TCC': 3317, 'TAT': 3279, 'CAC': 3260, 'CCC': 3234, 'GCA': 3182, 'ACT': 2998, 'GCT': 2972, 'GCC': 2694, 'AAC': 2688, 'ACC': 2378, 'GAT': 2370, 'TAG': 2210, 'GAC': 1998, 'TAC': 1927, 'CCG': 717, 'GCG': 592, 'ACG': 554, 'TCG': 450}) Command line interface demo --------------------------- Display usage information for the mutyper command .. code:: console $ mutyper -h .. parsed-literal:: usage: mutyper [-h] {ancestor,variants,targets,spectra,ksfs} ... mutyper: ancestral kmer mutation types for variant data options: -h, --help show this help message and exit subcommands: specify one of these {ancestor,variants,targets,spectra,ksfs} additional help available for each subcommand ``variants`` subcommand ~~~~~~~~~~~~~~~~~~~~~~~ Usage: .. code:: console $ mutyper variants -h .. parsed-literal:: usage: mutyper variants [-h] [--verbose] [--k K] [--target TARGET] [--sep SEP] [--chrom_pos CHROM_POS] [--strand_file STRAND_FILE] [--strict] fasta vcf adds mutation_type to VCF/BCF INFO, polarizes REF/ALT/AC according to ancestral/derived states, and stream to stdout positional arguments: fasta path to ancestral FASTA vcf VCF/BCF file, usually for a single chromosome ("-" for stdin) options: -h, --help show this help message and exit --verbose increase logging verbosity --k K k-mer context size (default 3) --target TARGET 0-based mutation target position in kmer (default middle) --sep SEP field delimiter in FASTA headers (default ":") --chrom_pos CHROM_POS 0-based chromosome field position in FASTA headers (default 0) --strand_file STRAND_FILE path to bed file with regions where reverse strand defines mutation context, e.g. direction of replication or transcription (default collapse reverse complements) --strict only uppercase nucleotides in FASTA considered ancestrally identified Path to FASTA and a truncated VCF file for chromosome 1 from the 1000 Genomes Project. .. code:: console $ fasta = 'docs/example_data/ancestral.fa' $ vcf = 'docs/example_data/snps.vcf' We'll use `bcftools `__ to display the first few variants, omitting header and genotype information. .. code:: console $ bcftools view -HG $vcf | head -5 .. parsed-literal:: 1 10177 . A AC 100 PASS AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE) 1 10235 . T TA 100 PASS AC=6;AF=0.00119808;AN=5008;NS=2504;DP=78015;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0051;AA=|||unknown(NO_COVERAGE) 1 10352 rs145072688 T TA 100 PASS AC=2191;AF=0.4375;AN=5008;NS=2504;DP=88915;EAS_AF=0.4306;AMR_AF=0.4107;AFR_AF=0.4788;EUR_AF=0.4264;SAS_AF=0.4192;AA=|||unknown(NO_COVERAGE) 1 10505 . A T 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.||| 1 10506 . C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.||| Now we use the ``variants`` subcommand, and pipe to ``bcftools`` for display. Notice there is an additional INFO field ``mutation_type``, and only biallelic SNPs are output. Also note that REF/ALT state and INFO/AC fields are polarized according to ancestral/derived allele. .. code:: console $ mutyper variants $fasta $vcf | bcftools view -HG | head -5 .. parsed-literal:: 1 10505 . A T 100 PASS AC=5007;AF=0.9998;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=GAG>GTG 1 10506 . C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=TCT>TGT 1 10542 . A G 100 PASS AC=5007;AF=0.9998;AN=5008;NS=2504;DP=9007;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=CAG>CGG 1 10642 . A G 100 PASS AC=4987;AF=0.995807;AN=5008;NS=2504;DP=1360;EAS_AF=0.003;AMR_AF=0.0014;AFR_AF=0.0129;EUR_AF=0;SAS_AF=0;AA=.|||;mutation_type=CAT>CGT 1 11008 . C G 100 PASS AC=441;AF=0.0880591;AN=5008;NS=2504;DP=2232;EAS_AF=0.0367;AMR_AF=0.0965;AFR_AF=0.1346;EUR_AF=0.0885;SAS_AF=0.0716;AA=.|||;mutation_type=CCT>CGT ``targets`` subcommand ~~~~~~~~~~~~~~~~~~~~~~ Usage: .. code:: console $ mutyper targets -h .. parsed-literal:: usage: mutyper targets [-h] [--verbose] [--k K] [--target TARGET] [--sep SEP] [--chrom_pos CHROM_POS] [--strand_file STRAND_FILE] [--strict] [--bed BED] fasta compute 𝑘-mer target sizes and stream to stdout positional arguments: fasta path to ancestral FASTA options: -h, --help show this help message and exit --verbose increase logging verbosity --k K k-mer context size (default 3) --target TARGET 0-based mutation target position in kmer (default middle) --sep SEP field delimiter in FASTA headers (default ":") --chrom_pos CHROM_POS 0-based chromosome field position in FASTA headers (default 0) --strand_file STRAND_FILE path to bed file with regions where reverse strand defines mutation context, e.g. direction of replication or transcription (default collapse reverse complements) --strict only uppercase nucleotides in FASTA considered ancestrally identified --bed BED path to BED file mask ("-" for stdin) .. code:: console $ mutyper targets $fasta | head .. parsed-literal:: AAA 6331 AAC 2775 AAG 3947 AAT 4335 ACA 4066 ACC 2440 ACG 632 ACT 3082 CAA 3506 CAC 3391 The optional argument ``--bed`` resticts target size computation based on a BED mask file, and is useful of normalizing mutation spectra according to genomic region, or calibrating mutaiton rates. .. code:: console $ mutyper targets $fasta --bed $bed | head .. parsed-literal:: AAA 7 AAC 3 AAG 1 AAT 3 ACA 2 ACC 1 ACG 2 ACT 1 CAA 2 CAC 1 ``spectra`` subcommand ~~~~~~~~~~~~~~~~~~~~~~ Usage: .. code:: console $ mutyper spectra -h .. parsed-literal:: usage: mutyper spectra [-h] [--verbose] [--population] [--randomize] vcf compute mutation spectra for each sample in VCF/BCF with mutation_type data and stream to stdout positional arguments: vcf VCF/BCF file, usually for a single chromosome ("-" for stdin) options: -h, --help show this help message and exit --verbose increase logging verbosity --population population-wise spectrum, instead of individual --randomize randomly assign mutation to a single haplotype We first add mutation type information with the ``variants`` subcommand, then pipe into the ``spectra`` subcommand to generate mutation spectra for all samples, and finally crop the output with ``head`` and ``cut`` for display. .. code:: console $ mutyper variants $fasta $vcf | mutyper spectra - | head | cut -f-5 .. parsed-literal:: sample AAA>AGA AAT>AGT ACA>AAA ACA>AGA HG00096 2 2 2 2 HG00097 2 2 2 2 HG00099 2 2 2 2 HG00100 2 2 2 2 HG00101 2 2 2 2 HG00102 2 2 2 2 HG00103 2 2 2 2 HG00105 2 2 2 2 HG00106 2 2 2 2 Use the ``--population`` to get the spectrum for whole population, rather than each individual .. code:: console $ mutyper variants $fasta $vcf | mutyper spectra - --population | head | cut -f-5 .. parsed-literal:: AAA>AGA AAT>AGT ACA>AAA ACA>AGA ACA>ATA 2 1 1 2 3 ``ksfs`` subcommand ~~~~~~~~~~~~~~~~~~~ Usage: .. code:: console $ mutyper ksfs -h .. parsed-literal:: usage: mutyper ksfs [-h] [--verbose] vcf compute sample frequency spectrum for each mutation type from a VCF/BCF file with mutation_type data (i.e. output from variants subcommand ) and stream to stdout positional arguments: vcf VCF/BCF file, usually for a single chromosome ("-" for stdin) options: -h, --help show this help message and exit --verbose increase logging verbosity Similar to the previous subcommand, but now we are generating a sample frequency spectrum (SFS) for each mutation type. .. code:: console $ mutyper variants $fasta $vcf | mutyper ksfs - | head | cut -f-5 .. parsed-literal:: sample_frequency AAA>AGA AAT>AGT ACA>AAA ACA>AGA 1 0 0 0 0 2 0 0 0 0 3 0 0 0 0 4 0 0 0 0 5 0 0 0 0 6 0 0 0 0 7 0 0 0 0 8 0 0 0 0 9 0 0 0 0