SeekDeep Variant Calling

Variant Calling

Currently this program is available in the developmental version of SeekDeep (git checkout develop when installing) but will be included in future releases.

Variant calling and protein variant calling can be done during the processCluster step and can be done with input a results file with the minimum of 4 or 5 columns containing the following data:

sample name
target name
target population unique identifier
read count

An additional column can be supplied that has the sequence for the target or a directory with sequences for each target can be supplied. The directory should be set up so that each target from the target column has a file named [TARGET].fasta

And a genome and gff file will be needed which will be used to call the variants and the gff annotation will be used to translate and annotate the protein variants as well.

`SeekDeep variantCallOnSeqAndProtein`

The program within SeekDeep is called variantCallOnSeqAndProtein. An example command is below. The data for the example below can be downloaded using the buttons below. This program requires that bowtie2 (to align to genome) and samtools (for sorting bams) be installed. If a microhaplotype doesn’t map it’s excluded from the analysis and if a haplotype’s translation ends up being managled (either from read shift mutation or bad haplotype call, it will be excluded from the protein variant calling, these unmappable and untranslatable will be marked out). The translations are done by mapping to the reference and trim off the introns from the alignment, because of this if haplotypes don’t map well or if there is variation of intron haplotypes might not be able to be translated in this manner.

allSelectedClustersInfo_drugAndTopDiv.tsv.gz

meta.tsv

And the genome and gff used for this commands can be downloaded using the following

Code

wget http://seekdeep.brown.edu/data/plasmodiumData/pf.tar.gz 
tar -zxvf pf.tar.gz

Or downloaded from git hub using git-lfs

Code

git clone https://github.com/nickjhathaway/pf_genomes.git
cd pf_genomes
./bin/unzip_data.sh

drug_resistant_aaPositions.tsv

Code

pf_drug_resistant_aaPositions = readr::read_tsv("drug_resistant_aaPositions.tsv")
create_dt(pf_drug_resistant_aaPositions)

Code

SeekDeep variantCallOnSeqAndProtein --resultsFnp allSelectedClustersInfo_drugAndTopDiv.tsv.gz --genomeFnp pf/genomes/Pf3D7.fasta --gffFnp pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/drug_resistant_aaPositions.tsv --dout drugAndTopDiv_varCalls --overWriteDir  --metaFnp meta.tsv --numThreads 14 --getPairwiseComps

Please see SeekDeep variantCallOnSeqAndProtein --help to see all flags, will explain several of them here

Flags

Required

--resultsFnp - A table with at least 4 columns that supply: sample, target, popUIDForTarget, readCount, and also can have a column for sequence
- Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output
  - --sampleColName - Column Name; default=s_Sample
  - --targetNameColName - target Name Column Name, the column name in the table which indicates the different targets;default=p_name
  - --popHapIdColName - a population level naming across samples for haplotypes for this target;default=h_popUID
  - --popHapSeqColName - population Haplotype Sequence Column Name, the seq to call variants on;default=h_Consensus
  - --withinSampleReadCntColName - within Sample Read Cnt Col Column Name;default=c_ReadCnt
--genome - the path to a genome in FASTA format
--gff - the path to a gff file that annotates the genome fasta file

Optional

--knownAminoAcidChangesFnp - a tab delimited (tsv) file with know amino acid changes, should have at least 2 columns, 1 named either ID or transcriptID and the other named AAPosition (case insenitivte), additional columns can be present and optionally added into outpus as additional meta
--metaFnp - a file that supplies meta data, should have at least two columns, a column with the name sample, the names of which match the samples in the --resultsFnp input file, this meta will be added to majority of outputs so it variants can be linked to output
--getPairwiseComps - compare microhaplotypes between each other
--numThreads - number of threads to add
--dout - the name of the output directory
--overWriteDir - over write the ouput directory if it already exists
--aaSummaryRefInfoNextToAlt - by default, one of the summary tables that covers detected amino acid changes from references can report the number of alternative and reference either stacked (default, e.g. numbers for both alternative and reference are in one column, AA_cnt, AA_freq) or next to each one (e.g. reference_AA_cnt, reference_AA_freq, alternate_AA_cnt, alternate_AA_freq), which is turned on by this flag

Output

The full output can be downloaded here:

drugAndTopDiv_varCalls.tar.gz

There will be a directory created for each target in the input and this will have a breakdown about the comparison of sequences within the target, diversity, and the variants. There will also be a directory called reports that combines the data from all targets.

An example of the layout of the results of a target analysis

Code

tree -n drugAndTopDiv_varCalls/Pf07-0403507-0403717

drugAndTopDiv_varCalls/Pf07-0403507-0403717
├── inputSeqs.fasta.gz
└── variantCalling
    ├── divMeasures.tab.txt
    ├── summaryTable.tab.txt.gz
    ├── timeLog.txt
    ├── uniqueSeqs.fasta.gz
    ├── uniqueSeqs_labIsolateNames.tab.txt.gz
    ├── uniqueSeqs_meta.tab.txt.gz
    ├── uniqueSeqs_names.tab.txt.gz
    └── variantCalls
        ├── PF3D7_0709000.1-protein.vcf.gz
        ├── PF3D7_0709000.1-protein_aminoAcidKnownMutations.tab.txt.gz
        ├── PF3D7_0709000.1-protein_aminoAcidVariable.tab.txt.gz
        ├── PF3D7_0709000.1-protein_aminoAcidsAll.tab.txt.gz
        ├── PF3D7_0709000.1-protein_variableRegion.bed
        ├── PF3D7_0709000.1-translatedSeqsAATyped.tab.txt.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs.fasta.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs_labIsolateNames.tab.txt.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs_meta.tab.txt.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs_names.tab.txt.gz
        ├── Pf3D7_07_v3-SNPs.tab.txt.gz
        ├── Pf3D7_07_v3-allBases.tab.txt.gz
        ├── Pf3D7_07_v3-chromosome_variableRegion.bed
        ├── Pf3D7_07_v3-complex-genomic.vcf.gz
        ├── Pf3D7_07_v3-genomic.vcf.gz
        ├── Pf3D7_07_v3-knownAA_SNPs.tab.txt.gz
        ├── geneInfos
        │   ├── gene_PF3D7_0709000.1.bed
        │   ├── gene_PF3D7_0709000.1_basePositions.tab.txt
        │   ├── gene_PF3D7_0709000.1_cDNA.fasta
        │   ├── gene_PF3D7_0709000.1_exonIntronPositions.bed
        │   ├── gene_PF3D7_0709000.1_exonIntronPositions.tab.txt
        │   ├── gene_PF3D7_0709000.1_gDNA.fasta
        │   └── gene_PF3D7_0709000.1_protein.fasta
        ├── inputSeqs.bed
        ├── seqSNPTyped.tab.txt.gz
        ├── seqsAATyped.tab.txt.gz
        ├── seqsTranslationFiltered.txt
        ├── seqsUnableToBeMapped.txt
        ├── translatedDivMeasures.tab.txt
        ├── translatedInput.bed
        ├── translatedInput.fasta.gz
        ├── variantsPerSeqAln.tab.txt.gz
        └── variantsPerTranslatedSeq.tab.txt.gz

4 directories, 41 files

The reports directory

Code

tree -n drugAndTopDiv_varCalls/reports

drugAndTopDiv_varCalls/reports
├── info
│   ├── genomicLocPerTargetsCounts.tab.txt.gz
│   ├── genomicLocsForKnownAAChanges.bed
│   ├── knownLocCoverage.tsv
│   ├── meta.tsv
│   ├── proteinLocPerTargetsCounts.tab.txt.gz
│   ├── targetsIntersectingWithGenesInfo.tsv
│   ├── targetsIntersectingWithLocsForKnownAAChanges.bed
│   └── transcriptLocsForKnownAAChanges.bed
├── log.txt
├── popGenetics
│   ├── allDiversityMeasures.tsv.gz
│   └── allTranslatedDivMeasures.tsv.gz
├── summary
│   ├── AAChangesInfo.tsv.gz
│   ├── allSummaryTables.tab.txt.gz
│   ├── all_uniqueSeqs.fasta.gz
│   ├── all_uniqueSeqs_labIsolateNames.tab.txt.gz
│   ├── hapCountsPerSamplePerTarget.tsv.gz
│   ├── readCountsPerSamplePerTarget.tsv.gz
│   └── unmappedUntranslatableHapCounts.tsv.gz
└── vcfs
    ├── allComplexGenomicVariantCalls.vcf.gz
    ├── allGenomicVariantCalls.vcf.gz
    ├── allProteinVariantCalls.vcf.gz
    ├── knownAAChangesGenomicVariantCalls.vcf.gz
    └── knownAAChangesProteinVariantCalls.vcf.gz

5 directories, 23 files

VCF outputs:

Of note, all GT (genotype) fields in the vcfs will be forced to a ploidy (controlled by --ploidy which defaults to 2)

vcfs/allComplexGenomicVariantCalls.vcf.gz - All genomic variants called, will make complex variant calls (which combines calls of close by SNPs controlled by --complexVariantWithinDist default = 12)
vcfs/allGenomicVariantCalls.vcf.gz - All genomic variants called, if variation is covered by multiple targets, cars are collapsed (which can be controlled by flags --combineOverlappingCallsAcrossTargets, by default the target with more read coverage for a sample will be used for variant, this flag just sums across all covering targets, by summing the read counts will interference with any copy number variation coverage)
vcfs/allProteinVariantCalls.vcf.gz - All protein variants called, chromosomes are protein transcripts
vcfs/knownAAChangesGenomicVariantCalls.vcf.gz - If positions supplied, will subset to just those positions of the genomic locations that cover those
vcfs/knownAAChangesProteinVariantCalls.vcf.gz - If positions supplied, will subset to just those positions

Locations info:

info/transcriptLocsForKnownAAChanges.bed - The transcripts bed file of known mutations if supplied, will also contain meta on what genomic locations this covers
info/genomicLocsForKnownAAChanges.bed - The genomic locations that overlap the known mutations if supplied
info/targetsIntersectingWithGenesInfo.tsv - A list of the genes intersected with what targets (to see what specific amino acids are covered go to drugAndTopDiv_varCalls/[TARGET]/variantCalling/variantCalls/translatedInput.bed)
info/knownLocCoverage.tsv - if known AA changes are supplied, this will reports which of your targets cover each known location

summary:

summary/AAChangesInfo.tsv.gz - A summary of per amino acid change detected in the data, if known amino acid locations are given with known alts, will force calls for these changes even if not observed (so 0’s will be placed)
summary/allSummaryTables.tab.txt.gz - Summary of translated sequences and the variants detected/known for each sequence see
summary/hapCountsPerSamplePerTarget.tsv.gz - A target by sample table of the number of haplotypes per target
summary/readCountsPerSamplePerTarget.tsv.gz - A target by sample table of the read depth per target
summary/unmmappedUntranslatableHapCounts.tsv.gz - A count of the number of haplotypes per target that couldn’t be mapped and of ones that could be mapped but not translated

Column names explantation

AAChangesInfo.tsv.gz

Many of the columns below were created to reflect previous programs naming schemes and are subject to change as feedback about this table is taken into account.

Gene_ID - The gene identifier within the gff provided
Gene_Transcript_ID - the mRNA transcript ID with in the gff provided
Gene - common gene name, ff it exists (e.g. AMA1, CSP, DHPS, CRT)
Mutation_Name - A name given to the position, will be [Gene]-[Ref-tricode-amino-acid]-[Pos]-(Alt-tricode-amino-acid)
ExonicFunc - The type of type observed e.g. missense_variant
AA_Change - Similar to Mutation_Name but without the Gene name
Targeted - Whether or not this change is in the table provided of known changes
CoveredBy - What haplotypes targets cover this position
sample - the sample for the which the count data in the columns below belongs to
reference_AA_pos - The reference AA position (1-based, to reflect what is commonly reported)
reference_AA - What the reference AA is
AA - the amino acid for the sample for this reference_AA_pos
is_reference - Whether or not this AA is the reference or not
AA_cnt - the within sample read count for this AA
AA_freq - the within sample read frequency [0-1] for this AA
AA_withinSampleCoverage - The total reads covering this amino acid position (the sum of all reads for all AA detected for this reference_AA_pos for this sample)
AlleleCount - The population level count of number of haplotypes/alleles with this Gene_Transcript_ID-reference_AA_pos-AA
AlleleFrequency - The population level frequency [0-1] of haplotypes/alleles with this Gene_Transcript_ID-reference_AA_pos-AA
SampleCount - The population level count samples with this Gene_Transcript_ID-reference_AA_pos-AA
SamplePrevalence - The population level frequency [0-1] of samples with this Gene_Transcript_ID-reference_AA_pos-AA

allSummaryTables.tab.txt.gz

In addition to the columns below, any meta data supplied with --metaFnp will also be included in this below table. The table will have a line for each time a sequence supplied so several fields will be repeated for the unqiue sequences

CollapsedName - A name given to the collapsed unique sequences from the input
seq - The sequence for this haplotype
originalIdentifier - the original identifier in the input table for this sequence
readCount - The read count for the sample for this line
sample - the sample name in which this haplotype was found
target - the target name for this sequence
chrom - the genomic chromosome of this target
0based_start - 0 based start position of the start of this haplotype in the genome supplied
0based_end - the 0 based end postion (non-inclusive) of this haplotype in the genome supplied
length - the length of the region this haplotype maps to
strand - the strand the haplotype mapped to
transcript - if this haplotype intersects any coding regions, the name of the transcript it overlaps with
translatedSeq - if this haplotype intersects any coding regions, this will be the translated sequence (which is determine by trim to just coding DNA sequence)
transcript_1based_start - the start within the protein if it intersects any coding regions (1-based)
transcript_1based_end - the end within the protein if it intersects any coding regions (1-based, inclusive)
transcript_length - the length of the protein region the haplotype maps to if it maps to a coding sequence
transcript_knownAATyped - if known amino acids are provided, this will have the calls for those positions (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)
transcript_fullAATyped - this will have the calls of any variant determined in the population as well as any known supplied (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)

Diversity measures:

popGenetics/allDiversityMeasures.tsv.gz - Several diversity measures calculated for each target
popGenetics/allTranslatedDivMeasures.tsv.gz - The same diversity measures as above but for the translated sequences

Column names explantation

id - the name of the target
totalHaplotypes - the total number of targets (this is different than number of samples since samples could have multiple unique haplotypes)
uniqueHaplotypes - the total number of unique microhaplotypes
singlets - the number of microhaplotypes that are found only once
doublets - the number of microhaplotypes that are found at least twice
expShannonEntropy - exp(ShannonEntropy)
ShannonEntropyE - [ShannonEntropyE](https://www.sciencedirect.com/topics/engineering/shannon-entropy#:~:text=The%20Shannon%20entropy%20S%20(%20x,new%20value%20in%20the%20process.)
effectiveNumOfAlleles - effective Num Of Alleles, would have to gather at least this number of haplotypes from the same population to get similar diversity measures as seen here
SimpsonIndex - Simpson Index of diversity
he - expected heterozygosity (the chance that if you drew two microhaplotypes at random from the population that they are different)
ExpP3 - the chance that if you drew three microhaplotypes at random from the population that they are three different microhaplotypes
ExpP4 - the chance that if you drew four microhaplotypes at random from the population that they are four different microhaplotypes
ExpP5 - the chance that if you drew five microhaplotypes at random from the population that they are five different microhaplotypes
lengthPolymorphism - whether there is significant length polymorphism in the microhaplotypes

if --getPairwiseComps flag is used

avgPercentID - the average percent identity between pairwise comparisons between microhaplotypes
avgNumOfDiffs - the average number of differences between pairwise comparisons between microhaplotypes
NucleotideDiversity - NucleotideDiversity
simpleAvalance -
completeAvalance -
nSegratingSites - the number of sites that differ between seqs
TajimaD - the Tajima’s D for these microhaplotypes
TajimaDPVal - the p-value for Tajima’s D calc

Output files tables examples

allDiversityMeasures.tsv.gz

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allDiversityMeasures.tsv.gz"))

allTranslatedDivMeasures.tsv.gz

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allTranslatedDivMeasures.tsv.gz"))

genomicLocsForKnownAAChanges.bed

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))

transcriptLocsForKnownAAChanges.bed

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))

targetsIntersectingWithGenesInfo.tsv

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/targetsIntersectingWithGenesInfo.tsv"))

knownLocCoverage.tsv

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/knownLocCoverage.tsv"))

unmmappedUntranslatableHapCounts.tsv.gz

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/unmappedUntranslatableHapCounts.tsv.gz"))

AAChangesInfo.tsv.gz

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/AAChangesInfo.tsv.gz", n_max = 100))

allSummaryTables.tab.txt.gz

Code

create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/allSummaryTables.tab.txt.gz", n_max = 100))