SeekDeep Variant Calling
Variant Calling
Currently this program is available in the developmental version of SeekDeep (git checkout develop
when installing) but will be included in future releases.
Variant calling and protein variant calling can be done during the processCluster step and can be done with input a results file with the minimum of 4 or 5 columns containing the following data:
- sample name
- target name
- target population unique identifier
- read count
An additional column can be supplied that has the sequence for the target or a directory with sequences for each target can be supplied. The directory should be set up so that each target from the target column has a file named [TARGET].fasta
And a genome and gff file will be needed which will be used to call the variants and the gff annotation will be used to translate and annotate the protein variants as well.
SeekDeep variantCallOnSeqAndProtein
The program within SeekDeep is called variantCallOnSeqAndProtein
. An example command is below. The data for the example below can be downloaded using the buttons below. This program requires that bowtie2
(to align to genome) and samtools
(for sorting bams) be installed. If a microhaplotype doesn’t map it’s excluded from the analysis and if a haplotype’s translation ends up being managled (either from read shift mutation or bad haplotype call, it will be excluded from the protein variant calling, these unmappable and untranslatable will be marked out). The translations are done by mapping to the reference and trim off the introns from the alignment, because of this if haplotypes don’t map well or if there is variation of intron haplotypes might not be able to be translated in this manner.
And the genome and gff used for this commands can be downloaded using the following
Code
wget http://seekdeep.brown.edu/data/plasmodiumData/pf.tar.gz
tar -zxvf pf.tar.gz
Or downloaded from git hub using git-lfs
Code
git clone https://github.com/nickjhathaway/pf_genomes.git
cd pf_genomes
./bin/unzip_data.sh
Code
pf_drug_resistant_aaPositions = readr::read_tsv("drug_resistant_aaPositions.tsv")
create_dt(pf_drug_resistant_aaPositions)
Code
SeekDeep variantCallOnSeqAndProtein --resultsFnp allSelectedClustersInfo_drugAndTopDiv.tsv.gz --genomeFnp pf/genomes/Pf3D7.fasta --gffFnp pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/drug_resistant_aaPositions.tsv --dout drugAndTopDiv_varCalls --overWriteDir --metaFnp meta.tsv --numThreads 14 --getPairwiseComps
Please see SeekDeep variantCallOnSeqAndProtein --help
to see all flags, will explain several of them here
Flags
Required
-
--resultsFnp - A table with at least 4 columns that supply: sample, target, popUIDForTarget, readCount, and also can have a column for sequence
- Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output
- --sampleColName - Column Name; default=s_Sample
- --targetNameColName - target Name Column Name, the column name in the table which indicates the different targets;default=p_name
- --popHapIdColName - a population level naming across samples for haplotypes for this target;default=h_popUID
- --popHapSeqColName - population Haplotype Sequence Column Name, the seq to call variants on;default=h_Consensus
- --withinSampleReadCntColName - within Sample Read Cnt Col Column Name;default=c_ReadCnt
- Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output
-
--genome - the path to a genome in FASTA format
- --gff - the path to a gff file that annotates the genome fasta file
Optional
-
--knownAminoAcidChangesFnp - a tab delimited (tsv) file with know amino acid changes, should have at least 2 columns, 1 named either ID or transcriptID and the other named AAPosition (case insenitivte), additional columns can be present and optionally added into outpus as additional meta
-
--metaFnp - a file that supplies meta data, should have at least two columns, a column with the name sample, the names of which match the samples in the --resultsFnp input file, this meta will be added to majority of outputs so it variants can be linked to output
-
--getPairwiseComps - compare microhaplotypes between each other
-
--numThreads - number of threads to add
-
--dout - the name of the output directory
-
--overWriteDir - over write the ouput directory if it already exists
- --aaSummaryRefInfoNextToAlt - by default, one of the summary tables that covers detected amino acid changes from references can report the number of alternative and reference either stacked (default, e.g. numbers for both alternative and reference are in one column, AA_cnt, AA_freq) or next to each one (e.g. reference_AA_cnt, reference_AA_freq, alternate_AA_cnt, alternate_AA_freq), which is turned on by this flag
Output
The full output can be downloaded here:
There will be a directory created for each target in the input and this will have a breakdown about the comparison of sequences within the target, diversity, and the variants. There will also be a directory called reports
that combines the data from all targets.
An example of the layout of the results of a target analysis
Code
tree -n drugAndTopDiv_varCalls/Pf07-0403507-0403717
drugAndTopDiv_varCalls/Pf07-0403507-0403717
├── inputSeqs.fasta.gz
└── variantCalling
├── divMeasures.tab.txt
├── summaryTable.tab.txt.gz
├── timeLog.txt
├── uniqueSeqs.fasta.gz
├── uniqueSeqs_labIsolateNames.tab.txt.gz
├── uniqueSeqs_meta.tab.txt.gz
├── uniqueSeqs_names.tab.txt.gz
└── variantCalls
├── PF3D7_0709000.1-protein.vcf.gz
├── PF3D7_0709000.1-protein_aminoAcidKnownMutations.tab.txt.gz
├── PF3D7_0709000.1-protein_aminoAcidVariable.tab.txt.gz
├── PF3D7_0709000.1-protein_aminoAcidsAll.tab.txt.gz
├── PF3D7_0709000.1-protein_variableRegion.bed
├── PF3D7_0709000.1-translatedSeqsAATyped.tab.txt.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs.fasta.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs_labIsolateNames.tab.txt.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs_meta.tab.txt.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs_names.tab.txt.gz
├── Pf3D7_07_v3-SNPs.tab.txt.gz
├── Pf3D7_07_v3-allBases.tab.txt.gz
├── Pf3D7_07_v3-chromosome_variableRegion.bed
├── Pf3D7_07_v3-complex-genomic.vcf.gz
├── Pf3D7_07_v3-genomic.vcf.gz
├── Pf3D7_07_v3-knownAA_SNPs.tab.txt.gz
├── geneInfos
│ ├── gene_PF3D7_0709000.1.bed
│ ├── gene_PF3D7_0709000.1_basePositions.tab.txt
│ ├── gene_PF3D7_0709000.1_cDNA.fasta
│ ├── gene_PF3D7_0709000.1_exonIntronPositions.bed
│ ├── gene_PF3D7_0709000.1_exonIntronPositions.tab.txt
│ ├── gene_PF3D7_0709000.1_gDNA.fasta
│ └── gene_PF3D7_0709000.1_protein.fasta
├── inputSeqs.bed
├── seqSNPTyped.tab.txt.gz
├── seqsAATyped.tab.txt.gz
├── seqsTranslationFiltered.txt
├── seqsUnableToBeMapped.txt
├── translatedDivMeasures.tab.txt
├── translatedInput.bed
├── translatedInput.fasta.gz
├── variantsPerSeqAln.tab.txt.gz
└── variantsPerTranslatedSeq.tab.txt.gz
4 directories, 41 files
The reports directory
Code
tree -n drugAndTopDiv_varCalls/reports
drugAndTopDiv_varCalls/reports
├── info
│ ├── genomicLocPerTargetsCounts.tab.txt.gz
│ ├── genomicLocsForKnownAAChanges.bed
│ ├── knownLocCoverage.tsv
│ ├── meta.tsv
│ ├── proteinLocPerTargetsCounts.tab.txt.gz
│ ├── targetsIntersectingWithGenesInfo.tsv
│ ├── targetsIntersectingWithLocsForKnownAAChanges.bed
│ └── transcriptLocsForKnownAAChanges.bed
├── log.txt
├── popGenetics
│ ├── allDiversityMeasures.tsv.gz
│ └── allTranslatedDivMeasures.tsv.gz
├── summary
│ ├── AAChangesInfo.tsv.gz
│ ├── allSummaryTables.tab.txt.gz
│ ├── all_uniqueSeqs.fasta.gz
│ ├── all_uniqueSeqs_labIsolateNames.tab.txt.gz
│ ├── hapCountsPerSamplePerTarget.tsv.gz
│ ├── readCountsPerSamplePerTarget.tsv.gz
│ └── unmappedUntranslatableHapCounts.tsv.gz
└── vcfs
├── allComplexGenomicVariantCalls.vcf.gz
├── allGenomicVariantCalls.vcf.gz
├── allProteinVariantCalls.vcf.gz
├── knownAAChangesGenomicVariantCalls.vcf.gz
└── knownAAChangesProteinVariantCalls.vcf.gz
5 directories, 23 files
VCF outputs:
Of note, all GT (genotype) fields in the vcfs will be forced to a ploidy (controlled by --ploidy
which defaults to 2)
-
vcfs/allComplexGenomicVariantCalls.vcf.gz - All genomic variants called, will make complex variant calls (which combines calls of close by SNPs controlled by
--complexVariantWithinDist
default = 12) -
vcfs/allGenomicVariantCalls.vcf.gz - All genomic variants called, if variation is covered by multiple targets, cars are collapsed (which can be controlled by flags
--combineOverlappingCallsAcrossTargets
, by default the target with more read coverage for a sample will be used for variant, this flag just sums across all covering targets, by summing the read counts will interference with any copy number variation coverage) -
vcfs/allProteinVariantCalls.vcf.gz - All protein variants called, chromosomes are protein transcripts
- vcfs/knownAAChangesGenomicVariantCalls.vcf.gz - If positions supplied, will subset to just those positions of the genomic locations that cover those
- vcfs/knownAAChangesProteinVariantCalls.vcf.gz - If positions supplied, will subset to just those positions
Locations info:
- info/transcriptLocsForKnownAAChanges.bed - The transcripts bed file of known mutations if supplied, will also contain meta on what genomic locations this covers
- info/genomicLocsForKnownAAChanges.bed - The genomic locations that overlap the known mutations if supplied
- info/targetsIntersectingWithGenesInfo.tsv - A list of the genes intersected with what targets (to see what specific amino acids are covered go to drugAndTopDiv_varCalls/[TARGET]/variantCalling/variantCalls/translatedInput.bed)
- info/knownLocCoverage.tsv - if known AA changes are supplied, this will reports which of your targets cover each known location
summary:
- summary/AAChangesInfo.tsv.gz - A summary of per amino acid change detected in the data, if known amino acid locations are given with known alts, will force calls for these changes even if not observed (so 0’s will be placed)
-
summary/allSummaryTables.tab.txt.gz - Summary of translated sequences and the variants detected/known for each sequence see
-
summary/hapCountsPerSamplePerTarget.tsv.gz - A target by sample table of the number of haplotypes per target
-
summary/readCountsPerSamplePerTarget.tsv.gz - A target by sample table of the read depth per target
- summary/unmmappedUntranslatableHapCounts.tsv.gz - A count of the number of haplotypes per target that couldn’t be mapped and of ones that could be mapped but not translated
Column names explantation
AAChangesInfo.tsv.gz
Many of the columns below were created to reflect previous programs naming schemes and are subject to change as feedback about this table is taken into account.
-
Gene_ID - The gene identifier within the gff provided
-
Gene_Transcript_ID - the mRNA transcript ID with in the gff provided
-
Gene - common gene name, ff it exists (e.g. AMA1, CSP, DHPS, CRT)
-
Mutation_Name - A name given to the position, will be [Gene]-[Ref-tricode-amino-acid]-[Pos]-(Alt-tricode-amino-acid)
-
ExonicFunc - The type of type observed e.g. missense_variant
- AA_Change - Similar to Mutation_Name but without the Gene name
-
Targeted - Whether or not this change is in the table provided of known changes
-
CoveredBy - What haplotypes targets cover this position
-
sample - the sample for the which the count data in the columns below belongs to
-
reference_AA_pos - The reference AA position (1-based, to reflect what is commonly reported)
-
reference_AA - What the reference AA is
-
AA - the amino acid for the sample for this reference_AA_pos
-
is_reference - Whether or not this AA is the reference or not
-
AA_cnt - the within sample read count for this AA
-
AA_freq - the within sample read frequency [0-1] for this AA
-
AA_withinSampleCoverage - The total reads covering this amino acid position (the sum of all reads for all AA detected for this reference_AA_pos for this sample)
- AlleleCount - The population level count of number of haplotypes/alleles with this Gene_Transcript_ID-reference_AA_pos-AA
- AlleleFrequency - The population level frequency [0-1] of haplotypes/alleles with this Gene_Transcript_ID-reference_AA_pos-AA
- SampleCount - The population level count samples with this Gene_Transcript_ID-reference_AA_pos-AA
- SamplePrevalence - The population level frequency [0-1] of samples with this Gene_Transcript_ID-reference_AA_pos-AA
allSummaryTables.tab.txt.gz
In addition to the columns below, any meta data supplied with --metaFnp
will also be included in this below table. The table will have a line for each time a sequence supplied so several fields will be repeated for the unqiue sequences
-
CollapsedName - A name given to the collapsed unique sequences from the input
- seq - The sequence for this haplotype
-
originalIdentifier - the original identifier in the input table for this sequence
-
readCount - The read count for the sample for this line
-
sample - the sample name in which this haplotype was found
-
target - the target name for this sequence
-
chrom - the genomic chromosome of this target
-
0based_start - 0 based start position of the start of this haplotype in the genome supplied
-
0based_end - the 0 based end postion (non-inclusive) of this haplotype in the genome supplied
- length - the length of the region this haplotype maps to
-
strand - the strand the haplotype mapped to
-
transcript - if this haplotype intersects any coding regions, the name of the transcript it overlaps with
-
translatedSeq - if this haplotype intersects any coding regions, this will be the translated sequence (which is determine by trim to just coding DNA sequence)
- transcript_1based_start - the start within the protein if it intersects any coding regions (1-based)
- transcript_1based_end - the end within the protein if it intersects any coding regions (1-based, inclusive)
-
transcript_length - the length of the protein region the haplotype maps to if it maps to a coding sequence
-
transcript_knownAATyped - if known amino acids are provided, this will have the calls for those positions (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)
- transcript_fullAATyped - this will have the calls of any variant determined in the population as well as any known supplied (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)
Diversity measures:
-
popGenetics/allDiversityMeasures.tsv.gz - Several diversity measures calculated for each target
- popGenetics/allTranslatedDivMeasures.tsv.gz - The same diversity measures as above but for the translated sequences
Column names explantation
-
id - the name of the target
-
totalHaplotypes - the total number of targets (this is different than number of samples since samples could have multiple unique haplotypes)
-
uniqueHaplotypes - the total number of unique microhaplotypes
-
singlets - the number of microhaplotypes that are found only once
-
doublets - the number of microhaplotypes that are found at least twice
- expShannonEntropy - exp(ShannonEntropy)
-
ShannonEntropyE - [ShannonEntropyE](https://www.sciencedirect.com/topics/engineering/shannon-entropy#:~:text=The%20Shannon%20entropy%20S%20(%20x,new%20value%20in%20the%20process.)
-
effectiveNumOfAlleles - effective Num Of Alleles, would have to gather at least this number of haplotypes from the same population to get similar diversity measures as seen here
-
SimpsonIndex - Simpson Index of diversity
- he - expected heterozygosity (the chance that if you drew two microhaplotypes at random from the population that they are different)
-
ExpP3 - the chance that if you drew three microhaplotypes at random from the population that they are three different microhaplotypes
-
ExpP4 - the chance that if you drew four microhaplotypes at random from the population that they are four different microhaplotypes
-
ExpP5 - the chance that if you drew five microhaplotypes at random from the population that they are five different microhaplotypes
- lengthPolymorphism - whether there is significant length polymorphism in the microhaplotypes
if --getPairwiseComps flag is used
-
avgPercentID - the average percent identity between pairwise comparisons between microhaplotypes
-
avgNumOfDiffs - the average number of differences between pairwise comparisons between microhaplotypes
- NucleotideDiversity - NucleotideDiversity
-
simpleAvalance -
-
completeAvalance -
-
nSegratingSites - the number of sites that differ between seqs
-
TajimaD - the Tajima’s D for these microhaplotypes
- TajimaDPVal - the p-value for Tajima’s D calc
Output files tables examples
allDiversityMeasures.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allDiversityMeasures.tsv.gz"))
allTranslatedDivMeasures.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allTranslatedDivMeasures.tsv.gz"))
genomicLocsForKnownAAChanges.bed
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))
transcriptLocsForKnownAAChanges.bed
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))
targetsIntersectingWithGenesInfo.tsv
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/targetsIntersectingWithGenesInfo.tsv"))
knownLocCoverage.tsv
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/knownLocCoverage.tsv"))
unmmappedUntranslatableHapCounts.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/unmappedUntranslatableHapCounts.tsv.gz"))
AAChangesInfo.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/AAChangesInfo.tsv.gz", n_max = 100))
allSummaryTables.tab.txt.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/allSummaryTables.tab.txt.gz", n_max = 100))