SeekDeep
  • Home
  • Installing
    • Mac OS
    • Ubuntu
    • Windows
    • Vagrant/virtual image (any system)
  • Code
    • Github
  • Usages

    • Pipeline
    • extractor/extractorPairedEnd
    • makeSampleDirectories
    • qluster
    • processClusters
    • popClusteringViewer

    • Pipeline Wrapper
    • setupTarAmpAnalysis

    • Utilities
    • genTargetInfoFromGenomes
    • SeekDeep control mixture benchmarking
    • SeekDeep Variant Calling
  • Tutorials

    • Single End
    • Ion Torrent with MIDs

    • Illumina Paired End
    • Paired End No MIDs/Barcodes
    • Paired End With MIDs/Barcodes
  • Misc Info
    • Illumina Paired End Info
  • References
    • versions
    • References
  1. SeekDeep Variant Calling

Contents

  • Variant Calling
  • SeekDeep variantCallOnSeqAndProtein
    • Flags
      • Required
      • Optional
    • Output
      • VCF outputs:
      • Locations info:
      • summary:
        • Column names explantation
          • AAChangesInfo.tsv.gz
          • allSummaryTables.tab.txt.gz
      • Diversity measures:
        • Column names explantation
      • Output files tables examples
        • allDiversityMeasures.tsv.gz
        • allTranslatedDivMeasures.tsv.gz
        • genomicLocsForKnownAAChanges.bed
        • transcriptLocsForKnownAAChanges.bed
        • targetsIntersectingWithGenesInfo.tsv
        • knownLocCoverage.tsv
        • unmmappedUntranslatableHapCounts.tsv.gz
        • AAChangesInfo.tsv.gz
        • allSummaryTables.tab.txt.gz

SeekDeep Variant Calling

  • Show All Code
  • Hide All Code

  • View Source

Variant Calling

Currently this program is available in the developmental version of SeekDeep (git checkout develop when installing) but will be included in future releases.

Variant calling and protein variant calling can be done during the processCluster step and can be done with input a results file with the minimum of 4 or 5 columns containing the following data:

  • sample name
  • target name
  • target population unique identifier
  • read count

An additional column can be supplied that has the sequence for the target or a directory with sequences for each target can be supplied. The directory should be set up so that each target from the target column has a file named [TARGET].fasta

And a genome and gff file will be needed which will be used to call the variants and the gff annotation will be used to translate and annotate the protein variants as well.

SeekDeep variantCallOnSeqAndProtein

The program within SeekDeep is called variantCallOnSeqAndProtein. An example command is below. The data for the example below can be downloaded using the buttons below. This program requires that bowtie2 (to align to genome) and samtools (for sorting bams) be installed. If a microhaplotype doesn’t map it’s excluded from the analysis and if a haplotype’s translation ends up being managled (either from read shift mutation or bad haplotype call, it will be excluded from the protein variant calling, these unmappable and untranslatable will be marked out). The translations are done by mapping to the reference and trim off the introns from the alignment, because of this if haplotypes don’t map well or if there is variation of intron haplotypes might not be able to be translated in this manner.

allSelectedClustersInfo_drugAndTopDiv.tsv.gz

meta.tsv

And the genome and gff used for this commands can be downloaded using the following

Code
wget http://seekdeep.brown.edu/data/plasmodiumData/pf.tar.gz 
tar -zxvf pf.tar.gz

Or downloaded from git hub using git-lfs

Code
git clone https://github.com/nickjhathaway/pf_genomes.git
cd pf_genomes
./bin/unzip_data.sh

drug_resistant_aaPositions.tsv

Code
pf_drug_resistant_aaPositions = readr::read_tsv("drug_resistant_aaPositions.tsv")
create_dt(pf_drug_resistant_aaPositions)
Code
SeekDeep variantCallOnSeqAndProtein --resultsFnp allSelectedClustersInfo_drugAndTopDiv.tsv.gz --genomeFnp pf/genomes/Pf3D7.fasta --gffFnp pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/drug_resistant_aaPositions.tsv --dout drugAndTopDiv_varCalls --overWriteDir  --metaFnp meta.tsv --numThreads 14 --getPairwiseComps

Please see SeekDeep variantCallOnSeqAndProtein --help to see all flags, will explain several of them here

Flags

Required

  • --resultsFnp - A table with at least 4 columns that supply: sample, target, popUIDForTarget, readCount, and also can have a column for sequence
    • Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output
      • --sampleColName - Column Name; default=s_Sample
      • --targetNameColName - target Name Column Name, the column name in the table which indicates the different targets;default=p_name
      • --popHapIdColName - a population level naming across samples for haplotypes for this target;default=h_popUID
      • --popHapSeqColName - population Haplotype Sequence Column Name, the seq to call variants on;default=h_Consensus
      • --withinSampleReadCntColName - within Sample Read Cnt Col Column Name;default=c_ReadCnt
  • --genome - the path to a genome in FASTA format
  • --gff - the path to a gff file that annotates the genome fasta file

Optional

  • --knownAminoAcidChangesFnp - a tab delimited (tsv) file with know amino acid changes, should have at least 2 columns, 1 named either ID or transcriptID and the other named AAPosition (case insenitivte), additional columns can be present and optionally added into outpus as additional meta
  • --metaFnp - a file that supplies meta data, should have at least two columns, a column with the name sample, the names of which match the samples in the --resultsFnp input file, this meta will be added to majority of outputs so it variants can be linked to output
  • --getPairwiseComps - compare microhaplotypes between each other
  • --numThreads - number of threads to add
  • --dout - the name of the output directory
  • --overWriteDir - over write the ouput directory if it already exists
  • --aaSummaryRefInfoNextToAlt - by default, one of the summary tables that covers detected amino acid changes from references can report the number of alternative and reference either stacked (default, e.g. numbers for both alternative and reference are in one column, AA_cnt, AA_freq) or next to each one (e.g. reference_AA_cnt, reference_AA_freq, alternate_AA_cnt, alternate_AA_freq), which is turned on by this flag

Output

The full output can be downloaded here:

drugAndTopDiv_varCalls.tar.gz

There will be a directory created for each target in the input and this will have a breakdown about the comparison of sequences within the target, diversity, and the variants. There will also be a directory called reports that combines the data from all targets.

An example of the layout of the results of a target analysis

Code
tree -n drugAndTopDiv_varCalls/Pf07-0403507-0403717
drugAndTopDiv_varCalls/Pf07-0403507-0403717
├── inputSeqs.fasta.gz
└── variantCalling
    ├── divMeasures.tab.txt
    ├── summaryTable.tab.txt.gz
    ├── timeLog.txt
    ├── uniqueSeqs.fasta.gz
    ├── uniqueSeqs_labIsolateNames.tab.txt.gz
    ├── uniqueSeqs_meta.tab.txt.gz
    ├── uniqueSeqs_names.tab.txt.gz
    └── variantCalls
        ├── PF3D7_0709000.1-protein.vcf.gz
        ├── PF3D7_0709000.1-protein_aminoAcidKnownMutations.tab.txt.gz
        ├── PF3D7_0709000.1-protein_aminoAcidVariable.tab.txt.gz
        ├── PF3D7_0709000.1-protein_aminoAcidsAll.tab.txt.gz
        ├── PF3D7_0709000.1-protein_variableRegion.bed
        ├── PF3D7_0709000.1-translatedSeqsAATyped.tab.txt.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs.fasta.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs_labIsolateNames.tab.txt.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs_meta.tab.txt.gz
        ├── PF3D7_0709000.1-uniqueTranslatedSeqs_names.tab.txt.gz
        ├── Pf3D7_07_v3-SNPs.tab.txt.gz
        ├── Pf3D7_07_v3-allBases.tab.txt.gz
        ├── Pf3D7_07_v3-chromosome_variableRegion.bed
        ├── Pf3D7_07_v3-complex-genomic.vcf.gz
        ├── Pf3D7_07_v3-genomic.vcf.gz
        ├── Pf3D7_07_v3-knownAA_SNPs.tab.txt.gz
        ├── geneInfos
        │   ├── gene_PF3D7_0709000.1.bed
        │   ├── gene_PF3D7_0709000.1_basePositions.tab.txt
        │   ├── gene_PF3D7_0709000.1_cDNA.fasta
        │   ├── gene_PF3D7_0709000.1_exonIntronPositions.bed
        │   ├── gene_PF3D7_0709000.1_exonIntronPositions.tab.txt
        │   ├── gene_PF3D7_0709000.1_gDNA.fasta
        │   └── gene_PF3D7_0709000.1_protein.fasta
        ├── inputSeqs.bed
        ├── seqSNPTyped.tab.txt.gz
        ├── seqsAATyped.tab.txt.gz
        ├── seqsTranslationFiltered.txt
        ├── seqsUnableToBeMapped.txt
        ├── translatedDivMeasures.tab.txt
        ├── translatedInput.bed
        ├── translatedInput.fasta.gz
        ├── variantsPerSeqAln.tab.txt.gz
        └── variantsPerTranslatedSeq.tab.txt.gz

4 directories, 41 files

The reports directory

Code
tree -n drugAndTopDiv_varCalls/reports
drugAndTopDiv_varCalls/reports
├── info
│   ├── genomicLocPerTargetsCounts.tab.txt.gz
│   ├── genomicLocsForKnownAAChanges.bed
│   ├── knownLocCoverage.tsv
│   ├── meta.tsv
│   ├── proteinLocPerTargetsCounts.tab.txt.gz
│   ├── targetsIntersectingWithGenesInfo.tsv
│   ├── targetsIntersectingWithLocsForKnownAAChanges.bed
│   └── transcriptLocsForKnownAAChanges.bed
├── log.txt
├── popGenetics
│   ├── allDiversityMeasures.tsv.gz
│   └── allTranslatedDivMeasures.tsv.gz
├── summary
│   ├── AAChangesInfo.tsv.gz
│   ├── allSummaryTables.tab.txt.gz
│   ├── all_uniqueSeqs.fasta.gz
│   ├── all_uniqueSeqs_labIsolateNames.tab.txt.gz
│   ├── hapCountsPerSamplePerTarget.tsv.gz
│   ├── readCountsPerSamplePerTarget.tsv.gz
│   └── unmappedUntranslatableHapCounts.tsv.gz
└── vcfs
    ├── allComplexGenomicVariantCalls.vcf.gz
    ├── allGenomicVariantCalls.vcf.gz
    ├── allProteinVariantCalls.vcf.gz
    ├── knownAAChangesGenomicVariantCalls.vcf.gz
    └── knownAAChangesProteinVariantCalls.vcf.gz

5 directories, 23 files

VCF outputs:

Of note, all GT (genotype) fields in the vcfs will be forced to a ploidy (controlled by --ploidy which defaults to 2)

  • vcfs/allComplexGenomicVariantCalls.vcf.gz - All genomic variants called, will make complex variant calls (which combines calls of close by SNPs controlled by --complexVariantWithinDist default = 12)
  • vcfs/allGenomicVariantCalls.vcf.gz - All genomic variants called, if variation is covered by multiple targets, cars are collapsed (which can be controlled by flags --combineOverlappingCallsAcrossTargets, by default the target with more read coverage for a sample will be used for variant, this flag just sums across all covering targets, by summing the read counts will interference with any copy number variation coverage)
  • vcfs/allProteinVariantCalls.vcf.gz - All protein variants called, chromosomes are protein transcripts
  • vcfs/knownAAChangesGenomicVariantCalls.vcf.gz - If positions supplied, will subset to just those positions of the genomic locations that cover those
  • vcfs/knownAAChangesProteinVariantCalls.vcf.gz - If positions supplied, will subset to just those positions

Locations info:

  • info/transcriptLocsForKnownAAChanges.bed - The transcripts bed file of known mutations if supplied, will also contain meta on what genomic locations this covers
  • info/genomicLocsForKnownAAChanges.bed - The genomic locations that overlap the known mutations if supplied
  • info/targetsIntersectingWithGenesInfo.tsv - A list of the genes intersected with what targets (to see what specific amino acids are covered go to drugAndTopDiv_varCalls/[TARGET]/variantCalling/variantCalls/translatedInput.bed)
  • info/knownLocCoverage.tsv - if known AA changes are supplied, this will reports which of your targets cover each known location

summary:

  • summary/AAChangesInfo.tsv.gz - A summary of per amino acid change detected in the data, if known amino acid locations are given with known alts, will force calls for these changes even if not observed (so 0’s will be placed)
  • summary/allSummaryTables.tab.txt.gz - Summary of translated sequences and the variants detected/known for each sequence see
  • summary/hapCountsPerSamplePerTarget.tsv.gz - A target by sample table of the number of haplotypes per target
  • summary/readCountsPerSamplePerTarget.tsv.gz - A target by sample table of the read depth per target
  • summary/unmmappedUntranslatableHapCounts.tsv.gz - A count of the number of haplotypes per target that couldn’t be mapped and of ones that could be mapped but not translated

Column names explantation

AAChangesInfo.tsv.gz

Many of the columns below were created to reflect previous programs naming schemes and are subject to change as feedback about this table is taken into account.

  • Gene_ID - The gene identifier within the gff provided
  • Gene_Transcript_ID - the mRNA transcript ID with in the gff provided
  • Gene - common gene name, ff it exists (e.g. AMA1, CSP, DHPS, CRT)
  • Mutation_Name - A name given to the position, will be [Gene]-[Ref-tricode-amino-acid]-[Pos]-(Alt-tricode-amino-acid)
  • ExonicFunc - The type of type observed e.g. missense_variant
  • AA_Change - Similar to Mutation_Name but without the Gene name
  • Targeted - Whether or not this change is in the table provided of known changes
  • CoveredBy - What haplotypes targets cover this position
  • sample - the sample for the which the count data in the columns below belongs to
  • reference_AA_pos - The reference AA position (1-based, to reflect what is commonly reported)
  • reference_AA - What the reference AA is
  • AA - the amino acid for the sample for this reference_AA_pos
  • is_reference - Whether or not this AA is the reference or not
  • AA_cnt - the within sample read count for this AA
  • AA_freq - the within sample read frequency [0-1] for this AA
  • AA_withinSampleCoverage - The total reads covering this amino acid position (the sum of all reads for all AA detected for this reference_AA_pos for this sample)
  • AlleleCount - The population level count of number of haplotypes/alleles with this Gene_Transcript_ID-reference_AA_pos-AA
  • AlleleFrequency - The population level frequency [0-1] of haplotypes/alleles with this Gene_Transcript_ID-reference_AA_pos-AA
  • SampleCount - The population level count samples with this Gene_Transcript_ID-reference_AA_pos-AA
  • SamplePrevalence - The population level frequency [0-1] of samples with this Gene_Transcript_ID-reference_AA_pos-AA
allSummaryTables.tab.txt.gz

In addition to the columns below, any meta data supplied with --metaFnp will also be included in this below table. The table will have a line for each time a sequence supplied so several fields will be repeated for the unqiue sequences

  • CollapsedName - A name given to the collapsed unique sequences from the input
  • seq - The sequence for this haplotype
  • originalIdentifier - the original identifier in the input table for this sequence
  • readCount - The read count for the sample for this line
  • sample - the sample name in which this haplotype was found
  • target - the target name for this sequence
  • chrom - the genomic chromosome of this target
  • 0based_start - 0 based start position of the start of this haplotype in the genome supplied
  • 0based_end - the 0 based end postion (non-inclusive) of this haplotype in the genome supplied
  • length - the length of the region this haplotype maps to
  • strand - the strand the haplotype mapped to
  • transcript - if this haplotype intersects any coding regions, the name of the transcript it overlaps with
  • translatedSeq - if this haplotype intersects any coding regions, this will be the translated sequence (which is determine by trim to just coding DNA sequence)
  • transcript_1based_start - the start within the protein if it intersects any coding regions (1-based)
  • transcript_1based_end - the end within the protein if it intersects any coding regions (1-based, inclusive)
  • transcript_length - the length of the protein region the haplotype maps to if it maps to a coding sequence
  • transcript_knownAATyped - if known amino acids are provided, this will have the calls for those positions (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)
  • transcript_fullAATyped - this will have the calls of any variant determined in the population as well as any known supplied (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)

Diversity measures:

  • popGenetics/allDiversityMeasures.tsv.gz - Several diversity measures calculated for each target
  • popGenetics/allTranslatedDivMeasures.tsv.gz - The same diversity measures as above but for the translated sequences

Column names explantation

  • id - the name of the target
  • totalHaplotypes - the total number of targets (this is different than number of samples since samples could have multiple unique haplotypes)
  • uniqueHaplotypes - the total number of unique microhaplotypes
  • singlets - the number of microhaplotypes that are found only once
  • doublets - the number of microhaplotypes that are found at least twice
  • expShannonEntropy - exp(ShannonEntropy)
  • ShannonEntropyE - [ShannonEntropyE](https://www.sciencedirect.com/topics/engineering/shannon-entropy#:~:text=The%20Shannon%20entropy%20S%20(%20x,new%20value%20in%20the%20process.)
  • effectiveNumOfAlleles - effective Num Of Alleles, would have to gather at least this number of haplotypes from the same population to get similar diversity measures as seen here
  • SimpsonIndex - Simpson Index of diversity
  • he - expected heterozygosity (the chance that if you drew two microhaplotypes at random from the population that they are different)
  • ExpP3 - the chance that if you drew three microhaplotypes at random from the population that they are three different microhaplotypes
  • ExpP4 - the chance that if you drew four microhaplotypes at random from the population that they are four different microhaplotypes
  • ExpP5 - the chance that if you drew five microhaplotypes at random from the population that they are five different microhaplotypes
  • lengthPolymorphism - whether there is significant length polymorphism in the microhaplotypes

if --getPairwiseComps flag is used

  • avgPercentID - the average percent identity between pairwise comparisons between microhaplotypes
  • avgNumOfDiffs - the average number of differences between pairwise comparisons between microhaplotypes
  • NucleotideDiversity - NucleotideDiversity
  • simpleAvalance -
  • completeAvalance -
  • nSegratingSites - the number of sites that differ between seqs
  • TajimaD - the Tajima’s D for these microhaplotypes
  • TajimaDPVal - the p-value for Tajima’s D calc

Output files tables examples

allDiversityMeasures.tsv.gz

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allDiversityMeasures.tsv.gz"))

allTranslatedDivMeasures.tsv.gz

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allTranslatedDivMeasures.tsv.gz"))

genomicLocsForKnownAAChanges.bed

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))

transcriptLocsForKnownAAChanges.bed

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))

targetsIntersectingWithGenesInfo.tsv

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/targetsIntersectingWithGenesInfo.tsv"))

knownLocCoverage.tsv

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/knownLocCoverage.tsv"))

unmmappedUntranslatableHapCounts.tsv.gz

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/unmappedUntranslatableHapCounts.tsv.gz"))

AAChangesInfo.tsv.gz

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/AAChangesInfo.tsv.gz", n_max = 100))

allSummaryTables.tab.txt.gz

Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/allSummaryTables.tab.txt.gz", n_max = 100))
Source Code
---
title: "SeekDeep Variant Calling"
---

<script>
$(document).ready(function() {
    document.querySelectorAll('.downloadLink').forEach(function(e) { e.setAttribute('download', e.text); });
    document.querySelectorAll('.downloadLink').forEach(function(e) { e.innerHTML = '<i class="fa fa-download"></i>  ' + e.text; });
});
</script>

```{r setup, echo=FALSE, message=FALSE}
source("../common.R")
```


# Variant Calling 

Currently this program is available in the developmental version of SeekDeep (`git checkout develop` when installing) but will be included in future releases. 

Variant calling and protein variant calling can be done during the [processCluster step](processClusters_usage.qmd#variation-calling-and-translation) and can be done with input a results file with the minimum of 4 or 5 columns containing the following data: 

*  sample name
*  target name  
*  target population unique identifier  
*  read count  

An additional column can be supplied that has the sequence for the target or a directory with sequences for each target can be supplied. The directory should be set up so that each target from the target column has a file named [TARGET].fasta 

And a genome and gff file will be needed which will be used to call the variants and the gff annotation will be used to translate and annotate the protein variants as well. 


# `SeekDeep variantCallOnSeqAndProtein`

The program within SeekDeep is called `variantCallOnSeqAndProtein`. An example command is below. The data for the example below can be downloaded using the buttons below. This program requires that `bowtie2` (to align to genome) and `samtools` (for sorting bams) be installed. If a microhaplotype doesn't map it's excluded from the analysis and if a haplotype's translation ends up being managled (either from read shift mutation or bad haplotype call, it will be excluded from the protein variant calling, these unmappable and untranslatable will be marked out). The translations are done by mapping to the reference and trim off the introns from the alignment, because of this if haplotypes don't map well or if there is variation of intron haplotypes might not be able to be translated in this manner.   

```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("allSelectedClustersInfo_drugAndTopDiv.tsv.gz"))
cat("\n\n")
cat(createDownloadLink("meta.tsv"))
cat("\n\n")
```

And the genome and gff used for this commands can be downloaded using the following 

```{bash, eval = F}
wget http://seekdeep.brown.edu/data/plasmodiumData/pf.tar.gz 
tar -zxvf pf.tar.gz
```

Or downloaded from git hub using `git-lfs`

```{bash, eval = F}
git clone https://github.com/nickjhathaway/pf_genomes.git
cd pf_genomes
./bin/unzip_data.sh
```


```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("drug_resistant_aaPositions.tsv"))
cat("\n\n")
```


```{r}
pf_drug_resistant_aaPositions = readr::read_tsv("drug_resistant_aaPositions.tsv")
create_dt(pf_drug_resistant_aaPositions)
```


```{bash, eval = F}
SeekDeep variantCallOnSeqAndProtein --resultsFnp allSelectedClustersInfo_drugAndTopDiv.tsv.gz --genomeFnp pf/genomes/Pf3D7.fasta --gffFnp pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/drug_resistant_aaPositions.tsv --dout drugAndTopDiv_varCalls --overWriteDir  --metaFnp meta.tsv --numThreads 14 --getPairwiseComps

```

Please see `SeekDeep variantCallOnSeqAndProtein --help` to see all flags, will explain several of them here 

## Flags  

### Required  

*  **\-\-resultsFnp** - A table with at least 4 columns that supply: sample, target, popUIDForTarget, readCount, and also can have a column for sequence    
    *  Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output  
        *  **\-\-sampleColName** -  Column Name; ***default***=**s_Sample**
        *  **\-\-targetNameColName** -  target Name Column Name, the column name in the table which indicates the different targets;***default***=**p_name**
        *  **\-\-popHapIdColName** -  a population level naming across samples for haplotypes for this target;***default***=**h_popUID**
        *  **\-\-popHapSeqColName** -  population Haplotype Sequence Column Name, the seq to call variants on;***default***=**h_Consensus**
        *  **\-\-withinSampleReadCntColName** -  within Sample Read Cnt Col Column Name;***default***=**c_ReadCnt**
*  **\-\-genome** - the path to a genome in FASTA format    
*  **\-\-gff** - the path to a gff file that annotates the genome fasta file  


### Optional   

*  **\-\-knownAminoAcidChangesFnp** - a tab delimited (tsv) file with know amino acid changes, should have at least 2 columns, 1 named either ID or transcriptID and the other named AAPosition (case insenitivte), additional columns can be present and optionally added into outpus as additional meta    
*  **\-\-metaFnp** - a file that supplies meta data, should have at least two columns, a column with the name sample, the names of which match the samples in the **\-\-resultsFnp** input file, this meta will be added to majority of outputs so it variants can be linked to output    
*  **\-\-getPairwiseComps** - compare microhaplotypes between each other  
*  **\-\-numThreads** - number of threads to add  
*  **\-\-dout** - the name of the output directory     
*  **\-\-overWriteDir** - over write the ouput directory if it already exists   
*  **\-\-aaSummaryRefInfoNextToAlt** - by default, one of the summary tables that covers detected amino acid changes from references can report the number of alternative and reference either stacked (**default**, e.g. numbers for both alternative and reference are in one column, **AA_cnt**, **AA_freq**) or next to each one (e.g. **reference_AA_cnt**, **reference_AA_freq**, **alternate_AA_cnt**, **alternate_AA_freq**), which is turned on by this flag

## Output  

The full output can be downloaded here: 
```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("drugAndTopDiv_varCalls.tar.gz"))
cat("\n\n")
```

There will be a directory created for each target in the input and this will have a breakdown about the comparison of sequences within the target, diversity, and the variants. There will also be a directory called `reports` that combines the data from all targets. 

An example of the layout of the results of a target analysis 

```{bash}
tree -n drugAndTopDiv_varCalls/Pf07-0403507-0403717
```

The reports directory  
```{bash}
tree -n drugAndTopDiv_varCalls/reports
```


### VCF outputs:  

Of note, all GT (genotype) fields in the vcfs will be forced to a ploidy (controlled by `--ploidy` which defaults to 2)

*  **vcfs/allComplexGenomicVariantCalls.vcf.gz** - All genomic variants called, will make complex variant calls (which combines calls of close by SNPs controlled by `--complexVariantWithinDist` default = 12)
*  **vcfs/allGenomicVariantCalls.vcf.gz** - All genomic variants called, if variation is covered by multiple targets, cars are collapsed (which can be controlled by flags `--combineOverlappingCallsAcrossTargets`, by default the target with more read coverage for a sample will be used for variant, this flag just sums across all covering targets, by summing the read counts will interference with any copy number variation coverage) 
*  **vcfs/allProteinVariantCalls.vcf.gz** - All protein variants called, chromosomes are protein transcripts  
*  **vcfs/knownAAChangesGenomicVariantCalls.vcf.gz** - If positions supplied, will subset to just those positions of the genomic locations that cover those 
*  **vcfs/knownAAChangesProteinVariantCalls.vcf.gz** - If positions supplied, will subset to just those positions 


### Locations info: 

*  **info/transcriptLocsForKnownAAChanges.bed** - The transcripts bed file of known mutations if supplied, will also contain meta on what genomic locations this covers 
*  **info/genomicLocsForKnownAAChanges.bed** - The genomic locations that overlap the known mutations if supplied 
*  **info/targetsIntersectingWithGenesInfo.tsv** - A list of the genes intersected with what targets  (to see what specific amino acids are covered go to [drugAndTopDiv_varCalls/\[TARGET\]/variantCalling/variantCalls/translatedInput.bed](drugAndTopDiv_varCalls/Pf07-0403507-0403717/variantCalling/variantCalls/translatedInput.bed))
*  **info/knownLocCoverage.tsv** - if known AA changes are supplied, this will reports which of your targets cover each known location 

### summary: 

*  **summary/AAChangesInfo.tsv.gz** - A summary of per amino acid change detected in the data, if known amino acid locations are given with known alts, will force calls for these changes even if not observed (so 0's will be placed) 
*  **summary/allSummaryTables.tab.txt.gz** - Summary of translated sequences and the variants detected/known for each sequence see   
*  **summary/hapCountsPerSamplePerTarget.tsv.gz** - A target by sample table of the number of haplotypes per target  
*  **summary/readCountsPerSamplePerTarget.tsv.gz** - A target by sample table of the read depth per target  
*  **summary/unmmappedUntranslatableHapCounts.tsv.gz** - A count of the number of haplotypes per target that couldn't be mapped and of ones that could be mapped but not translated  


#### Column names explantation  

##### AAChangesInfo.tsv.gz  

Many of the columns below were created to reflect previous programs naming schemes and are subject to change as feedback about this table is taken into account.  

*  **Gene_ID** - The gene identifier within the gff provided   
*  **Gene_Transcript_ID** - the mRNA transcript ID with in the gff provided  
*  **Gene** - common gene name, ff it exists (e.g. AMA1, CSP, DHPS, CRT)   
*  **Mutation_Name** - A name given to the position, will be [Gene]-[Ref-tricode-amino-acid]-[Pos]-(Alt-tricode-amino-acid)   
*  **ExonicFunc** - The type of type observed e.g. missense_variant  
*  **AA_Change** - Similar to **Mutation_Name** but without the Gene name
*  **Targeted** - Whether or not this change is in the table provided of known changes   
*  **CoveredBy** - What haplotypes targets cover this position    
*  **sample** - the sample for the which the count data in the columns below belongs to  
*  **reference_AA_pos** - The reference AA position (1-based, to reflect what is commonly reported)  
*  **reference_AA** - What the reference AA is   
*  **AA** - the amino acid for the sample for this **reference_AA_pos**    
*  **is_reference** - Whether or not this **AA**  is the reference or not   
*  **AA_cnt** - the within sample read count for this **AA**    
*  **AA_freq** - the within sample read frequency [0-1] for this **AA**   
*  **AA_withinSampleCoverage** - The total reads covering this amino acid position (the sum of all reads for all AA detected for this **reference_AA_pos** for this **sample**)  
*  **AlleleCount** - The population level count of number of haplotypes/alleles with this **Gene_Transcript_ID**-**reference_AA_pos**-**AA**
*  **AlleleFrequency** - The population level frequency [0-1] of haplotypes/alleles with this **Gene_Transcript_ID**-**reference_AA_pos**-**AA**
*  **SampleCount** - The population level count samples with this **Gene_Transcript_ID**-**reference_AA_pos**-**AA** 
*  **SamplePrevalence** - The population level frequency [0-1] of samples with this **Gene_Transcript_ID**-**reference_AA_pos**-**AA** 

##### allSummaryTables.tab.txt.gz  

In addition to the columns below, any meta data supplied with `--metaFnp` will also be included in this below table. The table will have a line for each time a sequence supplied so several fields will be repeated for the unqiue sequences 

*  **CollapsedName** - A name given to the collapsed unique sequences from the input   
*  **seq** - The sequence for this haplotype 
*  **originalIdentifier** - the original identifier in the input table for this sequence   
*  **readCount** - The read count for the sample for this line  
*  **sample** - the sample name in which this haplotype was found  
*  **target** - the target name for this sequence  
*  **chrom** - the genomic chromosome of this target  
*  **0based_start** - 0 based start position of the start of this haplotype in the genome supplied  
*  **0based_end** - the 0 based end postion (non-inclusive) of this haplotype in the genome supplied  
*  **length** - the length of the region this haplotype maps to 
*  **strand** - the strand the haplotype mapped to  
*  **transcript** - if this haplotype intersects any coding regions, the name of the transcript it overlaps with  
*  **translatedSeq** - if this haplotype intersects any coding regions, this will be the translated sequence (which is determine by trim to just coding DNA sequence)  
*  **transcript_1based_start** -  the start within the protein if it intersects any coding regions (1-based) 
*  **transcript_1based_end** - the end within the protein if it intersects any coding regions (1-based, inclusive) 
*  **transcript_length** - the length of the protein region the haplotype maps to if it maps to a coding sequence   
*  **transcript_knownAATyped** - if known amino acids are provided, this will have the calls for those positions (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)  
*  **transcript_fullAATyped** - this will have the calls of any variant determined in the population as well as any known supplied (format is REFAA-Pos when it matches reference and if it has an alt it will be RefAA-Pos-AltAA)  



### Diversity measures: 

*  **popGenetics/allDiversityMeasures.tsv.gz** - Several diversity measures calculated for each target  
*  **popGenetics/allTranslatedDivMeasures.tsv.gz** - The same diversity measures as above but for the translated sequences  

#### Column names explantation  

*  **id** - the name of the target    
*  **totalHaplotypes** - the total number of targets (this is different than number of samples since samples could have multiple unique haplotypes)    
*  **uniqueHaplotypes** - the total number of unique microhaplotypes   
*  **singlets** - the number of  microhaplotypes that are found only once    
*  **doublets** - the number of  microhaplotypes that are found at least twice  
*  **expShannonEntropy** - exp(ShannonEntropy) 
*  **ShannonEntropyE** - [ShannonEntropyE](https://www.sciencedirect.com/topics/engineering/shannon-entropy#:~:text=The%20Shannon%20entropy%20S%20(%20x,new%20value%20in%20the%20process.)  
*  **effectiveNumOfAlleles** - effective Num Of Alleles, would have to gather at least this number of haplotypes from the same population to get similar diversity measures as seen here  
*  **SimpsonIndex** - [Simpson Index of diversity](https://en.wikipedia.org/wiki/Diversity_index#Simpson_index)  
*  **he** - expected heterozygosity (the chance that if you drew ***two*** microhaplotypes at random from the population that they are different) 
*  **ExpP3** - the chance that if you drew ***three*** microhaplotypes at random from the population that they are three different microhaplotypes    
*  **ExpP4** - the chance that if you drew ***four*** microhaplotypes at random from the population that they are four different microhaplotypes  
*  **ExpP5** - the chance that if you drew ***five*** microhaplotypes at random from the population that they are five different microhaplotypes  
*  **lengthPolymorphism** - whether there is significant length polymorphism in the microhaplotypes  

if **\-\-getPairwiseComps** flag is used 

*  **avgPercentID** - the average percent identity between pairwise comparisons between microhaplotypes  
*  **avgNumOfDiffs** - the average number of differences between pairwise comparisons between microhaplotypes     
*  **NucleotideDiversity** - [NucleotideDiversity](https://en.wikipedia.org/wiki/Nucleotide_diversity)
*  **simpleAvalance** -  
*  **completeAvalance** -  
*  **nSegratingSites** - the number of sites that differ between seqs   
*  **TajimaD** - the Tajima's D for these microhaplotypes    
*  **TajimaDPVal** - the p-value for Tajima's D calc 

### Output files tables examples  

#### allDiversityMeasures.tsv.gz  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allDiversityMeasures.tsv.gz"))
```

#### allTranslatedDivMeasures.tsv.gz  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/popGenetics/allTranslatedDivMeasures.tsv.gz"))
```

#### genomicLocsForKnownAAChanges.bed  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))
```

#### transcriptLocsForKnownAAChanges.bed  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/transcriptLocsForKnownAAChanges.bed"))
```

#### targetsIntersectingWithGenesInfo.tsv  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/targetsIntersectingWithGenesInfo.tsv"))
```

#### knownLocCoverage.tsv  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/info/knownLocCoverage.tsv"))
```


#### unmmappedUntranslatableHapCounts.tsv.gz  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/unmappedUntranslatableHapCounts.tsv.gz"))
```




#### AAChangesInfo.tsv.gz  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/AAChangesInfo.tsv.gz", n_max = 100))
```

#### allSummaryTables.tab.txt.gz  

```{r}
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/summary/allSummaryTables.tab.txt.gz", n_max = 100))
```