SeekDeep
  • Home
  • Installing
    • Mac OS
    • Ubuntu
    • Windows
    • Vagrant/virtual image (any system)
  • Code
    • Github
  • Usages

    • Pipeline
    • extractor/extractorPairedEnd
    • makeSampleDirectories
    • qluster
    • processClusters
    • popClusteringViewer

    • Pipeline Wrapper
    • setupTarAmpAnalysis

    • Utilities
    • genTargetInfoFromGenomes
    • SeekDeep control mixture benchmarking
    • SeekDeep Variant Calling
  • Tutorials

    • Single End
    • Ion Torrent with MIDs

    • Illumina Paired End
    • Paired End No MIDs/Barcodes
    • Paired End With MIDs/Barcodes
  • Misc Info
    • Illumina Paired End Info
  • References
    • versions
    • References
  1. SeekDeep control mixture benchmarking

Contents

  • Utilites for benchmarking performance on control mixtures
    • Shared flags
      • Required
        • –sampleToMixture example
        • –mixtureSetUp example
      • Optional
    • SeekDeep benchmarkTarAmpControlMixtures
      • Required
    • SeekDeep benchmarkMultiTarAmpControlMixtures
  • Outputs
    • performancePerTarget.tsv
    • classifiedHaplotypes.tsv
    • falseHaplotypesComparedToExpected.tsv
    • falseHaplotypesComparedToExpected.tsv

SeekDeep control mixture benchmarking

  • Show All Code
  • Hide All Code

  • View Source

Utilites for benchmarking performance on control mixtures

When running targeted amplicon analysis it’s helpful to use known control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiments designs, etc. For this purpose several utilities were created and added to SeekDeep to help evaluate performance on several common metrics which will be explained below.

The three utilities, which only different slightly in input are:

  1. SeekDeep benchmarkTarAmpControlMixtures - Benchmark a single amplicon target
  2. SeekDeep benchmarkMultiTarAmpControlMixtures - Benchmark several amplicon targets at once
  3. SeekDeep benchmarkControlMixturesOnProcessedClustersDir - Benchmark on the output of the SeekDeep processClusters results dir, which can be either single or multiple target

Shared flags

Required

Requirement to put in a file that specifices which samples are control mixtures and what mixtures they contain, and a separator file that contains what strains and at what relative abundance they are in the mixture.

  • --sampleToMixture - Sample To Mixture, 2 columns 1)sample, 2)MixName
  • --mixtureSetUp - Mixture Set Up, 3 columns 1)MixName, 2)strain, 3)relative_abundance
    • The relative_abundance column can be any numbers and the frequencies will be calculated based off the sum, e.g. if there were 3 strains each with a relative abundance of 1 in this column then the expected frequencies would be 33.3% for all three strains,

–sampleToMixture example

Code
sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)

samplesToMixFnp.tsv

–mixtureSetUp example

Code
mixSetUp = readr::read_tsv("benchmarking/mixSetUpFnp.tsv")
create_dt(mixSetUp)

mixSetUpFnp.tsv

Optional

  • --skipMissingSamples - By default if a control sample is missing the program will error out in case leaving out the control sample was a mistake, but sometimes a sample doesn’t get enough reads in analysis so might be missing for that reason, for those cases can use this flag to simply skip missing samples rather than erroring out
  • --fillInMissingSamples - If using --skipMissingSamples then you can add this flag to fill in the table with just 0’s for all metrics
  • --metaFnp - you can optionally add any associated meta data with the same to the output, input should be a tab deliminted file with at least 1 column titled sample and this column’s values match up with the sample column in --sampleToMixture
  • --newColumnName - can optionally add one column value to all outputs, can be multiple values separated by commas, this will be the column names, --newColumnElement will provide the actual values, e.g. --newColumnName Program,Technology --newColumnElement SeekDeep,Illumina etc
  • --newColumnElement - The elements to fill the new columns if adding any via –newColumnName

SeekDeep benchmarkTarAmpControlMixtures

This is for benchmarking a singular amplicon target analysis, see below

Required

  • --resultsFnp - results tab delimited file, each row is a haplotype, should have at least 4 columns, 1) sample (–sampleColName), 2)within sample freq (–withinSampleFreqColName), 3)within sample read count (–withinSampleReadCntColName), 4)haplotype pop ID (–popHapIdColName), optionally 4th col with hap sequence (–popHapSeqColName) or read in from –popSeqsFnp
    • Column names in the results file can be controlled by the following flags, the defaults are common columns used in SeekDeep outputs:
      • --sampleColName - The column name that contains the sample names (default:s_Sample)
      • --withinSampleFreqColName - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)
      • --withinSampleReadCntColName - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
      • --popHapIdColName - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
      • --popHapSeqColName - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if --popSeqsFnp is provided
    • --popSeqsFnp - if the sequences for the results aren’t within the file they can be added by pointing to a fasta file via --popSeqsFnp, fasta record names should match --popHapIdColName
  • --expectedSeqsFnp - A fasta file with the expected haplotype for this target, the name of the fasta record should match the strain column in the --mixtureSetUp table, if a strain is completely missing a target a stand fasta record can be put in and it’s sequence as all Ns, this will indicate that mixtures with this strain won’t ever be able to detect strain given there’s no sequence to detect
  • --name - The name to be given to this analysis

Examples

Code

SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir 

SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt

SeekDeep benchmarkMultiTarAmpControlMixtures

Very similar to SeekDeep benchmarkTarAmpControlMixtures, the only difference are that there is an additional required column to indicate target and the input expected/observed inputs are directories within which should be a fasta file with TARGET_NAME.fasta

  • --resultsFnp - results tab delimited file, each row is a haplotype, should have at least 5 columns, 1) sample (–sampleColName), 2)within sample freq (–withinSampleFreqColName), 3)within sample read count (–withinSampleReadCntColName), 4)haplotype pop ID (–popHapIdColName), 5)target name column (–targetNameColName), optionally 4th col with hap sequence (–popHapSeqColName) or read in from –popSeqsDirFnp
    • Column names in the results file can be controlled by the following flags, the defaults are common columns used in SeekDeep outputs:
      • --sampleColName - The column name that contains the sample names (default:s_Sample)
      • --withinSampleFreqColName - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)
      • --withinSampleReadCntColName - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
      • --popHapIdColName - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
      • --targetNameColName - The column name that contains the target amplicon name (default:p_name)
      • --popHapSeqColName - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if --popSeqsDirFnp is provided
    • --popSeqsDirFnp - if the sequences for the results aren’t within the results file they can be added by pointing to a directory of fasta files via --popSeqsDirFnp, fasta files should be named TARGET_NAME.fasta and fasta record names should match --popHapIdColName
  • --expectedSeqsDirFnp - A directory of fasta files with the expected haplotype for each target, named TARGET_NAME.fasta, the name of the fasta record should match the strain column in the --mixtureSetUp table, if a strain is completely missing a target a stand fasta record can be put in and it’s sequence as all Ns, this will indicate that mixtures with this strain won’t ever be able to detect strain given there’s no sequence to detect

Can also set up different mixture expectation per target for when the mixture setup is more complex than above with simple lab strains, this is done by adding the --targetNameColName to the --sampleToMixture table and --mixtureSetUp table, like below:

Code
allFieldMixtureInfo = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldMixtureInfo)
Code
allFieldSampNameToMixName = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldSampNameToMixName)

allFieldMixtureInfo.tsv allFieldMixtureInfo.tsv

Examples

Code

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir 

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program

Outputs

All three programs will have the same output.

Output files:
columns explained below

  • performancePerTarget.tsv - Sum of performances per sample per target,
  • classifiedHaplotypes.tsv - Each haplotype in the control samples classified as either matching expected or being a false haplotype, and associated information
  • falseHaplotypesComparedToExpected.tsv - All the false haplotypes compare to the expected haplotypes with information about how many errors off from expected it is
  • falseHaplotypesComparedToOthers.tsv - All the false haplotypes compare to the all other haplotypes for this target in the population with information about how closely it matches and what the difference in frequencies is
  • missingSamples.txt - If --skipMissingSamples this will have the samples that were missing
  • samplesToMix.tsv - This is a copy of the input --sampleToMixture input, copied over to help with downstream analysis
  • mixSetUps.tsv - This has the information on the mixtures --mixtureSetUp input for the mixtures found within --sampleToMixture
  • process.qmd - A quarto document will some R code for some basic plots of the performance reported as a starting point to modify to add more specific code

performancePerTarget.tsv

Summation of performance each sample per target.

  • AnalysisName - The name of the analysis, most commonly the target amplicon name
  • sample - The name of the control sample
  • mix - The name of the mixture for this sample
  • totalReads - The total number of reads for this sample
  • recoveredExpectedHaps - The number of haplotypes for this sample that matched an expected haplotype
  • falseHaps - The number of haplotypes for this samples that don’t match an expected haplotype
  • totalHaps - The total number of final haplotypes for this sample (recoveredExpectedHaps + falseHaps)
  • totalExpectedHaps - The total number of haplotypes that were expected for the mixture for this sample
  • expectedHapRecoveryRate - The recovery rate of the expected haplotypes (calculated recoveredExpectedHaps/totalExpectedHaps)
  • falseHapsRate - The rate of false haplotypes for this sample (calculated falseHaps/totalHaps)
  • expectedMissing - If there are any expected haplotypes that are missing, will be reported here, semi-colon(;) separated
  • RMSE - The root mean squared error for this sample, (calculated by taking the sum of squares of the differences of expected frequencies vs the observed frequencies of the recovered haplotypes and then taking the square root of the mean of that sum), the closer to zero the closer overall the mixture’s expected frequencies were predictedexplanation here

classifiedHaplotypes.tsv

This breaks down the performance per haplotype within in sample.

  • AnalysisName - The name of the analysis, most commonly the target amplicon name
  • sample - The name of the control sample
  • mix - The name of the mixture for this sample
  • InputReadName - The name of the haplotype
  • HapPopUID - The population unique identify associated with this haplotype
  • HapSampleCount - The number of samples this HapPopUID appears in (can be helpful to see if it’s in one sample only or appears in multiple samples)
  • readCnt - The number of reads for this InputReadName
  • frac - The within sample frequency of this InputReadName
  • matchExpected - true if matches expected or false if it doesn’t
  • expectedRef - The number of the matching expected haplotype if matchExpected==true
  • expectedFrac - The expected frequency of the expected haplotype
  • MajorOrMinor - Whether or not the expected haplotype was the Major haplotype in the mixture (if it had the highest relative abundance, haps with same abundance that are both max are both labeled as major haplotype) or a Minor haplotype

falseHaplotypesComparedToExpected.tsv

A break down of all the haplotypes that didn’t match expected sequences compare to the expected sequences (will have a comparison for each expected so each haplotype has multiple rows)

  • AnalysisName - The name of the analysis, most commonly the target amplicon name
  • sample - The name of the control sample
  • mix - The name of the mixture for this sample
  • InputReadName - The name of the haplotype
  • readCnt - The number of reads for this InputReadName
  • frac - The within sample frequency of this InputReadName
  • RefName - The name of the expected haplotype being compared to
  • ExpectedRefFreq - Frequency of the expected haplotype being compared to
  • ExpectedMajorOrMinor - Whether the expected haplotype being compared to is the Major or Minor haplotype within the mixture
  • IdentityScore - The percent identity score to the expected haplotype being compared to
  • bestMatchScore - true/false for whether this identity score is the best match
  • mismatches - The number of mismatches
  • oneBaseIndels - The number of 1 base indels
  • twoBaseIndels - The number of 2 base indels
  • largeIndels - The number of >3 base indels
  • totalErrors - The total number of errors (mismatches + oneBaseIndels + twoBaseIndels + largeIndels)

falseHaplotypesComparedToExpected.tsv

Similar to falseHaplotypesComparedToExpected.tsv but instead of comparing to the expected haplotypes, it compares to the rest of the haplotypes within the same sample. this can be helpful to see if there are a bunch of 1-off from the major haplotype within the sample

  • AnalysisName - The name of the analysis, most commonly the target amplicon name
  • sample - The name of the control sample
  • mix - The name of the mixture for this sample
  • InputReadName - The name of the haplotype
  • readCnt - The number of reads for this InputReadName
  • frac - The within sample frequency of this InputReadName
  • OtherName - The name of the other haplotype within the sample being compared to
  • OtherReadCnt - The read count of the other haplotype within the sample being compared to
  • OtherFrac - The frequency of the other haplotype within the sample being compared to
  • ratio - The ratio in frequencies between the two haplotypes being compared (calculated OtherFrac/frac)
  • OtherMajor - Whether or not the other haplotype is the major haplotype in the sample (e.g. has the highest frequency )
  • OtherMatchesExpected - Whether the other haplotype matches an expected haplotype
  • OtherExpectedMatchName - If the other haplotype matches an expected haplotype
  • IdentityScore - The percent identity score to the expected haplotype being compared to
  • bestMatchScore - true/false for whether this identity score is the best match
  • mismatches - The number of mismatches
  • oneBaseIndels - The number of 1 base indels
  • twoBaseIndels - The number of 2 base indels
  • largeIndels - The number of >3 base indels
  • totalErrors - The total number of errors (mismatches + oneBaseIndels + twoBaseIndels + largeIndels)

Source Code
---
title: "SeekDeep control mixture benchmarking"
---

<script>
$(document).ready(function() {
    document.querySelectorAll('.downloadLink').forEach(function(e) { e.setAttribute('download', e.text); });
    document.querySelectorAll('.downloadLink').forEach(function(e) { e.innerHTML = '<i class="fa fa-download"></i>  ' + e.text; });
});
</script>




```{r setup, echo=FALSE, message=FALSE}
source("../common.R")
```


# Utilites for benchmarking performance on control mixtures 

When running targeted amplicon analysis it's helpful to use known control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiments designs, etc. For this purpose several utilities were created and added to `SeekDeep` to help evaluate performance on several common metrics which will be explained below. 

The three utilities, which only different slightly in input are: 


1.  `SeekDeep benchmarkTarAmpControlMixtures` - Benchmark a single amplicon target   
1.  `SeekDeep benchmarkMultiTarAmpControlMixtures` - Benchmark several amplicon targets at once    
1.  `SeekDeep benchmarkControlMixturesOnProcessedClustersDir` - Benchmark on the output of the `SeekDeep processClusters` results dir, which can be either single or multiple target  

## Shared flags 

### Required  
Requirement to put in a file that specifices which samples are control mixtures and what mixtures they contain, and a separator file that contains what strains and at what relative abundance they are in the mixture.  


*  **\-\-sampleToMixture** - Sample To Mixture, 2 columns 1)sample, 2)MixName     
*  **\-\-mixtureSetUp** - Mixture Set Up, 3 columns 1)MixName, 2)strain, 3)relative_abundance    
    *  The relative_abundance column can be any numbers and the frequencies will be calculated based off the sum, e.g. if there were 3 strains each with a relative abundance of 1 in this column then the expected frequencies would be 33.3% for all three strains, 

#### --sampleToMixture example 

```{r}
sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)
```

```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("benchmarking/samplesToMixFnp.tsv"))
```

#### --mixtureSetUp example 

```{r}
mixSetUp = readr::read_tsv("benchmarking/mixSetUpFnp.tsv")
create_dt(mixSetUp)
```

```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("benchmarking/mixSetUpFnp.tsv"))
```

### Optional  

*  **\-\-skipMissingSamples** - By default if a control sample is missing the program will error out in case leaving out the control sample was a mistake, but sometimes a sample doesn't get enough reads in analysis so might be missing for that reason, for those cases can use this flag to simply skip missing samples rather than erroring out     
*  **\-\-fillInMissingSamples** - If using `--skipMissingSamples` then you can add this flag to fill in the table with just 0's for all metrics    
*  **\-\-metaFnp** - you can optionally add any associated meta data with the same to the output, input should be a tab deliminted file with at least 1 column titled **sample** and this column's values match up with the sample column in `--sampleToMixture`      
*  **\-\-newColumnName** - can optionally add one column value to all outputs, can be multiple values separated by commas, this will be the column names, `--newColumnElement` will provide the actual values, e.g. `--newColumnName Program,Technology --newColumnElement SeekDeep,Illumina` etc    
*  **\-\-newColumnElement** - The elements to fill the new columns if adding any via --newColumnName     
 


## `SeekDeep benchmarkTarAmpControlMixtures` {#SeekDeep-benchmarkTarAmpControlMixtures}  

This is for benchmarking a singular amplicon target analysis, see [below](#SeekDeep-benchmarkMultiTarAmpControlMixtures) 

### Required  

*  **\-\-resultsFnp** - results tab delimited file, each row is a haplotype, should have at least 4 columns, 1) sample (--sampleColName), 2)within sample freq (--withinSampleFreqColName), 3)within sample read count (--withinSampleReadCntColName), 4)haplotype pop ID (--popHapIdColName), optionally 4th col with hap sequence (--popHapSeqColName) or read in from --popSeqsFnp   
    *  Column names in the results file can be controlled by the following flags, the defaults are common columns used in `SeekDeep` outputs:
        *  **\-\-sampleColName** - The column name that contains the sample names (default:s_Sample)  
        *  **\-\-withinSampleFreqColName** - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac) 
        *  **\-\-withinSampleReadCntColName** - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
        *  **\-\-popHapIdColName** - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
        *  **\-\-popHapSeqColName** - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if `--popSeqsFnp` is provided 
    *  **\-\-popSeqsFnp** - if the sequences for the results aren't within the file they can be added by pointing to a fasta file via `--popSeqsFnp`, fasta record names should match **\-\-popHapIdColName**    
*  **\-\-expectedSeqsFnp** - A fasta file with the expected haplotype for this target, the name of the fasta record should match the `strain` column in the **\-\-mixtureSetUp** table, if a strain is completely missing a target a stand fasta record can be put in and it's sequence as all **N**s, this will indicate that mixtures with this strain won't ever be able to detect strain given there's no sequence to detect    
*  **\-\-name** - The name to be given to this analysis        
  

Examples  

```{bash, eval = F}

SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir 

SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt

```

## `SeekDeep benchmarkMultiTarAmpControlMixtures` {#SeekDeep-benchmarkMultiTarAmpControlMixtures}  

Very similar to [`SeekDeep benchmarkTarAmpControlMixtures`](#SeekDeep-benchmarkTarAmpControlMixtures), the only difference are that there is an additional required column to indicate target and the input expected/observed inputs are directories within which should be a fasta file with TARGET_NAME.fasta 

*  **\-\-resultsFnp** - results tab delimited file, each row is a haplotype, should have at least 5 columns, 1) sample (--sampleColName), 2)within sample freq (--withinSampleFreqColName), 3)within sample read count (--withinSampleReadCntColName), 4)haplotype pop ID (--popHapIdColName), 5)target name column (--targetNameColName), optionally 4th col with hap sequence (--popHapSeqColName) or read in from --popSeqsDirFnp    
    *  Column names in the results file can be controlled by the following flags, the defaults are common columns used in `SeekDeep` outputs:  
        *  **\-\-sampleColName** - The column name that contains the sample names (default:s_Sample)   
        *  **\-\-withinSampleFreqColName** - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)   
        *  **\-\-withinSampleReadCntColName** - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)  
        *  **\-\-popHapIdColName** - The column name that contains the unique population identifier for this haplotype (default:h_popUID)  
        *  **\-\-targetNameColName** - The column name that contains the target amplicon name (default:p_name)  
        *  **\-\-popHapSeqColName** - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if `--popSeqsDirFnp` is provided 
    *  **\-\-popSeqsDirFnp** - if the sequences for the results aren't within the results file they can be added by pointing to a directory of fasta files via `--popSeqsDirFnp`, fasta files should be named TARGET_NAME.fasta and fasta record names should match **\-\-popHapIdColName**    
*  **\-\-expectedSeqsDirFnp** - A directory of fasta files with the expected haplotype for each target, named TARGET_NAME.fasta, the name of the fasta record should match the `strain` column in the **\-\-mixtureSetUp** table, if a strain is completely missing a target a stand fasta record can be put in and it's sequence as all **N**s, this will indicate that mixtures with this strain won't ever be able to detect strain given there's no sequence to detect    

Can also set up different mixture expectation per target for when the mixture setup is more complex than above with simple lab strains, this is done by adding the **\-\-targetNameColName** to the  **\-\-sampleToMixture** table and **\-\-mixtureSetUp** table, like below: 

```{r}
allFieldMixtureInfo = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldMixtureInfo)

allFieldSampNameToMixName = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldSampNameToMixName)
```

```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("benchmarking/allFieldMixtureInfo.tsv"))
cat("  ")
cat(createDownloadLink("benchmarking/allFieldMixtureInfo.tsv"))
```


Examples  

```{bash, eval = F}

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir 

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program

```




# Outputs  
All three programs will have the same output. 

Output files:  
columns explained below 

*  performancePerTarget.tsv - Sum of performances per sample per target,    
*  classifiedHaplotypes.tsv - Each haplotype in the control samples classified as either matching expected or being a false haplotype, and associated information  
*  falseHaplotypesComparedToExpected.tsv - All the false haplotypes compare to the expected haplotypes with information about how many errors off from expected it is    
*  falseHaplotypesComparedToOthers.tsv - All the false haplotypes compare to the all other haplotypes for this target in the population with information about how closely it matches and what the difference in frequencies is    
*  missingSamples.txt - If `--skipMissingSamples` this will have the samples that were missing    
*  samplesToMix.tsv - This is a copy of the input `--sampleToMixture` input, copied over to help with downstream analysis     
*  mixSetUps.tsv - This has the information on the mixtures `--mixtureSetUp` input for the mixtures found within `--sampleToMixture`      
*  process.qmd - A quarto document will some R code for some basic plots of the performance reported as a starting point to modify to add more specific code     

## performancePerTarget.tsv  

Summation of performance each sample per target.  

*  **AnalysisName** - The name of the analysis, most commonly the target amplicon name    
*  **sample** - The name of the control sample    
*  **mix** - The name of the mixture for this sample    
*  **totalReads** - The total number of reads for this sample    
*  **recoveredExpectedHaps** - The number of haplotypes for this sample that matched an expected haplotype    
*  **falseHaps** - The number of haplotypes for this samples that don't match an expected haplotype    
*  **totalHaps** - The total number of final haplotypes for this sample (`recoveredExpectedHaps + falseHaps`)  
*  **totalExpectedHaps** - The total number of haplotypes that were expected for the mixture for this sample   
*  **expectedHapRecoveryRate** - The recovery rate of the expected haplotypes (calculated **recoveredExpectedHaps**/**totalExpectedHaps**)  
*  **falseHapsRate** - The rate of false haplotypes for this sample (calculated **falseHaps**/**totalHaps**)  
*  **expectedMissing** - If there are any expected haplotypes that are missing, will be reported here, semi-colon(;) separated   
*  **RMSE** - The root mean squared error for this sample, (calculated by taking the sum of squares of the differences of expected frequencies vs the observed frequencies of the recovered haplotypes and then taking the square root of the mean of that sum), the closer to zero the closer overall the mixture's expected frequencies were predicted[explanation here](https://en.wikipedia.org/wiki/Root-mean-square_deviation)

## classifiedHaplotypes.tsv  

This breaks down the performance per haplotype within in sample. 

*  **AnalysisName** - The name of the analysis, most commonly the target amplicon name    
*  **sample** - The name of the control sample    
*  **mix** - The name of the mixture for this sample    
*  **InputReadName** - The name of the haplotype   
*  **HapPopUID** - The population unique identify associated with this haplotype    
*  **HapSampleCount** - The number of samples this **HapPopUID** appears in (can be helpful to see if it's in one sample only or appears in multiple samples) 
*  **readCnt** - The number of reads for this **InputReadName**   
*  **frac** - The within sample frequency of this **InputReadName**    
*  **matchExpected** - **true** if matches expected or **false** if it doesn't    
*  **expectedRef** - The number of the matching expected haplotype if **matchExpected**==true  
*  **expectedFrac** - The expected frequency of the expected haplotype   
*  **MajorOrMinor** - Whether or not the expected haplotype was the Major haplotype in the mixture (if it had the highest relative abundance, haps with same abundance that are both max are both labeled as major haplotype) or a Minor haplotype   


## falseHaplotypesComparedToExpected.tsv  

A break down of all the haplotypes that didn't match expected sequences compare to the expected sequences (will have a comparison for each expected so each haplotype has multiple rows)

*  **AnalysisName** - The name of the analysis, most commonly the target amplicon name    
*  **sample** - The name of the control sample    
*  **mix** - The name of the mixture for this sample    
*  **InputReadName** - The name of the haplotype   
*  **readCnt** - The number of reads for this **InputReadName**   
*  **frac** - The within sample frequency of this **InputReadName**    
*  **RefName** - The name of the expected haplotype being compared to   
*  **ExpectedRefFreq** - Frequency of the expected haplotype being compared to  
*  **ExpectedMajorOrMinor** - Whether the expected haplotype being compared to is the Major or Minor haplotype within the mixture    
*  **IdentityScore** - The percent identity score to the expected haplotype being compared to   
*  **bestMatchScore** - true/false for whether this identity score is the best match   
*  **mismatches** - The number of mismatches  
*  **oneBaseIndels** - The number of 1 base indels    
*  **twoBaseIndels** - The number of 2 base indels  
*  **largeIndels** - The number of >3 base indels  
*  **totalErrors** - The total number of errors (`mismatches + oneBaseIndels + twoBaseIndels + largeIndels`) 

## falseHaplotypesComparedToExpected.tsv  

Similar to **falseHaplotypesComparedToExpected.tsv** but instead of comparing to the expected haplotypes, it compares to the rest of the haplotypes within the same sample. this can be helpful to see if there are a bunch of 1-off from the major haplotype within the sample  


*  **AnalysisName** - The name of the analysis, most commonly the target amplicon name    
*  **sample** - The name of the control sample    
*  **mix** - The name of the mixture for this sample    
*  **InputReadName** - The name of the haplotype   
*  **readCnt** - The number of reads for this **InputReadName**   
*  **frac** - The within sample frequency of this **InputReadName**   
*  **OtherName** - The name of the other haplotype within the sample being compared to   
*  **OtherReadCnt** - The read count of the other haplotype within the sample being compared to  
*  **OtherFrac** - The frequency of the other haplotype within the sample being compared to    
*  **ratio** - The ratio in frequencies between the two haplotypes being compared (calculated **OtherFrac**/**frac**)  
*  **OtherMajor** - Whether or not the other haplotype is the major haplotype in the sample (e.g. has the highest frequency )  
*  **OtherMatchesExpected** - Whether the other haplotype matches an expected haplotype   
*  **OtherExpectedMatchName** - If the other haplotype matches an expected haplotype  
*  **IdentityScore** - The percent identity score to the expected haplotype being compared to   
*  **bestMatchScore** - true/false for whether this identity score is the best match   
*  **mismatches** - The number of mismatches  
*  **oneBaseIndels** - The number of 1 base indels    
*  **twoBaseIndels** - The number of 2 base indels  
*  **largeIndels** - The number of >3 base indels  
*  **totalErrors** - The total number of errors (`mismatches + oneBaseIndels + twoBaseIndels + largeIndels`) 


{{< fa dna >}}