Code
sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)
When running targeted amplicon analysis it’s helpful to use known control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiments designs, etc. For this purpose several utilities were created and added to SeekDeep
to help evaluate performance on several common metrics which will be explained below.
The three utilities, which only different slightly in input are:
SeekDeep benchmarkTarAmpControlMixtures
- Benchmark a single amplicon targetSeekDeep benchmarkMultiTarAmpControlMixtures
- Benchmark several amplicon targets at onceSeekDeep benchmarkControlMixturesOnProcessedClustersDir
- Benchmark on the output of the SeekDeep processClusters
results dir, which can be either single or multiple targetRequirement to put in a file that specifices which samples are control mixtures and what mixtures they contain, and a separator file that contains what strains and at what relative abundance they are in the mixture.
sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)
mixSetUp = readr::read_tsv("benchmarking/mixSetUpFnp.tsv")
create_dt(mixSetUp)
--skipMissingSamples
then you can add this flag to fill in the table with just 0’s for all metrics--sampleToMixture
--newColumnElement
will provide the actual values, e.g. --newColumnName Program,Technology --newColumnElement SeekDeep,Illumina
etcSeekDeep benchmarkTarAmpControlMixtures
This is for benchmarking a singular amplicon target analysis, see below
SeekDeep
outputs:
--popSeqsFnp
is provided--popSeqsFnp
, fasta record names should match --popHapIdColNamestrain
column in the --mixtureSetUp table, if a strain is completely missing a target a stand fasta record can be put in and it’s sequence as all Ns, this will indicate that mixtures with this strain won’t ever be able to detect strain given there’s no sequence to detectExamples
SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir
SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt
SeekDeep benchmarkMultiTarAmpControlMixtures
Very similar to SeekDeep benchmarkTarAmpControlMixtures
, the only difference are that there is an additional required column to indicate target and the input expected/observed inputs are directories within which should be a fasta file with TARGET_NAME.fasta
SeekDeep
outputs:
--popSeqsDirFnp
is provided--popSeqsDirFnp
, fasta files should be named TARGET_NAME.fasta and fasta record names should match --popHapIdColNamestrain
column in the --mixtureSetUp table, if a strain is completely missing a target a stand fasta record can be put in and it’s sequence as all Ns, this will indicate that mixtures with this strain won’t ever be able to detect strain given there’s no sequence to detectCan also set up different mixture expectation per target for when the mixture setup is more complex than above with simple lab strains, this is done by adding the --targetNameColName to the --sampleToMixture table and --mixtureSetUp table, like below:
allFieldMixtureInfo = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldMixtureInfo)
allFieldSampNameToMixName = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldSampNameToMixName)
Examples
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program
All three programs will have the same output.
Output files:
columns explained below
--skipMissingSamples
this will have the samples that were missing--sampleToMixture
input, copied over to help with downstream analysis--mixtureSetUp
input for the mixtures found within --sampleToMixture
Summation of performance each sample per target.
recoveredExpectedHaps + falseHaps
)This breaks down the performance per haplotype within in sample.
A break down of all the haplotypes that didn’t match expected sequences compare to the expected sequences (will have a comparison for each expected so each haplotype has multiple rows)
mismatches + oneBaseIndels + twoBaseIndels + largeIndels
)Similar to falseHaplotypesComparedToExpected.tsv but instead of comparing to the expected haplotypes, it compares to the rest of the haplotypes within the same sample. this can be helpful to see if there are a bunch of 1-off from the major haplotype within the sample
mismatches + oneBaseIndels + twoBaseIndels + largeIndels
)
---
title: "SeekDeep control mixture benchmarking"
---
<script>
$(document).ready(function() {
document.querySelectorAll('.downloadLink').forEach(function(e) { e.setAttribute('download', e.text); });
document.querySelectorAll('.downloadLink').forEach(function(e) { e.innerHTML = '<i class="fa fa-download"></i> ' + e.text; });
});
</script>
```{r setup, echo=FALSE, message=FALSE}
source("../common.R")
```
# Utilites for benchmarking performance on control mixtures
When running targeted amplicon analysis it's helpful to use known control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiments designs, etc. For this purpose several utilities were created and added to `SeekDeep` to help evaluate performance on several common metrics which will be explained below.
The three utilities, which only different slightly in input are:
1. `SeekDeep benchmarkTarAmpControlMixtures` - Benchmark a single amplicon target
1. `SeekDeep benchmarkMultiTarAmpControlMixtures` - Benchmark several amplicon targets at once
1. `SeekDeep benchmarkControlMixturesOnProcessedClustersDir` - Benchmark on the output of the `SeekDeep processClusters` results dir, which can be either single or multiple target
## Shared flags
### Required
Requirement to put in a file that specifices which samples are control mixtures and what mixtures they contain, and a separator file that contains what strains and at what relative abundance they are in the mixture.
* **\-\-sampleToMixture** - Sample To Mixture, 2 columns 1)sample, 2)MixName
* **\-\-mixtureSetUp** - Mixture Set Up, 3 columns 1)MixName, 2)strain, 3)relative_abundance
* The relative_abundance column can be any numbers and the frequencies will be calculated based off the sum, e.g. if there were 3 strains each with a relative abundance of 1 in this column then the expected frequencies would be 33.3% for all three strains,
#### --sampleToMixture example
```{r}
sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)
```
```{r}
#| results: asis
#| echo: false
cat(createDownloadLink("benchmarking/samplesToMixFnp.tsv"))
```
#### --mixtureSetUp example
```{r}
mixSetUp = readr::read_tsv("benchmarking/mixSetUpFnp.tsv")
create_dt(mixSetUp)
```
```{r}
#| results: asis
#| echo: false
cat(createDownloadLink("benchmarking/mixSetUpFnp.tsv"))
```
### Optional
* **\-\-skipMissingSamples** - By default if a control sample is missing the program will error out in case leaving out the control sample was a mistake, but sometimes a sample doesn't get enough reads in analysis so might be missing for that reason, for those cases can use this flag to simply skip missing samples rather than erroring out
* **\-\-fillInMissingSamples** - If using `--skipMissingSamples` then you can add this flag to fill in the table with just 0's for all metrics
* **\-\-metaFnp** - you can optionally add any associated meta data with the same to the output, input should be a tab deliminted file with at least 1 column titled **sample** and this column's values match up with the sample column in `--sampleToMixture`
* **\-\-newColumnName** - can optionally add one column value to all outputs, can be multiple values separated by commas, this will be the column names, `--newColumnElement` will provide the actual values, e.g. `--newColumnName Program,Technology --newColumnElement SeekDeep,Illumina` etc
* **\-\-newColumnElement** - The elements to fill the new columns if adding any via --newColumnName
## `SeekDeep benchmarkTarAmpControlMixtures` {#SeekDeep-benchmarkTarAmpControlMixtures}
This is for benchmarking a singular amplicon target analysis, see [below](#SeekDeep-benchmarkMultiTarAmpControlMixtures)
### Required
* **\-\-resultsFnp** - results tab delimited file, each row is a haplotype, should have at least 4 columns, 1) sample (--sampleColName), 2)within sample freq (--withinSampleFreqColName), 3)within sample read count (--withinSampleReadCntColName), 4)haplotype pop ID (--popHapIdColName), optionally 4th col with hap sequence (--popHapSeqColName) or read in from --popSeqsFnp
* Column names in the results file can be controlled by the following flags, the defaults are common columns used in `SeekDeep` outputs:
* **\-\-sampleColName** - The column name that contains the sample names (default:s_Sample)
* **\-\-withinSampleFreqColName** - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)
* **\-\-withinSampleReadCntColName** - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
* **\-\-popHapIdColName** - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
* **\-\-popHapSeqColName** - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if `--popSeqsFnp` is provided
* **\-\-popSeqsFnp** - if the sequences for the results aren't within the file they can be added by pointing to a fasta file via `--popSeqsFnp`, fasta record names should match **\-\-popHapIdColName**
* **\-\-expectedSeqsFnp** - A fasta file with the expected haplotype for this target, the name of the fasta record should match the `strain` column in the **\-\-mixtureSetUp** table, if a strain is completely missing a target a stand fasta record can be put in and it's sequence as all **N**s, this will indicate that mixtures with this strain won't ever be able to detect strain given there's no sequence to detect
* **\-\-name** - The name to be given to this analysis
Examples
```{bash, eval = F}
SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir
SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt
```
## `SeekDeep benchmarkMultiTarAmpControlMixtures` {#SeekDeep-benchmarkMultiTarAmpControlMixtures}
Very similar to [`SeekDeep benchmarkTarAmpControlMixtures`](#SeekDeep-benchmarkTarAmpControlMixtures), the only difference are that there is an additional required column to indicate target and the input expected/observed inputs are directories within which should be a fasta file with TARGET_NAME.fasta
* **\-\-resultsFnp** - results tab delimited file, each row is a haplotype, should have at least 5 columns, 1) sample (--sampleColName), 2)within sample freq (--withinSampleFreqColName), 3)within sample read count (--withinSampleReadCntColName), 4)haplotype pop ID (--popHapIdColName), 5)target name column (--targetNameColName), optionally 4th col with hap sequence (--popHapSeqColName) or read in from --popSeqsDirFnp
* Column names in the results file can be controlled by the following flags, the defaults are common columns used in `SeekDeep` outputs:
* **\-\-sampleColName** - The column name that contains the sample names (default:s_Sample)
* **\-\-withinSampleFreqColName** - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)
* **\-\-withinSampleReadCntColName** - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
* **\-\-popHapIdColName** - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
* **\-\-targetNameColName** - The column name that contains the target amplicon name (default:p_name)
* **\-\-popHapSeqColName** - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if `--popSeqsDirFnp` is provided
* **\-\-popSeqsDirFnp** - if the sequences for the results aren't within the results file they can be added by pointing to a directory of fasta files via `--popSeqsDirFnp`, fasta files should be named TARGET_NAME.fasta and fasta record names should match **\-\-popHapIdColName**
* **\-\-expectedSeqsDirFnp** - A directory of fasta files with the expected haplotype for each target, named TARGET_NAME.fasta, the name of the fasta record should match the `strain` column in the **\-\-mixtureSetUp** table, if a strain is completely missing a target a stand fasta record can be put in and it's sequence as all **N**s, this will indicate that mixtures with this strain won't ever be able to detect strain given there's no sequence to detect
Can also set up different mixture expectation per target for when the mixture setup is more complex than above with simple lab strains, this is done by adding the **\-\-targetNameColName** to the **\-\-sampleToMixture** table and **\-\-mixtureSetUp** table, like below:
```{r}
allFieldMixtureInfo = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldMixtureInfo)
allFieldSampNameToMixName = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldSampNameToMixName)
```
```{r}
#| results: asis
#| echo: false
cat(createDownloadLink("benchmarking/allFieldMixtureInfo.tsv"))
cat(" ")
cat(createDownloadLink("benchmarking/allFieldMixtureInfo.tsv"))
```
Examples
```{bash, eval = F}
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program
```
# Outputs
All three programs will have the same output.
Output files:
columns explained below
* performancePerTarget.tsv - Sum of performances per sample per target,
* classifiedHaplotypes.tsv - Each haplotype in the control samples classified as either matching expected or being a false haplotype, and associated information
* falseHaplotypesComparedToExpected.tsv - All the false haplotypes compare to the expected haplotypes with information about how many errors off from expected it is
* falseHaplotypesComparedToOthers.tsv - All the false haplotypes compare to the all other haplotypes for this target in the population with information about how closely it matches and what the difference in frequencies is
* missingSamples.txt - If `--skipMissingSamples` this will have the samples that were missing
* samplesToMix.tsv - This is a copy of the input `--sampleToMixture` input, copied over to help with downstream analysis
* mixSetUps.tsv - This has the information on the mixtures `--mixtureSetUp` input for the mixtures found within `--sampleToMixture`
* process.qmd - A quarto document will some R code for some basic plots of the performance reported as a starting point to modify to add more specific code
## performancePerTarget.tsv
Summation of performance each sample per target.
* **AnalysisName** - The name of the analysis, most commonly the target amplicon name
* **sample** - The name of the control sample
* **mix** - The name of the mixture for this sample
* **totalReads** - The total number of reads for this sample
* **recoveredExpectedHaps** - The number of haplotypes for this sample that matched an expected haplotype
* **falseHaps** - The number of haplotypes for this samples that don't match an expected haplotype
* **totalHaps** - The total number of final haplotypes for this sample (`recoveredExpectedHaps + falseHaps`)
* **totalExpectedHaps** - The total number of haplotypes that were expected for the mixture for this sample
* **expectedHapRecoveryRate** - The recovery rate of the expected haplotypes (calculated **recoveredExpectedHaps**/**totalExpectedHaps**)
* **falseHapsRate** - The rate of false haplotypes for this sample (calculated **falseHaps**/**totalHaps**)
* **expectedMissing** - If there are any expected haplotypes that are missing, will be reported here, semi-colon(;) separated
* **RMSE** - The root mean squared error for this sample, (calculated by taking the sum of squares of the differences of expected frequencies vs the observed frequencies of the recovered haplotypes and then taking the square root of the mean of that sum), the closer to zero the closer overall the mixture's expected frequencies were predicted[explanation here](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
## classifiedHaplotypes.tsv
This breaks down the performance per haplotype within in sample.
* **AnalysisName** - The name of the analysis, most commonly the target amplicon name
* **sample** - The name of the control sample
* **mix** - The name of the mixture for this sample
* **InputReadName** - The name of the haplotype
* **HapPopUID** - The population unique identify associated with this haplotype
* **HapSampleCount** - The number of samples this **HapPopUID** appears in (can be helpful to see if it's in one sample only or appears in multiple samples)
* **readCnt** - The number of reads for this **InputReadName**
* **frac** - The within sample frequency of this **InputReadName**
* **matchExpected** - **true** if matches expected or **false** if it doesn't
* **expectedRef** - The number of the matching expected haplotype if **matchExpected**==true
* **expectedFrac** - The expected frequency of the expected haplotype
* **MajorOrMinor** - Whether or not the expected haplotype was the Major haplotype in the mixture (if it had the highest relative abundance, haps with same abundance that are both max are both labeled as major haplotype) or a Minor haplotype
## falseHaplotypesComparedToExpected.tsv
A break down of all the haplotypes that didn't match expected sequences compare to the expected sequences (will have a comparison for each expected so each haplotype has multiple rows)
* **AnalysisName** - The name of the analysis, most commonly the target amplicon name
* **sample** - The name of the control sample
* **mix** - The name of the mixture for this sample
* **InputReadName** - The name of the haplotype
* **readCnt** - The number of reads for this **InputReadName**
* **frac** - The within sample frequency of this **InputReadName**
* **RefName** - The name of the expected haplotype being compared to
* **ExpectedRefFreq** - Frequency of the expected haplotype being compared to
* **ExpectedMajorOrMinor** - Whether the expected haplotype being compared to is the Major or Minor haplotype within the mixture
* **IdentityScore** - The percent identity score to the expected haplotype being compared to
* **bestMatchScore** - true/false for whether this identity score is the best match
* **mismatches** - The number of mismatches
* **oneBaseIndels** - The number of 1 base indels
* **twoBaseIndels** - The number of 2 base indels
* **largeIndels** - The number of >3 base indels
* **totalErrors** - The total number of errors (`mismatches + oneBaseIndels + twoBaseIndels + largeIndels`)
## falseHaplotypesComparedToExpected.tsv
Similar to **falseHaplotypesComparedToExpected.tsv** but instead of comparing to the expected haplotypes, it compares to the rest of the haplotypes within the same sample. this can be helpful to see if there are a bunch of 1-off from the major haplotype within the sample
* **AnalysisName** - The name of the analysis, most commonly the target amplicon name
* **sample** - The name of the control sample
* **mix** - The name of the mixture for this sample
* **InputReadName** - The name of the haplotype
* **readCnt** - The number of reads for this **InputReadName**
* **frac** - The within sample frequency of this **InputReadName**
* **OtherName** - The name of the other haplotype within the sample being compared to
* **OtherReadCnt** - The read count of the other haplotype within the sample being compared to
* **OtherFrac** - The frequency of the other haplotype within the sample being compared to
* **ratio** - The ratio in frequencies between the two haplotypes being compared (calculated **OtherFrac**/**frac**)
* **OtherMajor** - Whether or not the other haplotype is the major haplotype in the sample (e.g. has the highest frequency )
* **OtherMatchesExpected** - Whether the other haplotype matches an expected haplotype
* **OtherExpectedMatchName** - If the other haplotype matches an expected haplotype
* **IdentityScore** - The percent identity score to the expected haplotype being compared to
* **bestMatchScore** - true/false for whether this identity score is the best match
* **mismatches** - The number of mismatches
* **oneBaseIndels** - The number of 1 base indels
* **twoBaseIndels** - The number of 2 base indels
* **largeIndels** - The number of >3 base indels
* **totalErrors** - The total number of errors (`mismatches + oneBaseIndels + twoBaseIndels + largeIndels`)
{{< fa dna >}}