SeekDeep control mixture benchmarking

Utilites for benchmarking performance on control mixtures

When running targeted amplicon analysis it’s helpful to use known control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiments designs, etc. For this purpose several utilities were created and added to SeekDeep to help evaluate performance on several common metrics which will be explained below.

The three utilities, which only different slightly in input are:

SeekDeep benchmarkTarAmpControlMixtures - Benchmark a single amplicon target
SeekDeep benchmarkMultiTarAmpControlMixtures - Benchmark several amplicon targets at once
SeekDeep benchmarkControlMixturesOnProcessedClustersDir - Benchmark on the output of the SeekDeep processClusters results dir, which can be either single or multiple target

Shared flags

Required

Requirement to put in a file that specifices which samples are control mixtures and what mixtures they contain, and a separator file that contains what strains and at what relative abundance they are in the mixture.

--sampleToMixture - Sample To Mixture, 2 columns 1)sample, 2)MixName
--mixtureSetUp - Mixture Set Up, 3 columns 1)MixName, 2)strain, 3)relative_abundance
- The relative_abundance column can be any numbers and the frequencies will be calculated based off the sum, e.g. if there were 3 strains each with a relative abundance of 1 in this column then the expected frequencies would be 33.3% for all three strains,

–sampleToMixture example

Code

sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)

samplesToMixFnp.tsv

–mixtureSetUp example

Code

mixSetUp = readr::read_tsv("benchmarking/mixSetUpFnp.tsv")
create_dt(mixSetUp)

mixSetUpFnp.tsv

Optional

--skipMissingSamples - By default if a control sample is missing the program will error out in case leaving out the control sample was a mistake, but sometimes a sample doesn’t get enough reads in analysis so might be missing for that reason, for those cases can use this flag to simply skip missing samples rather than erroring out
--fillInMissingSamples - If using --skipMissingSamples then you can add this flag to fill in the table with just 0’s for all metrics
--metaFnp - you can optionally add any associated meta data with the same to the output, input should be a tab deliminted file with at least 1 column titled sample and this column’s values match up with the sample column in --sampleToMixture
--newColumnName - can optionally add one column value to all outputs, can be multiple values separated by commas, this will be the column names, --newColumnElement will provide the actual values, e.g. --newColumnName Program,Technology --newColumnElement SeekDeep,Illumina etc
--newColumnElement - The elements to fill the new columns if adding any via –newColumnName

`SeekDeep benchmarkTarAmpControlMixtures`

This is for benchmarking a singular amplicon target analysis, see below

Required

--resultsFnp - results tab delimited file, each row is a haplotype, should have at least 4 columns, 1) sample (–sampleColName), 2)within sample freq (–withinSampleFreqColName), 3)within sample read count (–withinSampleReadCntColName), 4)haplotype pop ID (–popHapIdColName), optionally 4th col with hap sequence (–popHapSeqColName) or read in from –popSeqsFnp
- Column names in the results file can be controlled by the following flags, the defaults are common columns used in SeekDeep outputs:
  - --sampleColName - The column name that contains the sample names (default:s_Sample)
  - --withinSampleFreqColName - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)
  - --withinSampleReadCntColName - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
  - --popHapIdColName - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
  - --popHapSeqColName - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if --popSeqsFnp is provided
- --popSeqsFnp - if the sequences for the results aren’t within the file they can be added by pointing to a fasta file via --popSeqsFnp, fasta record names should match --popHapIdColName
--expectedSeqsFnp - A fasta file with the expected haplotype for this target, the name of the fasta record should match the strain column in the --mixtureSetUp table, if a strain is completely missing a target a stand fasta record can be put in and it’s sequence as all Ns, this will indicate that mixtures with this strain won’t ever be able to detect strain given there’s no sequence to detect
--name - The name to be given to this analysis

Examples

Code


SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir 

SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt

`SeekDeep benchmarkMultiTarAmpControlMixtures`

Very similar to SeekDeep benchmarkTarAmpControlMixtures, the only difference are that there is an additional required column to indicate target and the input expected/observed inputs are directories within which should be a fasta file with TARGET_NAME.fasta

--resultsFnp - results tab delimited file, each row is a haplotype, should have at least 5 columns, 1) sample (–sampleColName), 2)within sample freq (–withinSampleFreqColName), 3)within sample read count (–withinSampleReadCntColName), 4)haplotype pop ID (–popHapIdColName), 5)target name column (–targetNameColName), optionally 4th col with hap sequence (–popHapSeqColName) or read in from –popSeqsDirFnp
- Column names in the results file can be controlled by the following flags, the defaults are common columns used in SeekDeep outputs:
  - --sampleColName - The column name that contains the sample names (default:s_Sample)
  - --withinSampleFreqColName - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac)
  - --withinSampleReadCntColName - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt)
  - --popHapIdColName - The column name that contains the unique population identifier for this haplotype (default:h_popUID)
  - --targetNameColName - The column name that contains the target amplicon name (default:p_name)
  - --popHapSeqColName - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if --popSeqsDirFnp is provided
- --popSeqsDirFnp - if the sequences for the results aren’t within the results file they can be added by pointing to a directory of fasta files via --popSeqsDirFnp, fasta files should be named TARGET_NAME.fasta and fasta record names should match --popHapIdColName
--expectedSeqsDirFnp - A directory of fasta files with the expected haplotype for each target, named TARGET_NAME.fasta, the name of the fasta record should match the strain column in the --mixtureSetUp table, if a strain is completely missing a target a stand fasta record can be put in and it’s sequence as all Ns, this will indicate that mixtures with this strain won’t ever be able to detect strain given there’s no sequence to detect

Can also set up different mixture expectation per target for when the mixture setup is more complex than above with simple lab strains, this is done by adding the --targetNameColName to the --sampleToMixture table and --mixtureSetUp table, like below:

Code

allFieldMixtureInfo = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldMixtureInfo)

Code

allFieldSampNameToMixName = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv")
create_dt(allFieldSampNameToMixName)

allFieldMixtureInfo.tsv allFieldMixtureInfo.tsv

Examples

Code


SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir 

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples

SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program

Outputs

All three programs will have the same output.

Output files:
columns explained below

performancePerTarget.tsv - Sum of performances per sample per target,
classifiedHaplotypes.tsv - Each haplotype in the control samples classified as either matching expected or being a false haplotype, and associated information
falseHaplotypesComparedToExpected.tsv - All the false haplotypes compare to the expected haplotypes with information about how many errors off from expected it is
falseHaplotypesComparedToOthers.tsv - All the false haplotypes compare to the all other haplotypes for this target in the population with information about how closely it matches and what the difference in frequencies is
missingSamples.txt - If --skipMissingSamples this will have the samples that were missing
samplesToMix.tsv - This is a copy of the input --sampleToMixture input, copied over to help with downstream analysis
mixSetUps.tsv - This has the information on the mixtures --mixtureSetUp input for the mixtures found within --sampleToMixture
process.qmd - A quarto document will some R code for some basic plots of the performance reported as a starting point to modify to add more specific code

performancePerTarget.tsv

Summation of performance each sample per target.

AnalysisName - The name of the analysis, most commonly the target amplicon name
sample - The name of the control sample
mix - The name of the mixture for this sample
totalReads - The total number of reads for this sample
recoveredExpectedHaps - The number of haplotypes for this sample that matched an expected haplotype
falseHaps - The number of haplotypes for this samples that don’t match an expected haplotype
totalHaps - The total number of final haplotypes for this sample (recoveredExpectedHaps + falseHaps)
totalExpectedHaps - The total number of haplotypes that were expected for the mixture for this sample
expectedHapRecoveryRate - The recovery rate of the expected haplotypes (calculated recoveredExpectedHaps/totalExpectedHaps)
falseHapsRate - The rate of false haplotypes for this sample (calculated falseHaps/totalHaps)
expectedMissing - If there are any expected haplotypes that are missing, will be reported here, semi-colon(;) separated
RMSE - The root mean squared error for this sample, (calculated by taking the sum of squares of the differences of expected frequencies vs the observed frequencies of the recovered haplotypes and then taking the square root of the mean of that sum), the closer to zero the closer overall the mixture’s expected frequencies were predictedexplanation here

classifiedHaplotypes.tsv

This breaks down the performance per haplotype within in sample.

AnalysisName - The name of the analysis, most commonly the target amplicon name
sample - The name of the control sample
mix - The name of the mixture for this sample
InputReadName - The name of the haplotype
HapPopUID - The population unique identify associated with this haplotype
HapSampleCount - The number of samples this HapPopUID appears in (can be helpful to see if it’s in one sample only or appears in multiple samples)
readCnt - The number of reads for this InputReadName
frac - The within sample frequency of this InputReadName
matchExpected - true if matches expected or false if it doesn’t
expectedRef - The number of the matching expected haplotype if matchExpected==true
expectedFrac - The expected frequency of the expected haplotype
MajorOrMinor - Whether or not the expected haplotype was the Major haplotype in the mixture (if it had the highest relative abundance, haps with same abundance that are both max are both labeled as major haplotype) or a Minor haplotype

falseHaplotypesComparedToExpected.tsv

A break down of all the haplotypes that didn’t match expected sequences compare to the expected sequences (will have a comparison for each expected so each haplotype has multiple rows)

AnalysisName - The name of the analysis, most commonly the target amplicon name
sample - The name of the control sample
mix - The name of the mixture for this sample
InputReadName - The name of the haplotype
readCnt - The number of reads for this InputReadName
frac - The within sample frequency of this InputReadName
RefName - The name of the expected haplotype being compared to
ExpectedRefFreq - Frequency of the expected haplotype being compared to
ExpectedMajorOrMinor - Whether the expected haplotype being compared to is the Major or Minor haplotype within the mixture
IdentityScore - The percent identity score to the expected haplotype being compared to
bestMatchScore - true/false for whether this identity score is the best match
mismatches - The number of mismatches
oneBaseIndels - The number of 1 base indels
twoBaseIndels - The number of 2 base indels
largeIndels - The number of >3 base indels
totalErrors - The total number of errors (mismatches + oneBaseIndels + twoBaseIndels + largeIndels)

falseHaplotypesComparedToExpected.tsv

Similar to falseHaplotypesComparedToExpected.tsv but instead of comparing to the expected haplotypes, it compares to the rest of the haplotypes within the same sample. this can be helpful to see if there are a bunch of 1-off from the major haplotype within the sample

AnalysisName - The name of the analysis, most commonly the target amplicon name
sample - The name of the control sample
mix - The name of the mixture for this sample
InputReadName - The name of the haplotype
readCnt - The number of reads for this InputReadName
frac - The within sample frequency of this InputReadName
OtherName - The name of the other haplotype within the sample being compared to
OtherReadCnt - The read count of the other haplotype within the sample being compared to
OtherFrac - The frequency of the other haplotype within the sample being compared to
ratio - The ratio in frequencies between the two haplotypes being compared (calculated OtherFrac/frac)
OtherMajor - Whether or not the other haplotype is the major haplotype in the sample (e.g. has the highest frequency )
OtherMatchesExpected - Whether the other haplotype matches an expected haplotype
OtherExpectedMatchName - If the other haplotype matches an expected haplotype
IdentityScore - The percent identity score to the expected haplotype being compared to
bestMatchScore - true/false for whether this identity score is the best match
mismatches - The number of mismatches
oneBaseIndels - The number of 1 base indels
twoBaseIndels - The number of 2 base indels
largeIndels - The number of >3 base indels
totalErrors - The total number of errors (mismatches + oneBaseIndels + twoBaseIndels + largeIndels)

--- title: "SeekDeep control mixture benchmarking" --- <script> $(document).ready(function() { document.querySelectorAll('.downloadLink').forEach(function(e) { e.setAttribute('download', e.text); }); document.querySelectorAll('.downloadLink').forEach(function(e) { e.innerHTML = '<i class="fa fa-download"></i> ' + e.text; }); }); </script> ```{r setup, echo=FALSE, message=FALSE} source("../common.R") ``` # Utilites for benchmarking performance on control mixtures When running targeted amplicon analysis it's helpful to use known control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiments designs, etc. For this purpose several utilities were created and added to `SeekDeep` to help evaluate performance on several common metrics which will be explained below. The three utilities, which only different slightly in input are: 1. `SeekDeep benchmarkTarAmpControlMixtures` - Benchmark a single amplicon target 1. `SeekDeep benchmarkMultiTarAmpControlMixtures` - Benchmark several amplicon targets at once 1. `SeekDeep benchmarkControlMixturesOnProcessedClustersDir` - Benchmark on the output of the `SeekDeep processClusters` results dir, which can be either single or multiple target ## Shared flags ### Required Requirement to put in a file that specifices which samples are control mixtures and what mixtures they contain, and a separator file that contains what strains and at what relative abundance they are in the mixture. * **\-\-sampleToMixture** - Sample To Mixture, 2 columns 1)sample, 2)MixName * **\-\-mixtureSetUp** - Mixture Set Up, 3 columns 1)MixName, 2)strain, 3)relative_abundance * The relative_abundance column can be any numbers and the frequencies will be calculated based off the sum, e.g. if there were 3 strains each with a relative abundance of 1 in this column then the expected frequencies would be 33.3% for all three strains, #### --sampleToMixture example ```{r} sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv") create_dt(sampleToMixture) ``` ```{r} #| results: asis #| echo: false cat(createDownloadLink("benchmarking/samplesToMixFnp.tsv")) ``` #### --mixtureSetUp example ```{r} mixSetUp = readr::read_tsv("benchmarking/mixSetUpFnp.tsv") create_dt(mixSetUp) ``` ```{r} #| results: asis #| echo: false cat(createDownloadLink("benchmarking/mixSetUpFnp.tsv")) ``` ### Optional * **\-\-skipMissingSamples** - By default if a control sample is missing the program will error out in case leaving out the control sample was a mistake, but sometimes a sample doesn't get enough reads in analysis so might be missing for that reason, for those cases can use this flag to simply skip missing samples rather than erroring out * **\-\-fillInMissingSamples** - If using `--skipMissingSamples` then you can add this flag to fill in the table with just 0's for all metrics * **\-\-metaFnp** - you can optionally add any associated meta data with the same to the output, input should be a tab deliminted file with at least 1 column titled **sample** and this column's values match up with the sample column in `--sampleToMixture` * **\-\-newColumnName** - can optionally add one column value to all outputs, can be multiple values separated by commas, this will be the column names, `--newColumnElement` will provide the actual values, e.g. `--newColumnName Program,Technology --newColumnElement SeekDeep,Illumina` etc * **\-\-newColumnElement** - The elements to fill the new columns if adding any via --newColumnName ## `SeekDeep benchmarkTarAmpControlMixtures` {#SeekDeep-benchmarkTarAmpControlMixtures} This is for benchmarking a singular amplicon target analysis, see [below](#SeekDeep-benchmarkMultiTarAmpControlMixtures) ### Required * **\-\-resultsFnp** - results tab delimited file, each row is a haplotype, should have at least 4 columns, 1) sample (--sampleColName), 2)within sample freq (--withinSampleFreqColName), 3)within sample read count (--withinSampleReadCntColName), 4)haplotype pop ID (--popHapIdColName), optionally 4th col with hap sequence (--popHapSeqColName) or read in from --popSeqsFnp * Column names in the results file can be controlled by the following flags, the defaults are common columns used in `SeekDeep` outputs: * **\-\-sampleColName** - The column name that contains the sample names (default:s_Sample) * **\-\-withinSampleFreqColName** - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac) * **\-\-withinSampleReadCntColName** - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt) * **\-\-popHapIdColName** - The column name that contains the unique population identifier for this haplotype (default:h_popUID) * **\-\-popHapSeqColName** - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if `--popSeqsFnp` is provided * **\-\-popSeqsFnp** - if the sequences for the results aren't within the file they can be added by pointing to a fasta file via `--popSeqsFnp`, fasta record names should match **\-\-popHapIdColName** * **\-\-expectedSeqsFnp** - A fasta file with the expected haplotype for this target, the name of the fasta record should match the `strain` column in the **\-\-mixtureSetUp** table, if a strain is completely missing a target a stand fasta record can be put in and it's sequence as all **N**s, this will indicate that mixtures with this strain won't ever be able to detect strain given there's no sequence to detect * **\-\-name** - The name to be given to this analysis Examples ```{bash, eval = F} SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt ``` ## `SeekDeep benchmarkMultiTarAmpControlMixtures` {#SeekDeep-benchmarkMultiTarAmpControlMixtures} Very similar to [`SeekDeep benchmarkTarAmpControlMixtures`](#SeekDeep-benchmarkTarAmpControlMixtures), the only difference are that there is an additional required column to indicate target and the input expected/observed inputs are directories within which should be a fasta file with TARGET_NAME.fasta * **\-\-resultsFnp** - results tab delimited file, each row is a haplotype, should have at least 5 columns, 1) sample (--sampleColName), 2)within sample freq (--withinSampleFreqColName), 3)within sample read count (--withinSampleReadCntColName), 4)haplotype pop ID (--popHapIdColName), 5)target name column (--targetNameColName), optionally 4th col with hap sequence (--popHapSeqColName) or read in from --popSeqsDirFnp * Column names in the results file can be controlled by the following flags, the defaults are common columns used in `SeekDeep` outputs: * **\-\-sampleColName** - The column name that contains the sample names (default:s_Sample) * **\-\-withinSampleFreqColName** - The column name that contains the within sample frequency for the haplotype (should be between 0-1) (default:c_AveragedFrac) * **\-\-withinSampleReadCntColName** - The column name that contains the within sample read count for the haplotype, (can be any numeric value) (default:c_ReadCnt) * **\-\-popHapIdColName** - The column name that contains the unique population identifier for this haplotype (default:h_popUID) * **\-\-targetNameColName** - The column name that contains the target amplicon name (default:p_name) * **\-\-popHapSeqColName** - The column name that contains sequence of the haplotype (default:h_popUID), this will be ignored if `--popSeqsDirFnp` is provided * **\-\-popSeqsDirFnp** - if the sequences for the results aren't within the results file they can be added by pointing to a directory of fasta files via `--popSeqsDirFnp`, fasta files should be named TARGET_NAME.fasta and fasta record names should match **\-\-popHapIdColName** * **\-\-expectedSeqsDirFnp** - A directory of fasta files with the expected haplotype for each target, named TARGET_NAME.fasta, the name of the fasta record should match the `strain` column in the **\-\-mixtureSetUp** table, if a strain is completely missing a target a stand fasta record can be put in and it's sequence as all **N**s, this will indicate that mixtures with this strain won't ever be able to detect strain given there's no sequence to detect Can also set up different mixture expectation per target for when the mixture setup is more complex than above with simple lab strains, this is done by adding the **\-\-targetNameColName** to the **\-\-sampleToMixture** table and **\-\-mixtureSetUp** table, like below: ```{r} allFieldMixtureInfo = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv") create_dt(allFieldMixtureInfo) allFieldSampNameToMixName = readr::read_tsv("benchmarking/allFieldMixtureInfo.tsv") create_dt(allFieldSampNameToMixName) ``` ```{r} #| results: asis #| echo: false cat(createDownloadLink("benchmarking/allFieldMixtureInfo.tsv")) cat(" ") cat(createDownloadLink("benchmarking/allFieldMixtureInfo.tsv")) ``` Examples ```{bash, eval = F} SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program ``` # Outputs All three programs will have the same output. Output files: columns explained below * performancePerTarget.tsv - Sum of performances per sample per target, * classifiedHaplotypes.tsv - Each haplotype in the control samples classified as either matching expected or being a false haplotype, and associated information * falseHaplotypesComparedToExpected.tsv - All the false haplotypes compare to the expected haplotypes with information about how many errors off from expected it is * falseHaplotypesComparedToOthers.tsv - All the false haplotypes compare to the all other haplotypes for this target in the population with information about how closely it matches and what the difference in frequencies is * missingSamples.txt - If `--skipMissingSamples` this will have the samples that were missing * samplesToMix.tsv - This is a copy of the input `--sampleToMixture` input, copied over to help with downstream analysis * mixSetUps.tsv - This has the information on the mixtures `--mixtureSetUp` input for the mixtures found within `--sampleToMixture` * process.qmd - A quarto document will some R code for some basic plots of the performance reported as a starting point to modify to add more specific code ## performancePerTarget.tsv Summation of performance each sample per target. * **AnalysisName** - The name of the analysis, most commonly the target amplicon name * **sample** - The name of the control sample * **mix** - The name of the mixture for this sample * **totalReads** - The total number of reads for this sample * **recoveredExpectedHaps** - The number of haplotypes for this sample that matched an expected haplotype * **falseHaps** - The number of haplotypes for this samples that don't match an expected haplotype * **totalHaps** - The total number of final haplotypes for this sample (`recoveredExpectedHaps + falseHaps`) * **totalExpectedHaps** - The total number of haplotypes that were expected for the mixture for this sample * **expectedHapRecoveryRate** - The recovery rate of the expected haplotypes (calculated **recoveredExpectedHaps**/**totalExpectedHaps**) * **falseHapsRate** - The rate of false haplotypes for this sample (calculated **falseHaps**/**totalHaps**) * **expectedMissing** - If there are any expected haplotypes that are missing, will be reported here, semi-colon(;) separated * **RMSE** - The root mean squared error for this sample, (calculated by taking the sum of squares of the differences of expected frequencies vs the observed frequencies of the recovered haplotypes and then taking the square root of the mean of that sum), the closer to zero the closer overall the mixture's expected frequencies were predicted[explanation here](https://en.wikipedia.org/wiki/Root-mean-square_deviation) ## classifiedHaplotypes.tsv This breaks down the performance per haplotype within in sample. * **AnalysisName** - The name of the analysis, most commonly the target amplicon name * **sample** - The name of the control sample * **mix** - The name of the mixture for this sample * **InputReadName** - The name of the haplotype * **HapPopUID** - The population unique identify associated with this haplotype * **HapSampleCount** - The number of samples this **HapPopUID** appears in (can be helpful to see if it's in one sample only or appears in multiple samples) * **readCnt** - The number of reads for this **InputReadName** * **frac** - The within sample frequency of this **InputReadName** * **matchExpected** - **true** if matches expected or **false** if it doesn't * **expectedRef** - The number of the matching expected haplotype if **matchExpected**==true * **expectedFrac** - The expected frequency of the expected haplotype * **MajorOrMinor** - Whether or not the expected haplotype was the Major haplotype in the mixture (if it had the highest relative abundance, haps with same abundance that are both max are both labeled as major haplotype) or a Minor haplotype ## falseHaplotypesComparedToExpected.tsv A break down of all the haplotypes that didn't match expected sequences compare to the expected sequences (will have a comparison for each expected so each haplotype has multiple rows) * **AnalysisName** - The name of the analysis, most commonly the target amplicon name * **sample** - The name of the control sample * **mix** - The name of the mixture for this sample * **InputReadName** - The name of the haplotype * **readCnt** - The number of reads for this **InputReadName** * **frac** - The within sample frequency of this **InputReadName** * **RefName** - The name of the expected haplotype being compared to * **ExpectedRefFreq** - Frequency of the expected haplotype being compared to * **ExpectedMajorOrMinor** - Whether the expected haplotype being compared to is the Major or Minor haplotype within the mixture * **IdentityScore** - The percent identity score to the expected haplotype being compared to * **bestMatchScore** - true/false for whether this identity score is the best match * **mismatches** - The number of mismatches * **oneBaseIndels** - The number of 1 base indels * **twoBaseIndels** - The number of 2 base indels * **largeIndels** - The number of >3 base indels * **totalErrors** - The total number of errors (`mismatches + oneBaseIndels + twoBaseIndels + largeIndels`) ## falseHaplotypesComparedToExpected.tsv Similar to **falseHaplotypesComparedToExpected.tsv** but instead of comparing to the expected haplotypes, it compares to the rest of the haplotypes within the same sample. this can be helpful to see if there are a bunch of 1-off from the major haplotype within the sample * **AnalysisName** - The name of the analysis, most commonly the target amplicon name * **sample** - The name of the control sample * **mix** - The name of the mixture for this sample * **InputReadName** - The name of the haplotype * **readCnt** - The number of reads for this **InputReadName** * **frac** - The within sample frequency of this **InputReadName** * **OtherName** - The name of the other haplotype within the sample being compared to * **OtherReadCnt** - The read count of the other haplotype within the sample being compared to * **OtherFrac** - The frequency of the other haplotype within the sample being compared to * **ratio** - The ratio in frequencies between the two haplotypes being compared (calculated **OtherFrac**/**frac**) * **OtherMajor** - Whether or not the other haplotype is the major haplotype in the sample (e.g. has the highest frequency ) * **OtherMatchesExpected** - Whether the other haplotype matches an expected haplotype * **OtherExpectedMatchName** - If the other haplotype matches an expected haplotype * **IdentityScore** - The percent identity score to the expected haplotype being compared to * **bestMatchScore** - true/false for whether this identity score is the best match * **mismatches** - The number of mismatches * **oneBaseIndels** - The number of 1 base indels * **twoBaseIndels** - The number of 2 base indels * **largeIndels** - The number of >3 base indels * **totalErrors** - The total number of errors (`mismatches + oneBaseIndels + twoBaseIndels + largeIndels`) {{< fa dna >}}