Code
SeekDeep processClusters
processClusters takes care of several of the final steps of the SeekDeep pipeline.
Just typing the name of the program will give a help message on running the program
SeekDeep processClusters
Ran from inside directory tree set up such that currentDir/SampleDir/RunDir/seqFiles
Required Options
Clustering
--par ParametersFileName; required; default=; ([38;5;202mstd::string)
Reading Sequence Input
--fasta Input sequence filename, only need 1, fasta text file; required; default=; ([38;5;202mboost::filesystem::path)
--fastagz Input sequence filename, only need 1, fasta gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
--fastq Input sequence filename, only need 1, fastq text file; required; default=; ([38;5;202mboost::filesystem::path)
--fastqgz Input sequence filename, only need 1, fastq gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
Optional Options
Additional Output
--extra Extra Output; default=false; ([38;5;202mbool)
--writeExcludedOriginals Write out the excluded Originals; default=false; ([38;5;202mbool)
Alignment
...
...
Also calling -help
will do the same
Also all flags in SeekDeep are case insensitive and so all the following would have the same results
The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory. An example directory tree below
../extraFiles/sampleClusteringDir
├── Sample01
│ ├── MID01
│ │ └── output.fastq
│ └── MID02
│ └── output.fastq
├── Sample02
│ ├── MID03
│ │ └── output.fastq
│ └── MID04
│ └── output.fastq
├── Sample03
│ ├── MID05
│ │ └── output.fastq
│ └── MID06
│ └── output.fastq
└── Sample04
├── MID07
│ └── output.fastq
└── MID08
└── output.fastq
Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster. Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster.
An output directory will be created to store the output of processClusters, the name of which can be changed by using the -dout
flag. In this directory will be several files and directories.
All the data from the analysis done by processClusters is located in this file. It is a very large file and can be overwhelming at first but all data is conveniently located in one place. How the table is organized is each row represents a haplotype from a sample. Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population. At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample. Also to define some terms to make sense of the column heading. Each sample will have a one or more replicates, these are normally PCR replicates. Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample. The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1. After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples. This population analysis will result will what will be called haplotypes. So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one. Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number.
processClusters needs only two options to run, name of the results files given by --fastq
or --fasta
in the directory tree (see above) and a parameters file --par
see qluster usage for explanation of parameters file. Alternately to the --par
you can use either --noErrors
to allow for now errors to collapse between comparisons or --strictErrors
to allow for just a few low quality mismatches.
Input reads can be fasta or fastq
If you want to give different parameters to do the population clustering with just use the ’-popPar` flag
To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to. See below for an example file
samples age location
Sample01 child loc1
Sample02 child loc2
Sample03 adult loc1
Sample04 adult loc2
Just give this file name to the --groups
flag
This will create a directory called groups in the processClusters output directory. This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like selectedClusters.tab.txt) and a file called popFile.tab.txt (set up like populationCluster.tab.txt) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data. This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group.
To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the --ref
or --refFastq
flag. The --ref
flag is for when the sequences are in a fasta file and --refFastq
is for when the seq file is a fastq file.
By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment
By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn’t be done in the experiment. If you’re data is especially noisy or you want to be more stringent in your cut off use the --fracCutOff
flag, for example the filter off all clusters at 1% and lower use the below example
By default all singletons clusters are thrown out before analysis is done per replicate. To change this behavior use the --clusterCutOff
flag
By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis. To keep chimeras use the --keepChimera
flag. Also marking clusters can be done with SeekDeep qluster but can also be done by processClusters by using the same flags as qluster to make them, --markChimeras
and --parFreqs
see SeekDeep qluster for an explanation of these flags.
Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the --investigateChimeras
flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample. The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it’s not possible to have two parents that are greater than it is.
SeekDeep processClusters
also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%.
The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using --lowFreqMultiplier
SeekDeep processClusters
also offers the ability to compare the final results to a reference genome(–genomeFnp) and associated annotation file(–gffFnp) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(–knownAminoAcidChangesFnp) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv, should have a column of geneID that can be found in –gffFnp and 1-based position of the amino acid of interest.
SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv
The various variant calling results will be found in the output directory (above analysis) in a folder called variantCalling
:::{.callout-note}
# processClusters's objectives
processClusters takes care of several of the final steps of the SeekDeep pipeline.
* **replicate comparison** - If multiple replicates are available per sample it is able to take advantage of this and take only clusters that appear in all replicates
* **final results filtering** - A filter is applied to the final results where low frequency clusters and singlet clusters are removed. This step is applied to get rid of artifact sequence that couldn't be handled by previous extraction and clustering steps and removal of clusters marked as chimeric.
* **population comparison** - Final population comparison of all sample final results is done to analyze haplotypes across samples to get haplotypes of the whole population
* **variant and translation** - Comparison to reference genome to call variants against reference and to translate final haplotypes
:::
# Getting usage command line
Just typing the name of the program will give a help message on running the program
```{r, engine='bash',comment="",eval=FALSE}
SeekDeep processClusters
```
```{r, engine='bash',comment="",highlight=TRUE, echo=FALSE}
SeekDeep processClusters | gsed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | head -15
echo ...
echo ...
```
Also calling `-help` will do the same
```{r, engine='bash',comment="",eval=FALSE}
SeekDeep processClusters --help
```
Also all flags in SeekDeep are case insensitive and so all the following would have the same results
```{r, engine='bash',comment="",eval=FALSE}
SeekDeep processClusters --help
SeekDeep processClusters --HELP
SeekDeep processClusters --HeLP
SeekDeep processClusters --HeLp
SeekDeep PROCESSCLUSTERS --HeLP
SeekDeep processclusters --HeLp
SeekDeep ProcessClusters --HeLp
```
```{r, engine='bash',comment="",eval=FALSE, echo=FALSE}
SeekDeep processClusters --getFlags
```
```{r, engine='bash',comment="",highlight=TRUE, eval=FALSE, echo=FALSE}
SeekDeep processClusters --getFlags | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" | head -20
echo ...
echo ...
```
# Input
The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory. An example directory tree below
```{r, engine='bash',comment="",highlight=TRUE, eval=T, echo=FALSE}
tree -n ../extraFiles/sampleClusteringDir | head -21
```
Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster. Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster.
# Output files
An output directory will be created to store the output of processClusters, the name of which can be changed by using the `-dout` flag. In this directory will be several files and directories.
* **selectedClustersInfo.tab.txt** - This is a very table containing all information of replicate comparison, final samples haplotypes, and any population information if population clustering has been done
* **final** - This directory will contain a sequence file for each sample that was in the clustering and will be the final haplotypes for the analysis. Each sequence in the sequence file will have a suffix in the format _f[NUM] where [NUM] is a relative abundance in the sample so _f1.0 would be 100% of the sample and _f.50 would be a haplotype that took up about 50% of the sample.
* **sampRepAgreementInfo** - This folder contains several files that report how well replicates agree including a file containing graphs to show this
If population clustering was enabled
* **population** - This directory will be create if population clustering is enable and will contain information about the population of haplotypes found in the samples
* **population/populationCluster.tab.txt** - This will contain information about how many samples the population haplotypes appear in.
* **population/PopUID.fastq** - This will contain the sequences for all the haplotypes found in the population, they will have have a suffix with _f[NUM] where [NUM] is the average relative abundance it was found at. For example if the haplotype was found in 3 different sample at relative abundances of .5, .6, and .7 the suffix here would be _f0.6
If groupings info was supplied
* **groups** - This directory will be created if grouping information is supplied, see [below](#supplying-grouping-info) for details.
## Column explaination for selectedClustersInfo.tab.txt
All the data from the analysis done by processClusters is located in this file. It is a very large file and can be overwhelming at first but all data is conveniently located in one place. How the table is organized is each row represents a haplotype from a sample. Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population. At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample. Also to define some terms to make sense of the column heading. Each sample will have a one or more replicates, these are normally PCR replicates. Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample. The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1. After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples. This population analysis will result will what will be called haplotypes. So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one. Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number.
* **s_Sample** - The name of the sample the haplotype appears in
* **h_popUID** - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
* **p_TotalPopulationSampCnt** - This the the total number of samples in the population analysis
***Haplotype Info***
* **h_PopFrac** - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.
* **h_SumOfAllFracs** - This is the sum of all the relative abundances the haplotype was found at
* **h_AvgFracFoundAt** - This is the average relative abundance the haplotype was found at
* **h_ReadFrac** - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
* **h_SampCnt** - This is the number of samples the haplotype was found in.
* **h_SampFrac** - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
* **h_ReadCnt** - This is the total number of reads associated with this haplotype
* **h_ClusterCnt** - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.
* **h_clusterNames** - This is the names of the contributing clusters
***Sample Info***
* **s_Name** - The name of the sample the haplotype appears in
* **s_ReadCntTotUsed** - This is the number of reads used out of this sample given with percentage of the total number of reads given to processClusters, the other percentage would be what was filtered off (low frequency and chimeras)
* **s_InputClusterCnt** - The total number of clusters at the beginning of the analysis for this sample
* **s_FinalClusterCnt** - The total number of clusters after analysis
***Cluster Info***
* **c_clusterID** - The id given to the cluster in the sample, the lower the number the higher the relative abundance in the sample
* **c_AveragedFrac** - The averaged relative abundance between all replicates used in the analysis (this basically an un-weighted average of the abundances since it wouldn't depend on read counts)
* **c_ReadFrac** - The number of reads associated with this cluster divided by the total number of reads found in all replicates, if there is only one replicate this number will be the same as c_AveragedFrac.
* **c_ReadCnt** - The total number of reads associated with this cluster summed over all replicates
* **c_RepCnt** - The number of reps this cluster was found in
* **c_Consensus** - The consensus sequence for this cluster
* **c_InputCnt** - The number of contributing replicate clusters to this sample cluster
* **c_ChiReadCnt** - The number of reads for the replicate clusters that are marked chimeric
* **c_ChiClusCnt** - The number of replicate clusters that are marked as chimeric
* **c_ChiRepCnt** - The number of replicated this current sample was marked as chimeric
* **c_InputNames** - The names of the replicate clusters that contributed to making this sample cluster
* **RXX_name** - The number of the replicate
* **RXX_totalClusterCntExcluded** - The total number of replicate clusters that were excluded from this replicate
* **RXX_totalCntExcluded** - The total number of reads that were excluded from the replicate
* **RXX_totalFracExcluded** - The fraction of reads excluded from this replicate
* **RXX_clusterCntChiExcluded** - The number of clusters that were marked chimeric in this replicate
* **RXX_cntChiExcluded** - The total read count marked as chimeric in this replicate
* **RXX_fracChiExcluded** - The fraction of the replicate that was excluded due to chimeric formation
* **RXX_MapFrac** - The relative abundance of this replicate cluster in this replicate
* **RXX_ReadCnt** - The total number of reads this replicate cluster has
* **RXX_ClusCnt** - The total number of clusters in this replicate
* **RXX_totalReadCnt** - The total number of reads in this replicate
* **c_bestExpected** - A string with the name of the closest reference sequence see [below](#supplying-expected-reference-sequences) to see how to compare against expected references
## Column explaination for populationCluster.tab.txt
* **h_popUID** - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
* **p_TotalInputReadCnt** - The total number of reads in the whole experiment/population
* **p_TotalInputClusterCnt** - The total number of clusters in the whole experiment/population
* **p_TotalPopulationSampCnt** - The total number of samples in the whole experiment/population
* **p_TotalHaplotypes** - The total number of final haplotypes in the whole experiment/population
* **p_meanCoi** - The mean complexity of infection (COI) in the whole experiment/population
* **p_medianCoi** - The median complexity of infection (COI) in the whole experiment/population
* **p_minCoi** - The minimum complexity of infection (COI) in the whole experiment/population
* **p_maxCoi** - The max complexity of infection (COI) in the whole experiment/population
* **h_PopFrac** - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.
* **h_SumOfAllFracs** - This is the sum of all the relative abundances the haplotype was found at
* **h_AvgFracFoundAt** - This is the average relative abundance the haplotype was found at
* **h_ReadFrac** - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
* **h_SampCnt** - This is the number of samples the haplotype was found in.
* **h_SampFrac** - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
* **h_ReadCnt** - This is the total number of reads associated with this haplotype
* **h_ClusterCnt** - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.
* **h_clusterNames** - This is the names of the contributing clusters
* **h_Consensus** - The consensus sequence of all contributing clusters to this haplotype
* **h_Protein** - An estimated protein, will only be correct if the reading frame starts at 0
* **c_bestExpected** - A string with the name of the closest reference sequence see [below](#supplying-expected-reference-sequences) to see how to compare against expected references
# Examples
## Running
processClusters needs only two options to run, name of the results files given by `--fastq` or `--fasta` in the directory tree (see [above](#input)) and a parameters file `--par` see [qluster usage](qluster_usage.html##format-of-paramters-files) for explanation of parameters file. Alternately to the `--par` you can use either `--noErrors` to allow for now errors to collapse between comparisons or `--strictErrors` to allow for just a few low quality mismatches.
## Input results file format
Input reads can be fasta or fastq
### Fastq
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt
```
### Fasta
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fasta output.fasta --par pars.txt
```
## Pars
### Just do a simple match allowing no differences
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --noErrors
```
### Allowing only 1 high quality mismatch
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --noErrors --hq 1
```
### Allowing a few low quality mismtaches and some 1 base indels (good for 454 and Ion Torrent)
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --strictErrors
```
### Allowing a few low quality mismtaches and but no indels (illumina)
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --strictErrors --illumina
```
### Allowing a few low quality mismtaches and but no indels (illumina) and 1 high quality mismatch
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --strictErrors --illumina --hq 1
```
## Different Population Parameters
If you want to give different parameters to do the population clustering with just use the '-popPar` flag
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt --popPar popPars.txt
```
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors
```
## Supplying grouping info
To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to. See below for an example file
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
cat exampleGroups.tab.txt
```
```{r, engine='bash',comment="",highlight=TRUE, eval=T, echo=F}
cat ../extraFiles/exampleGroups.tab.txt | column -t
```
Just give this file name to the `--groups` flag
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt --groups exampleGroups.tab.txt
```
This will create a directory called **groups** in the processClusters output directory. This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like [selectedClusters.tab.txt](#column-explaination-for-selectedclustersinfo.tab.txt)) and a file called popFile.tab.txt (set up like [populationCluster.tab.txt](#column-explaination-forcluster.tab.txt)) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data. This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group.
## Supplying expected reference sequences
To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the `--ref` or `--refFastq` flag. The `--ref` flag is for when the sequences are in a fasta file and `--refFastq` is for when the seq file is a fastq file.
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt --ref references.fasta
#or fastq
SeekDeep processClusters --fastq output.fastq --par pars.txt --refFastq references.fastq
```
## Supplying experiment name
By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt --experimentName myExperiment
```
## Giving a relative abundance cut off
By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn't be done in the experiment. If you're data is especially noisy or you want to be more stringent in your cut off use the `--fracCutOff` flag, for example the filter off all clusters at 1% and lower use the below example
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt --fracCutOff 0.01
```
## Removing singletons
By default all singletons clusters are thrown out before analysis is done per replicate. To change this behavior use the `--clusterCutOff` flag
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
#to keep singletons
SeekDeep processClusters --fastq output.fastq --par pars.txt --clusterCutOff 0
#to remove clusters of size 2 and less
SeekDeep processClusters --fastq output.fastq --par pars.txt --clusterCutOff 2
```
## Marking Chimeras and checking Chimeras
By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis. To keep chimeras use the `--keepChimera` flag. Also marking clusters can be done with [SeekDeep qluster](qluster_usage.html#marking-possible-chimeric-sequence) but can also be done by processClusters by using the same flags as qluster to make them, `--markChimeras` and `--parFreqs` see [SeekDeep qluster](qluster_usage.html#marking-possible-chimeric-sequence) for an explanation of these flags.
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
#keep chimeras
SeekDeep processClusters --fastq output.fastq --par pars.txt --keepChimeras
#mark chimeras when they have parents that are greater than 3x in frequency
SeekDeep processClusters --fastq output.fastq --par pars.txt --makrChimeras --parFreqs 3
```
Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the `--investigateChimeras` flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample. The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it's not possible to have two parents that are greater than it is.
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --par pars.txt --investigateChimeras
```
## Extra cleanup
`SeekDeep processClusters` also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%.
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs
```
The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using `--lowFreqMultiplier`
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs --lowFreqMultiplier 40
```
## Variation calling and translation
`SeekDeep processClusters` also offers the ability to compare the final results to a reference genome(**--genomeFnp**) and associated annotation file(**--gffFnp**) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(**--knownAminoAcidChangesFnp**) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, <https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv>, should have a column of geneID that can be found in **--gffFnp** and 1-based position of the amino acid of interest.
```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv
```
The various variant calling results will be found in the output directory (above **analysis**) in a folder called **variantCalling**