SeekDeep
  • Home
  • Installing
    • Mac OS
    • Ubuntu
    • Windows
    • Vagrant/virtual image (any system)
  • Code
    • Github
  • Usages

    • Pipeline
    • extractor/extractorPairedEnd
    • makeSampleDirectories
    • qluster
    • processClusters
    • popClusteringViewer

    • Pipeline Wrapper
    • setupTarAmpAnalysis

    • Utilities
    • genTargetInfoFromGenomes
    • SeekDeep control mixture benchmarking
    • SeekDeep Variant Calling
  • Tutorials

    • Single End
    • Ion Torrent with MIDs

    • Illumina Paired End
    • Paired End No MIDs/Barcodes
    • Paired End With MIDs/Barcodes
  • Misc Info
    • Illumina Paired End Info
  • References
    • versions
    • References

    Contents

    • Getting usage command line
    • Input
    • Output files
      • Column explaination for selectedClustersInfo.tab.txt
      • Column explaination for populationCluster.tab.txt
    • Examples
      • Running
      • Input results file format
        • Fastq
        • Fasta
      • Pars
        • Just do a simple match allowing no differences
        • Allowing only 1 high quality mismatch
        • Allowing a few low quality mismtaches and some 1 base indels (good for 454 and Ion Torrent)
        • Allowing a few low quality mismtaches and but no indels (illumina)
        • Allowing a few low quality mismtaches and but no indels (illumina) and 1 high quality mismatch
      • Different Population Parameters
      • Supplying grouping info
      • Supplying expected reference sequences
      • Supplying experiment name
      • Giving a relative abundance cut off
      • Removing singletons
      • Marking Chimeras and checking Chimeras
      • Extra cleanup
      • Variation calling and translation

    • Show All Code
    • Hide All Code

    • View Source
    processClusters’s objectives

    processClusters takes care of several of the final steps of the SeekDeep pipeline.

    • replicate comparison - If multiple replicates are available per sample it is able to take advantage of this and take only clusters that appear in all replicates
    • final results filtering - A filter is applied to the final results where low frequency clusters and singlet clusters are removed. This step is applied to get rid of artifact sequence that couldn’t be handled by previous extraction and clustering steps and removal of clusters marked as chimeric.
    • population comparison - Final population comparison of all sample final results is done to analyze haplotypes across samples to get haplotypes of the whole population
    • variant and translation - Comparison to reference genome to call variants against reference and to translate final haplotypes

    Getting usage command line

    Just typing the name of the program will give a help message on running the program

    Code
    SeekDeep processClusters 
    SeekDeep processClusters
    Ran from inside directory tree set up such that currentDir/SampleDir/RunDir/seqFiles
                                     Required Options
                                        Clustering
                                      --par ParametersFileName; required; default=; ([38;5;202mstd::string)
                                  Reading Sequence Input
                                    --fasta Input sequence filename, only need 1, fasta text file; required; default=; ([38;5;202mboost::filesystem::path)
                                  --fastagz Input sequence filename, only need 1, fasta gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
                                    --fastq Input sequence filename, only need 1, fastq text file; required; default=; ([38;5;202mboost::filesystem::path)
                                  --fastqgz Input sequence filename, only need 1, fastq gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
                                     Optional Options
                                    Additional Output
                                    --extra Extra Output; default=false; ([38;5;202mbool)
                   --writeExcludedOriginals Write out the excluded Originals; default=false; ([38;5;202mbool)
                                        Alignment
    ...
    ...

    Also calling -help will do the same

    Code
    SeekDeep processClusters --help 

    Also all flags in SeekDeep are case insensitive and so all the following would have the same results

    Code
    SeekDeep processClusters --help 
    SeekDeep processClusters --HELP
    SeekDeep processClusters --HeLP
    SeekDeep processClusters --HeLp
    SeekDeep PROCESSCLUSTERS --HeLP
    SeekDeep processclusters --HeLp
    SeekDeep ProcessClusters --HeLp

    Input

    The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory. An example directory tree below

    ../extraFiles/sampleClusteringDir
    ├── Sample01
    │   ├── MID01
    │   │   └── output.fastq
    │   └── MID02
    │       └── output.fastq
    ├── Sample02
    │   ├── MID03
    │   │   └── output.fastq
    │   └── MID04
    │       └── output.fastq
    ├── Sample03
    │   ├── MID05
    │   │   └── output.fastq
    │   └── MID06
    │       └── output.fastq
    └── Sample04
        ├── MID07
        │   └── output.fastq
        └── MID08
            └── output.fastq

    Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster. Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster.

    Output files

    An output directory will be created to store the output of processClusters, the name of which can be changed by using the -dout flag. In this directory will be several files and directories.

    • selectedClustersInfo.tab.txt - This is a very table containing all information of replicate comparison, final samples haplotypes, and any population information if population clustering has been done
    • final - This directory will contain a sequence file for each sample that was in the clustering and will be the final haplotypes for the analysis. Each sequence in the sequence file will have a suffix in the format _f[NUM] where [NUM] is a relative abundance in the sample so _f1.0 would be 100% of the sample and _f.50 would be a haplotype that took up about 50% of the sample.
    • sampRepAgreementInfo - This folder contains several files that report how well replicates agree including a file containing graphs to show this If population clustering was enabled
    • population - This directory will be create if population clustering is enable and will contain information about the population of haplotypes found in the samples
    • population/populationCluster.tab.txt - This will contain information about how many samples the population haplotypes appear in.
    • population/PopUID.fastq - This will contain the sequences for all the haplotypes found in the population, they will have have a suffix with _f[NUM] where [NUM] is the average relative abundance it was found at. For example if the haplotype was found in 3 different sample at relative abundances of .5, .6, and .7 the suffix here would be _f0.6 If groupings info was supplied
    • groups - This directory will be created if grouping information is supplied, see below for details.

    Column explaination for selectedClustersInfo.tab.txt

    All the data from the analysis done by processClusters is located in this file. It is a very large file and can be overwhelming at first but all data is conveniently located in one place. How the table is organized is each row represents a haplotype from a sample. Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population. At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample. Also to define some terms to make sense of the column heading. Each sample will have a one or more replicates, these are normally PCR replicates. Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample. The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1. After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples. This population analysis will result will what will be called haplotypes. So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one. Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number.

    • s_Sample - The name of the sample the haplotype appears in
    • h_popUID - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
    • p_TotalPopulationSampCnt - This the the total number of samples in the population analysis
      Haplotype Info
    • h_PopFrac - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.
    • h_SumOfAllFracs - This is the sum of all the relative abundances the haplotype was found at
    • h_AvgFracFoundAt - This is the average relative abundance the haplotype was found at
    • h_ReadFrac - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
    • h_SampCnt - This is the number of samples the haplotype was found in.
    • h_SampFrac - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
    • h_ReadCnt - This is the total number of reads associated with this haplotype
    • h_ClusterCnt - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.
    • h_clusterNames - This is the names of the contributing clusters
      Sample Info
    • s_Name - The name of the sample the haplotype appears in
    • s_ReadCntTotUsed - This is the number of reads used out of this sample given with percentage of the total number of reads given to processClusters, the other percentage would be what was filtered off (low frequency and chimeras)
    • s_InputClusterCnt - The total number of clusters at the beginning of the analysis for this sample
    • s_FinalClusterCnt - The total number of clusters after analysis
      Cluster Info
    • c_clusterID - The id given to the cluster in the sample, the lower the number the higher the relative abundance in the sample
    • c_AveragedFrac - The averaged relative abundance between all replicates used in the analysis (this basically an un-weighted average of the abundances since it wouldn’t depend on read counts)
    • c_ReadFrac - The number of reads associated with this cluster divided by the total number of reads found in all replicates, if there is only one replicate this number will be the same as c_AveragedFrac.
    • c_ReadCnt - The total number of reads associated with this cluster summed over all replicates
    • c_RepCnt - The number of reps this cluster was found in
    • c_Consensus - The consensus sequence for this cluster
    • c_InputCnt - The number of contributing replicate clusters to this sample cluster
    • c_ChiReadCnt - The number of reads for the replicate clusters that are marked chimeric
    • c_ChiClusCnt - The number of replicate clusters that are marked as chimeric
    • c_ChiRepCnt - The number of replicated this current sample was marked as chimeric
    • c_InputNames - The names of the replicate clusters that contributed to making this sample cluster
    • RXX_name - The number of the replicate
    • RXX_totalClusterCntExcluded - The total number of replicate clusters that were excluded from this replicate
    • RXX_totalCntExcluded - The total number of reads that were excluded from the replicate
    • RXX_totalFracExcluded - The fraction of reads excluded from this replicate
    • RXX_clusterCntChiExcluded - The number of clusters that were marked chimeric in this replicate
    • RXX_cntChiExcluded - The total read count marked as chimeric in this replicate
    • RXX_fracChiExcluded - The fraction of the replicate that was excluded due to chimeric formation
    • RXX_MapFrac - The relative abundance of this replicate cluster in this replicate
    • RXX_ReadCnt - The total number of reads this replicate cluster has
    • RXX_ClusCnt - The total number of clusters in this replicate
    • RXX_totalReadCnt - The total number of reads in this replicate
    • c_bestExpected - A string with the name of the closest reference sequence see below to see how to compare against expected references

    Column explaination for populationCluster.tab.txt

    • h_popUID - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
    • p_TotalInputReadCnt - The total number of reads in the whole experiment/population
    • p_TotalInputClusterCnt - The total number of clusters in the whole experiment/population
    • p_TotalPopulationSampCnt - The total number of samples in the whole experiment/population
    • p_TotalHaplotypes - The total number of final haplotypes in the whole experiment/population
    • p_meanCoi - The mean complexity of infection (COI) in the whole experiment/population
    • p_medianCoi - The median complexity of infection (COI) in the whole experiment/population
    • p_minCoi - The minimum complexity of infection (COI) in the whole experiment/population
    • p_maxCoi - The max complexity of infection (COI) in the whole experiment/population
    • h_PopFrac - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.
    • h_SumOfAllFracs - This is the sum of all the relative abundances the haplotype was found at
    • h_AvgFracFoundAt - This is the average relative abundance the haplotype was found at
    • h_ReadFrac - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
    • h_SampCnt - This is the number of samples the haplotype was found in.
    • h_SampFrac - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
    • h_ReadCnt - This is the total number of reads associated with this haplotype
    • h_ClusterCnt - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.
    • h_clusterNames - This is the names of the contributing clusters
    • h_Consensus - The consensus sequence of all contributing clusters to this haplotype
    • h_Protein - An estimated protein, will only be correct if the reading frame starts at 0
    • c_bestExpected - A string with the name of the closest reference sequence see below to see how to compare against expected references

    Examples

    Running

    processClusters needs only two options to run, name of the results files given by --fastq or --fasta in the directory tree (see above) and a parameters file --par see qluster usage for explanation of parameters file. Alternately to the --par you can use either --noErrors to allow for now errors to collapse between comparisons or --strictErrors to allow for just a few low quality mismatches.

    Input results file format

    Input reads can be fasta or fastq

    Fastq

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt

    Fasta

    Code
    SeekDeep processClusters --fasta output.fasta --par pars.txt

    Pars

    Just do a simple match allowing no differences

    Code
    SeekDeep processClusters --fastq output.fastq --noErrors

    Allowing only 1 high quality mismatch

    Code
    SeekDeep processClusters --fastq output.fastq --noErrors --hq 1

    Allowing a few low quality mismtaches and some 1 base indels (good for 454 and Ion Torrent)

    Code
    SeekDeep processClusters --fastq output.fastq --strictErrors

    Allowing a few low quality mismtaches and but no indels (illumina)

    Code
    SeekDeep processClusters --fastq output.fastq --strictErrors --illumina

    Allowing a few low quality mismtaches and but no indels (illumina) and 1 high quality mismatch

    Code
    SeekDeep processClusters --fastq output.fastq --strictErrors --illumina --hq 1

    Different Population Parameters

    If you want to give different parameters to do the population clustering with just use the ’-popPar` flag

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --popPar popPars.txt
    Code
    SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors

    Supplying grouping info

    To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to. See below for an example file

    Code
    cat exampleGroups.tab.txt 
    samples   age    location
    Sample01  child  loc1
    Sample02  child  loc2
    Sample03  adult  loc1
    Sample04  adult  loc2

    Just give this file name to the --groups flag

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --groups exampleGroups.tab.txt

    This will create a directory called groups in the processClusters output directory. This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like selectedClusters.tab.txt) and a file called popFile.tab.txt (set up like populationCluster.tab.txt) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data. This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group.

    Supplying expected reference sequences

    To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the --ref or --refFastq flag. The --ref flag is for when the sequences are in a fasta file and --refFastq is for when the seq file is a fastq file.

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --ref references.fasta
    #or fastq
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --refFastq references.fastq

    Supplying experiment name

    By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --experimentName myExperiment

    Giving a relative abundance cut off

    By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn’t be done in the experiment. If you’re data is especially noisy or you want to be more stringent in your cut off use the --fracCutOff flag, for example the filter off all clusters at 1% and lower use the below example

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --fracCutOff 0.01

    Removing singletons

    By default all singletons clusters are thrown out before analysis is done per replicate. To change this behavior use the --clusterCutOff flag

    Code
    #to keep singletons 
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --clusterCutOff 0
    #to remove clusters of size 2 and less 
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --clusterCutOff 2

    Marking Chimeras and checking Chimeras

    By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis. To keep chimeras use the --keepChimera flag. Also marking clusters can be done with SeekDeep qluster but can also be done by processClusters by using the same flags as qluster to make them, --markChimeras and --parFreqs see SeekDeep qluster for an explanation of these flags.

    Code
    #keep chimeras 
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --keepChimeras
    #mark chimeras when they have parents that are greater than 3x in frequency  
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --makrChimeras --parFreqs 3

    Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the --investigateChimeras flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample. The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it’s not possible to have two parents that are greater than it is.

    Code
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --investigateChimeras

    Extra cleanup

    SeekDeep processClusters also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%.

    Code
    SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs

    The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using --lowFreqMultiplier

    Code
    SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs --lowFreqMultiplier 40

    Variation calling and translation

    SeekDeep processClusters also offers the ability to compare the final results to a reference genome(–genomeFnp) and associated annotation file(–gffFnp) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(–knownAminoAcidChangesFnp) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv, should have a column of geneID that can be found in –gffFnp and 1-based position of the amino acid of interest.

    Code
    SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv

    The various variant calling results will be found in the output directory (above analysis) in a folder called variantCalling

    Source Code
    :::{.callout-note}  
    # processClusters's objectives   
      
    processClusters takes care of several of the final steps of the SeekDeep pipeline.  
    
    *  **replicate comparison** - If multiple replicates are available per sample it is able to take advantage of this and take only clusters that appear in all replicates  
    *  **final results filtering** - A filter is applied to the final results where low frequency clusters and singlet clusters are removed.  This step is applied to get rid of artifact sequence that couldn't be handled by previous extraction and clustering steps and removal of clusters marked as chimeric.  
    *  **population comparison** - Final population comparison of all sample final results is done to analyze haplotypes across samples to get haplotypes of the whole population  
    *  **variant and translation** - Comparison to reference genome to call variants against reference and to translate final haplotypes 
    :::
    
    # Getting usage command line  
      
    Just typing the name of the program will give a help message on running the program  
    ```{r, engine='bash',comment="",eval=FALSE}
    SeekDeep processClusters 
    ```
    
    ```{r, engine='bash',comment="",highlight=TRUE, echo=FALSE}
    SeekDeep processClusters | gsed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | head -15
    echo ...
    echo ...
    
    ```
    Also calling `-help` will do the same 
    ```{r, engine='bash',comment="",eval=FALSE}
    SeekDeep processClusters --help 
    ```
    Also all flags in SeekDeep are case insensitive and so all the following would have the same results 
    ```{r, engine='bash',comment="",eval=FALSE}
    SeekDeep processClusters --help 
    SeekDeep processClusters --HELP
    SeekDeep processClusters --HeLP
    SeekDeep processClusters --HeLp
    SeekDeep PROCESSCLUSTERS --HeLP
    SeekDeep processclusters --HeLp
    SeekDeep ProcessClusters --HeLp
    ```
    
    ```{r, engine='bash',comment="",eval=FALSE, echo=FALSE}
    SeekDeep processClusters --getFlags
    ```
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=FALSE, echo=FALSE}
    SeekDeep processClusters --getFlags | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" | head -20
    echo ...
    echo ...
    
    ```
    
    
    # Input 
      
    The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory.  An example directory tree below  
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=T, echo=FALSE}
    tree -n ../extraFiles/sampleClusteringDir | head -21
    ```
    Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster.  Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster.  
    
    
    # Output files
      
    An output directory will be created to store the output of processClusters, the name of which can be changed by using the `-dout` flag.  In this directory will be several files and directories.
    
    * **selectedClustersInfo.tab.txt** - This is a very table containing all information of replicate comparison, final samples haplotypes, and any population information if population clustering has been done
    * **final** - This directory will contain a sequence file for each sample that was in the clustering and will be the final haplotypes for the analysis.  Each sequence in the sequence file will have a suffix in the format _f[NUM] where [NUM] is a relative abundance in the sample so _f1.0 would be 100% of the sample and _f.50 would be a haplotype that took up about 50% of the sample.
    * **sampRepAgreementInfo** - This folder contains several files that report how well replicates agree including a file containing graphs to show this
    If population clustering was enabled
    * **population** - This directory will be create if population clustering is enable and will contain information about the population of haplotypes found in the samples
    * **population/populationCluster.tab.txt** - This will contain information about how many samples the population haplotypes appear in.  
    * **population/PopUID.fastq** - This will contain the sequences for all the haplotypes found in the population, they will have have a suffix with _f[NUM] where [NUM] is the average relative abundance it was found at.  For example if the haplotype was found in 3 different sample at relative abundances of .5, .6, and .7 the suffix here would be _f0.6
    If groupings info was supplied
    * **groups** - This directory will be created if grouping information is supplied, see [below](#supplying-grouping-info) for details.   
    
    ## Column explaination for selectedClustersInfo.tab.txt
      
    
    All the data from the analysis done by processClusters is located in this file.  It is a very large file and can be overwhelming at first but all data is conveniently located in one place.  How the table is organized is each row represents a haplotype from a sample.  Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population.  At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample.  Also to define some terms to make sense of the column heading.  Each sample will have a one or more replicates, these are normally PCR replicates.  Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample.  The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1.  After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples.  This population analysis will result will what will be called haplotypes.  So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one.  Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number.  
    
    * **s_Sample** - The name of the sample the haplotype appears in  
    * **h_popUID** - The population id given to haplotype.  These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
    * **p_TotalPopulationSampCnt** -  This the the total number of samples in the population analysis   
    ***Haplotype Info***
    * **h_PopFrac** - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.  
    * **h_SumOfAllFracs** - This is the sum of all the relative abundances the haplotype was found at
    * **h_AvgFracFoundAt** - This is the average relative abundance the haplotype was found at  
    * **h_ReadFrac** -  This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
    * **h_SampCnt** - This is the number of samples the haplotype was found in.  
    * **h_SampFrac** - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
    * **h_ReadCnt** - This is the total number of reads associated with this haplotype
    * **h_ClusterCnt** - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.  
    * **h_clusterNames** - This is the names of the contributing clusters  
    ***Sample Info***
    * **s_Name** - The name of the sample the haplotype appears in
    * **s_ReadCntTotUsed** - This is the number of reads used out of this sample given with percentage of the total number of reads given to processClusters, the other percentage would be what was filtered off (low frequency and chimeras)
    * **s_InputClusterCnt** - The total number of clusters at the beginning of the analysis for this sample
    * **s_FinalClusterCnt** - The total number of clusters after analysis   
    ***Cluster Info***
    * **c_clusterID** - The id given to the cluster in the sample, the lower the number the higher the relative abundance in the sample
    * **c_AveragedFrac** - The averaged relative abundance between all replicates used in the analysis (this basically an un-weighted average of the abundances since it wouldn't depend on read counts)
    * **c_ReadFrac** - The number of reads associated with this cluster divided by the total number of reads found in all replicates, if there is only one replicate this number will be the same as c_AveragedFrac.  
    * **c_ReadCnt** -  The total number of reads associated with this cluster summed over all replicates
    * **c_RepCnt** - The number of reps this cluster was found in
    * **c_Consensus** - The consensus sequence for this cluster 
    * **c_InputCnt** - The number of contributing replicate clusters to this sample cluster
    * **c_ChiReadCnt** - The number of reads for the replicate clusters that are marked chimeric 
    * **c_ChiClusCnt** - The number of replicate clusters that are marked as chimeric
    * **c_ChiRepCnt** - The number of replicated this current sample was marked as chimeric
    * **c_InputNames** - The names of the replicate clusters that contributed to making this sample cluster 
    * **RXX_name** - The number of the replicate
    * **RXX_totalClusterCntExcluded** -  The total number of replicate clusters that were excluded from this replicate
    * **RXX_totalCntExcluded** - The total number of reads that were excluded from the replicate 
    * **RXX_totalFracExcluded** - The fraction of reads excluded from this replicate
    * **RXX_clusterCntChiExcluded** - The number of clusters that were marked chimeric in this replicate
    * **RXX_cntChiExcluded** - The total read count marked as chimeric in this replicate
    * **RXX_fracChiExcluded** - The fraction of the replicate that was excluded due to chimeric formation 
    * **RXX_MapFrac** - The relative abundance of this replicate cluster in this replicate
    * **RXX_ReadCnt** - The total number of reads this replicate cluster has
    * **RXX_ClusCnt** - The total number of clusters in this replicate
    * **RXX_totalReadCnt** - The total number of reads in this replicate
    * **c_bestExpected** -   A string with the name of the closest reference sequence see [below](#supplying-expected-reference-sequences) to see how to compare against expected references
    
    ## Column explaination for populationCluster.tab.txt
      
    
    * **h_popUID** - The population id given to haplotype.  These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in  
    * **p_TotalInputReadCnt** - The total number of reads in the whole experiment/population
    * **p_TotalInputClusterCnt** - The total number of clusters in the whole experiment/population
    * **p_TotalPopulationSampCnt** - The total number of samples in the whole experiment/population
    * **p_TotalHaplotypes** - The total number of final haplotypes in the whole experiment/population
    * **p_meanCoi** - The mean complexity of infection (COI) in the whole experiment/population
    * **p_medianCoi** -  The median complexity of infection (COI) in the whole experiment/population
    * **p_minCoi** -  The minimum complexity of infection (COI) in the whole experiment/population
    * **p_maxCoi** -  The max complexity of infection (COI) in the whole experiment/population
    * **h_PopFrac** - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.  
    * **h_SumOfAllFracs** - This is the sum of all the relative abundances the haplotype was found at
    * **h_AvgFracFoundAt** - This is the average relative abundance the haplotype was found at  
    * **h_ReadFrac** -  This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
    * **h_SampCnt** - This is the number of samples the haplotype was found in.  
    * **h_SampFrac** - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
    * **h_ReadCnt** - This is the total number of reads associated with this haplotype
    * **h_ClusterCnt** - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.  
    * **h_clusterNames** - This is the names of the contributing clusters  
    * **h_Consensus** - The consensus sequence of all contributing clusters to this haplotype
    * **h_Protein** - An estimated protein, will only be correct if the reading frame starts at 0 
    * **c_bestExpected** -   A string with the name of the closest reference sequence see [below](#supplying-expected-reference-sequences) to see how to compare against expected references
    
    # Examples
     
    
    ## Running
      
    processClusters needs only two options to run, name of the results files given by `--fastq` or `--fasta` in the directory tree (see [above](#input)) and a parameters file `--par` see [qluster usage](qluster_usage.html##format-of-paramters-files) for explanation of parameters file. Alternately to the `--par` you can use either `--noErrors` to allow for now errors to collapse between comparisons or `--strictErrors` to allow for just a few low quality mismatches.   
    
    ## Input results file format 
      
    Input reads can be fasta or fastq 
    
    ### Fastq
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt
    ```
    ### Fasta
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fasta output.fasta --par pars.txt
    ```
    
    ## Pars
    
    ### Just do a simple match allowing no differences
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --noErrors
    ```
    
    ### Allowing only 1 high quality mismatch
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --noErrors --hq 1
    ```
    
    ### Allowing a few low quality mismtaches and some 1 base indels (good for 454 and Ion Torrent)
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --strictErrors
    ```
    
    ### Allowing a few low quality mismtaches and but no indels (illumina)
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --strictErrors --illumina
    ```
    
    ### Allowing a few low quality mismtaches and but no indels (illumina) and 1 high quality mismatch
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --strictErrors --illumina --hq 1
    ```
    
    ## Different Population Parameters
    If you want to give different parameters to do the population clustering with just use the '-popPar` flag
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --popPar popPars.txt
    ```
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors
    ```
    
    ## Supplying grouping info
      
    To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to.  See below for an example file
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    cat exampleGroups.tab.txt 
    ```
    ```{r, engine='bash',comment="",highlight=TRUE, eval=T, echo=F}
    cat ../extraFiles/exampleGroups.tab.txt | column -t
    ```
    Just give this file name to the `--groups` flag  
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --groups exampleGroups.tab.txt
    ```
    This will create a directory called **groups** in the processClusters output directory.  This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like [selectedClusters.tab.txt](#column-explaination-for-selectedclustersinfo.tab.txt)) and a file called popFile.tab.txt (set up like [populationCluster.tab.txt](#column-explaination-forcluster.tab.txt)) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data.  This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group.  
    
    ## Supplying expected reference sequences
    To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the `--ref` or `--refFastq` flag.  The `--ref` flag is for when the sequences are in a fasta file and `--refFastq` is for when the seq file is a fastq file.  
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --ref references.fasta
    #or fastq
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --refFastq references.fastq
    ```
    
    ## Supplying experiment name
      
    By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment 
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --experimentName myExperiment
    ```
    
    ## Giving a relative abundance cut off
      
    By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn't be done in the experiment.  If you're data is especially noisy or you want to be more stringent in your cut off use the `--fracCutOff` flag, for example the filter off all clusters at 1% and lower use the below example 
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --fracCutOff 0.01
    ```
    
    ## Removing singletons
      
    By default all singletons clusters are thrown out before analysis is done per replicate.  To change this behavior use the `--clusterCutOff` flag 
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    #to keep singletons 
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --clusterCutOff 0
    #to remove clusters of size 2 and less 
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --clusterCutOff 2
    ```
    
    ## Marking Chimeras and checking Chimeras
      
    By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis.  To keep chimeras use the `--keepChimera` flag.  Also marking clusters can be done with [SeekDeep qluster](qluster_usage.html#marking-possible-chimeric-sequence) but can also be done by processClusters by using the same flags as qluster to make them,  `--markChimeras` and `--parFreqs` see [SeekDeep qluster](qluster_usage.html#marking-possible-chimeric-sequence) for an explanation of these flags.  
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    #keep chimeras 
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --keepChimeras
    #mark chimeras when they have parents that are greater than 3x in frequency  
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --makrChimeras --parFreqs 3
    ```
    Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the `--investigateChimeras` flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample.  The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it's not possible to have two parents that are greater than it is.  
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --par pars.txt  --investigateChimeras
    ```
    
    ## Extra cleanup
      
    `SeekDeep processClusters` also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%. 
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs
    ```
    The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using `--lowFreqMultiplier`  
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs --lowFreqMultiplier 40
    ```
    
    
    ## Variation calling and translation 
      
    `SeekDeep processClusters` also offers the ability to compare the final results to a reference genome(**--genomeFnp**) and associated annotation file(**--gffFnp**) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(**--knownAminoAcidChangesFnp**) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, <https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv>, should have a column of geneID that can be found in **--gffFnp** and 1-based position of the amino acid of interest.
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T}
    SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv
    
    ```
    
    The various variant calling results will be found in the output directory (above **analysis**) in a folder called **variantCalling**