SeekDeep
    extractor’s objectives

    The main purpose of extractor is to take raw data and demultiplex it by sample barcode (if present) and by primer pair. It requires only two arguments: an input file, which can be fastq (--fastq) or fasta (--fasta), and an ID file (--id) that supplies the primer pairs and, if present, the barcodes. It also performs some quality control on read length and quality scores.

    Paired End Extraction

    Paired-end data has its own command, SeekDeep extractorPairedEnd, which shares most of its command line options with the single-end SeekDeep extractor.

    Getting usage command line

    Calling either command with --help prints a help message on running the program.

    Code
    SeekDeep extractor --help 
    SeekDeep extractorPairedEnd --help 

    Flags and command names in SeekDeep are case insensitive, so all of the following give the same result:

    Code
    SeekDeep extractor --help 
    SeekDeep extractor --HELP
    SeekDeep extractor --HeLP
    SeekDeep extractor --HeLp
    SeekDeep extrActor --HeLP
    SeekDeep ExtraCtor --HeLp
    SeekDeep EXTRACTOR --HeLp

    Format of id file

    The id file is a tab-delimited file that contains primer pair information and, optionally, barcode information.
    Primer information starts with a header line whose first column must be either gene or target, followed by two columns for the forward and reverse primers.

    target  forward reverse
    PFAMA1  CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
    PFCSP   ACAATCAAGGTAATGGACAAGG  TTTTCAATATCATTTYCATAATCTAATT

    Primers

    The primers can contain ambiguous bases (Y, R, etc.) and should be all upper case.

    Forward Primer

    The forward primer should be in the 5'-3' direction.

    Reverse Primer

    The reverse primer should be in the 5'-3' direction as well.
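
To make the ambiguity codes concrete, here is a minimal Python sketch of how a primer containing IUPAC codes could be compared with the start of a read. It only illustrates the idea: the names IUPAC and primer_mismatches are hypothetical, and this is not SeekDeep's own matching code.

```python
# IUPAC nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer, read_prefix):
    """Count positions where the read base is not allowed by the primer code."""
    return sum(1 for code, base in zip(primer, read_prefix) if base not in IUPAC[code])


# The Y in the PFCSP reverse primer accepts either C or T at that position.
reverse_primer = "TTTTCAATATCATTTYCATAATCTAATT"
read_start = "TTTTCAATATCATTTCCATAATCTAATT"
print(primer_mismatches(reverse_primer, read_start))
```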

    MIDs

    If the data is multiplexed, the MID names and their barcodes come after the primer information. The barcode section must start with the header line id barcode; the id column name is what the parser looks for. MIDs should be written in the direction in which they would be found in reads that are in the direction of the forward primer. The layout is always assumed to be MID-PRIMER. If this is not the case for your data, contact Nick Hathaway (nicholas.hathaway@umassmed.edu) to see if he can add options to accommodate your data structure.

    target  forward reverse
    PFAMA1  CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
    PFCSP   ACAATCAAGGTAATGGACAAGG  TTTTCAATATCATTTYCATAATCTAATT
    id  barcode
    MID01   ACGAGTGCGT
    MID02   ACGCTCGACA
    MID03   AGACGCACTC
    MID04   AGCACTGTAG
    MID05   ATCAGACACG
    MID06   ATATCGCGAG
    MID07   CGTGTCTCTA
    MID08   CTCGCGTGTC
    MID10   TCTCTATGCG
    MID11   TGATACGTCT
    MID13   CATAGTAGTG
    MID14   CGAGAGATAC
    MID15   ATACGACGTA
    MID16   TCACGTACTA
    MID17   CGTCTAGTAC
    MID18   TCTACGTAGC
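
The layout described above (a primer section introduced by a gene/target header, followed by an optional barcode section introduced by an id header) can be read with a few lines of Python. This is a sketch under stated assumptions, not SeekDeep's parser: parse_id_file is a hypothetical helper, it splits on any whitespace rather than strictly on tabs, and it reads only one barcode column.

```python
import io


def parse_id_file(handle):
    """Split an id file into {target: (forward, reverse)} and {mid: barcode}."""
    primers, mids = {}, {}
    section = None
    for raw in handle:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] in ("gene", "target"):
            section = "primers"  # header of the primer section
        elif fields[0] == "id":
            section = "mids"  # header of the barcode section
        elif section == "primers":
            primers[fields[0]] = (fields[1], fields[2])
        elif section == "mids":
            mids[fields[0]] = fields[1]
    return primers, mids


example = (
    "target\tforward\treverse\n"
    "PFAMA1\tCAGGGAAATGTCCAGTATT\tCTTGAACATAAAGTCAATTC\n"
    "id\tbarcode\n"
    "MID01\tACGAGTGCGT\n"
    "MID02\tACGCTCGACA\n"
)
primers, mids = parse_id_file(io.StringIO(example))
print(primers)
print(mids)
```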

    Output files

    A directory is created for the output of extractor. Its name defaults to the name of the input file plus the word extractor and the date and time at which the command was run; this can be changed with the --dout flag. The directory contains the extracted sequences and several files reporting on how the extraction went.

    • extractionProfile.tab.txt - Reports stats on extraction per mid and primer pair
    • extractionStats.tab.txt - Reports stats on the extraction overall
    • runLog_extractor.txt - Contains the date, where the command was run, the command itself, and the total run time
    • filteredOff - A directory containing the reads that didn’t meet filtering criteria or match the input primers/mids

    The output files are named using the name given in the gene (or target) column plus, if the data is multiplexed, the name in the id column.

    Code
    ls ./
    PFAMA1MID01.fastq
    PFAMA1MID02.fastq
    PFAMA1MID03.fastq
    PFAMA1MID04.fastq
    PFAMA1MID05.fastq
    PFAMA1MID06.fastq
    PFAMA1MID07.fastq
    PFAMA1MID08.fastq
    extractionProfile.tab.txt
    extractionStats.tab.txt
    parametersUsed.txt
    runLog_extractor.txt

    Examples

    Required files

    SeekDeep extractor and SeekDeep extractorPairedEnd require an input sequence file and an identifier file.

    Different input formats

    Fastq Input

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt
    SeekDeep extractor --fastqgz example.fastq.gz --id ids.txt

    Fasta Input

    Code
    SeekDeep extractor --fasta example.fasta --id ids.txt

    Fasta and Qual Input

    Code
    SeekDeep extractor --fasta example.fasta --qual example.fasta.qual --id ids.txt
    #or
    SeekDeep extractor --stub example --id ids.txt

    Paired End

    For paired-end data the only input options are fastq or fastqgz, and both mates have to be given. This also assumes that R1 sequences are in the 5'-3' direction of the top strand and R2 sequences are in the 5'-3' direction of the bottom strand.

    Code
    SeekDeep extractorPairedEnd --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id ids.txt
    SeekDeep extractorPairedEnd --fastq1gz example_R1.fastq.gz --fastq2gz example_R2.fastq.gz --id ids.txt

    Multiplexed and multiple primer pairs

    Multiplex

    Multiplexing simply requires having the MIDs/barcodes in the id file. A header line that starts with id is required to tell SeekDeep extractor where the barcodes begin.

    Code
    cat id_file.tab.txt
    target  forwardPrimer   reversePrimer
    PFAMA1  CCATCAGGGAAATGTCCAGT    TTTCCTGCATGTCTTGAACA
    id  barcode
    MID01   ACGAGTGCGT
    MID02   ACGCTCGACA
    MID03   AGACGCACTC
    MID04   AGCACTGTAG
    MID05   ATCAGACACG
    MID06   ATATCGCGAG
    MID07   CGTGTCTCTA
    MID08   CTCGCGTGTC
    Code
    SeekDeep extractor --fastq example.fastq --id id_file.tab.txt --multiplex

    Multiple primer pairs

    Multiple primer pairs are handled simply by adding another primer line to the id file; this can also be combined with multiplexing.

    Code
    cat multipleGenePairs.id.txt
    gene    forwardPrimer   reversePrimer
    PFAMA1  CCATCAGGGAAATGTCCAGT    TTTCCTGCATGTCTTGAACA
    PFMSP1  AACTAGAAGCTTTAGAAGATGCA ACATATGATTGGTTAAATCAAAG
    id  barcode
    MID01   ACGAGTGCGT
    MID02   ACGCTCGACA
    MID03   AGACGCACTC
    MID04   AGCACTGTAG
    MID05   ATCAGACACG
    MID06   ATATCGCGAG
    MID07   CGTGTCTCTA
    MID08   CTCGCGTGTC
    Code
    SeekDeep extractor --fasta example.fasta --id multipleGenePairs.id.txt 

    Paired End

    For paired reads an additional file is required that indicates whether the mates overlap. There are three possible overlap scenarios: noOverLap, r1EndsInR2 (the normal overlap goal), and r1BeginsInR2 (happens with read-through). If there is overlap, SeekDeep extractorPairedEnd stitches the mates together; if there is no overlap, it extracts the mates into separate _R1 and _R2 files for the target. The overlap scenario has to be specified because PCR can generate artifacts in many ways (unspecific amplification, primer dimers, etc.) that lead to the wrong kind of overlap. Reads that overlap in the expected way are written to the final output files and the rest are assumed to be artifacts. See Illumina Paired Info Page for a diagram of how sequences overlap.

    The overlap status of each target is given in a file with two columns, target and status. The order of the columns does not matter. The values in the status column are case insensitive, but the values in the target column are case sensitive and must match the names given in the id file.

    Code
    cat overlapStatuses.txt
    target  status
    PFAMA1  R1BeginsInR2
    PFCSP   R1BeginsInR2
    Code
    SeekDeep extractorPairedEnd --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id multipleGenePairs.id.txt --overlapStatusFnp overlapStatuses.txt
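
Before a long run it can help to check that the overlap status file agrees with the id file. The Python sketch below shows one way to do that under the rules stated above (columns in any order, status case insensitive, target names matched exactly). read_overlap_statuses and VALID_STATUSES are hypothetical names, and the check is an illustration rather than what SeekDeep does internally.

```python
VALID_STATUSES = {"nooverlap", "r1endsinr2", "r1beginsinr2"}


def read_overlap_statuses(lines, id_targets):
    """Return {target: lower-cased status}, validating against the id file targets."""
    header = lines[0].split()
    target_col = header.index("target")
    status_col = header.index("status")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        target = fields[target_col]
        status = fields[status_col].lower()  # status is case insensitive
        if target not in id_targets:
            raise ValueError(f"target {target!r} is not in the id file")
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown overlap status {fields[status_col]!r}")
        statuses[target] = status
    return statuses


lines = ["target\tstatus", "PFAMA1\tR1BeginsInR2", "PFCSP\tR1BeginsInR2"]
print(read_overlap_statuses(lines, {"PFAMA1", "PFCSP"}))
```

Note that a check like this would flag a target such as PFCSP if it were paired with an id file that only lists PFAMA1 and PFMSP1.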

    Filtering parameters

    Length

    Reads are filtered on expected length by choosing a minimum length (--minLen) and a maximum length (--maxLen). By default these are set to within 20% of the median input read length.

    Minimum Length

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150

    Maximum Length

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150 --maxLen 300
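
As a rough illustration of the default described above (bounds within 20% of the median read length), the following Python sketch computes such a window. The function name default_length_bounds is hypothetical, and how SeekDeep rounds or samples the read lengths is not specified here, so treat the numbers as approximate.

```python
from statistics import median


def default_length_bounds(read_lengths, fraction=0.20):
    """Return (min_len, max_len) lying within `fraction` of the median length."""
    mid = median(read_lengths)
    return mid * (1 - fraction), mid * (1 + fraction)


low, high = default_length_bounds([200, 210, 220])
print(low, high)  # roughly 168 and 252 for a median of 210
```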

    You can also supply a file of length cutoffs, which is especially useful when different targets have different expected lengths. The file needs at least three columns: target, which contains the name of the target region and must match the target name in the --id file, and minlen and maxlen, which contain the minimum and maximum length for that target.
    The file is supplied with the --lenCutOffs flag.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --lenCutOffs lengthCutOffsForTargets.tab.txt 
    Code
    cat lengthCutOffsForTargets.tab.txt
    target  minlen  maxlen
    PFAMA1  200 240
    PFCSP   254 294
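
The per-target cutoffs can be pictured as a lookup table keyed by target name. The Python sketch below reads the three columns and applies them to a read; read_len_cutoffs and passes_length are hypothetical helpers, and treating the bounds as inclusive is an assumption.

```python
def read_len_cutoffs(lines):
    """Return {target: (minlen, maxlen)} from a table with a header line."""
    header = lines[0].split()
    cols = {name: header.index(name) for name in ("target", "minlen", "maxlen")}
    cutoffs = {}
    for line in lines[1:]:
        fields = line.split()
        if fields:
            cutoffs[fields[cols["target"]]] = (
                int(fields[cols["minlen"]]),
                int(fields[cols["maxlen"]]),
            )
    return cutoffs


def passes_length(target, seq, cutoffs):
    """True if the read length lies within the target's bounds (assumed inclusive)."""
    min_len, max_len = cutoffs[target]
    return min_len <= len(seq) <= max_len


table = ["target\tminlen\tmaxlen", "PFAMA1\t200\t240", "PFCSP\t254\t294"]
cutoffs = read_len_cutoffs(table)
print(passes_length("PFAMA1", "A" * 220, cutoffs))
```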

    Quality

    When quality values are supplied they can also be used to filter reads. There are two options for quality filtering. The default is the sliding quality window threshold method, which works well for Ion Torrent, 454, and PacBio data. The second option, the fraction of quality scores above a threshold, works well for Illumina data.

    Sliding window average threshold

    By default a sliding window of 50 bases is moved along the read in steps of 5 bases, and the average quality of the 50 bases in each window is checked against a threshold of 25. The defaults can be changed with the --qualWindow flag, which takes three comma-separated values: the window size, the window step, and the window threshold. The example below uses a window size of 50, a step of 5, and an average threshold of 20.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20

    The default behavior is to throw out any read in which any window fails. This can be changed to trimming the read at the failed window instead by using the --qualWindowTrim flag. The minimum length filter is applied after this step, so a read trimmed to below the minimum length is still filtered out.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20 --qualWindowTrim
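
The sliding window check can be sketched in Python as follows. This is an illustration of the technique, not SeekDeep's implementation. The helper names are hypothetical, and three details are assumptions: a window fails when its mean is strictly below the threshold, only full windows are evaluated (a read shorter than the window is treated as a single window), and trimming keeps the bases before the start of the first failed window.

```python
def first_failed_window(quals, size=50, step=5, threshold=25):
    """Return the start of the first window whose mean quality is below threshold."""
    if not quals:
        return None
    last_start = max(len(quals) - size, 0)
    for start in range(0, last_start + 1, step):
        window = quals[start:start + size]
        if sum(window) / len(window) < threshold:
            return start
    return None


def kept_length(quals, trim=False, **window_args):
    """Return how many bases are kept, or None if the read is discarded."""
    failed = first_failed_window(quals, **window_args)
    if failed is None:
        return len(quals)  # every window passed
    return failed if trim else None


# 100 high-quality bases followed by a low-quality tail of 50 bases.
quals = [30] * 100 + [10] * 50
print(kept_length(quals))             # discarded without trimming
print(kept_length(quals, trim=True))  # trimmed at the first failed window
```

A trimmed read would still have to pass the minimum length filter afterwards.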

    Quality fraction above a threshold

    A second option for filtering on quality scores uses the distribution of a read's quality scores. You choose a quality score cutoff and a fraction cutoff, and the read is kept only if at least that fraction of its quality scores is above the quality score cutoff. This is done with the --qualCheck and --qualCheckCutOff flags. The example below throws out any read in which less than 75% of the quality scores are above 25.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 25 --qualCheckCutOff .75
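
In code, the fraction check amounts to counting the scores that clear the cutoff. The Python sketch below is illustrative only: passes_qual_check is a hypothetical name, and whether a score equal to the cutoff counts as passing is an assumption (here it does).

```python
def passes_qual_check(quals, qual_check=25, cutoff=0.75):
    """True if at least `cutoff` of the scores are at or above `qual_check`."""
    if not quals:
        return False
    passing = sum(1 for q in quals if q >= qual_check)
    return passing / len(quals) >= cutoff


print(passes_qual_check([30] * 75 + [10] * 25))  # exactly 75% pass
print(passes_qual_check([30] * 74 + [10] * 26))  # 74% pass
```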

    Dry run quality testing

    Since extraction can take some time, it is useful to see how many reads you would lose with different settings. SeekDeep dryRunQualityFiltering does a dry run and outputs the percentage of reads you would lose with the given quality parameters; its arguments are similar to the ones above.

    Checking quality window

    Code
    SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,25
    Code
    SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,20
    Code
    SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,18

    Checking qualCheck

    Code
    SeekDeep dryRunQualityFiltering --fastq example.fastq --qualCheck 25 --qualCheckCutOff .75
    Code
    SeekDeep dryRunQualityFiltering --fastq example.fastq --qualCheck 20 --qualCheckCutOff .75

    Primers

    Various filtering parameters can be applied to the presence of primers. These include the percentage of the primer found, the number of mismatches to the primer, and not searching for the primer at all. See SeekDeep extractor --help for more details and defaults.

    Reverse Primer

    Searching for primers can be turned off so that only simple filtering is done. This requires the --noPrimers flag and an ID file with just one primer line, so that a target name can be given to the output files; the forward and reverse primer columns are ignored.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --noPrimers 

    Additional Options

    Looking in the reverse complement direction

    Depending on the library setup, reads can be found in two directions. Reads found in the reverse complement direction are reverse complemented and their names are marked with _Comp, so that all sequences end up in the same direction. There are separate flags for searching for MIDs in both directions (--checkRevComplementForMids) and for searching for primers in both directions (--checkRevComplementForPrimers). Depending on the library prep, you might have to combine the two.

    Code
    SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForMids
    Code
    SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForPrimers
    Code
    SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForMids --checkRevComplementForPrimers
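
The effect of checking the reverse complement can be sketched in Python: if the forward primer is not found at the start of the read, the read is reverse complemented and checked again, and a hit in that direction gets the _Comp suffix. The helper names are hypothetical, and exact prefix matching is used for brevity, whereas the real search allows mismatches and offsets.

```python
# Complement table covering the standard bases and IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYKMBDHVSWN", "TGCAYRMKVHDBSWN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(name, seq, forward_primer):
    """Return (name, seq) in forward-primer orientation, or None if not found."""
    if seq.startswith(forward_primer):
        return name, seq
    flipped = reverse_complement(seq)
    if flipped.startswith(forward_primer):
        return name + "_Comp", flipped  # mark reads that were flipped
    return None


forward_primer = "CAGGGAAATGTCCAGTATT"
read = forward_primer + "AAAA"
print(orient("read1", read, forward_primer))
print(orient("read2", reverse_complement(read), forward_primer))
```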

    Looking for barcodes/primers at various locations

    Barcodes and primers aren’t always located at the very beginning of the sequences, so the search region sometimes has to be expanded. This is done with the --midWithinStart flag for MIDs and the --primerWithinStart flag for primers. The examples below check for barcodes and primers that start within the first 10 bases of the sequence. The primer/MID only has to start within this many bases and does not have to be contained completely within the region, e.g. a primer found at base 10 that extends to base 30 would still be caught. When data is multiplexed, the primer within start parameter is counted from the found MID and not from the start of the sequence.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 10
    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --primerWithinStart 10

    Again, depending on the library prep you might need to combine these two flags, for instance if you had up to 4 variable bases in front of your MIDs and then up to 10 additional variable bases between your MIDs and primers.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 4 --primerWithinStart 10

    It is advised to keep these numbers low as a high number will greatly increase the search space and could lead to many false positives.
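
The following Python sketch shows how the two within-start values interact. It is a simplified illustration with exact matching only: find_within_start is a hypothetical helper, offsets 0 through the given value are all treated as allowed, and the primer offset is counted from the end of the found MID, which is an assumption about how "within the found MID" is measured.

```python
def find_within_start(seq, motif, within_start=0):
    """Return the first offset <= within_start at which motif starts, else -1."""
    for pos in range(within_start + 1):
        if seq.startswith(motif, pos):
            return pos
    return -1


mid = "ACGAGTGCGT"
primer = "CAGGGAAATGTCCAGTATT"
# 2 extra bases before the MID and 3 between the MID and the primer.
read = "GG" + mid + "TTT" + primer + "ACGT"

mid_pos = find_within_start(read, mid, within_start=4)
after_mid = read[mid_pos + len(mid):]
primer_pos = find_within_start(after_mid, primer, within_start=10)
print(mid_pos, primer_pos)
```

Every extra allowed offset is another position at which a short motif can match by chance, which is why small values are preferable.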

    Barcodes

    Dual barcodes

    Dual barcoding schemes are supported by adding another column to the barcode portion of the input ID file. The first barcode column is assumed to be associated with the forward target primer and the second column is assumed to be associated with the reverse target primer.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt

    Allowing errors in barcodes

    You can allow mismatches in barcodes using --barcodeErrors

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --barcodeErrors 2
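
Allowing barcode errors trades sensitivity for the risk of assigning a read to the wrong sample. A minimal Python sketch of mismatch-tolerant assignment is shown below; hamming and match_barcode are hypothetical names, only substitutions are counted, and returning no match when more than one barcode fits is an assumption rather than documented SeekDeep behavior.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def match_barcode(read_start, mids, max_errors=0):
    """Return the single MID within max_errors mismatches, or None."""
    hits = [
        name
        for name, barcode in mids.items()
        if len(read_start) >= len(barcode)
        and hamming(barcode, read_start[:len(barcode)]) <= max_errors
    ]
    return hits[0] if len(hits) == 1 else None


mids = {"MID01": "ACGAGTGCGT", "MID02": "ACGCTCGACA"}
read_start = "ACGAGTGCGA"  # one mismatch relative to MID01
print(match_barcode(read_start, mids))                # no exact match
print(match_barcode(read_start, mids, max_errors=2))  # matches MID01
```

It is worth checking the pairwise distances between your barcodes before raising the limit: if two barcodes differ at only a few positions, a generous error allowance makes them indistinguishable.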

    Renaming of sequence ids

    The original sequence names are often unwieldy; adding the --rename flag renames each sequence after the primer/MID it was found with.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt  --rename

    A file will be created called renameKey.tab.txt which contains the name conversion.

    Code
    cat renameKey.tab.txt
    originalName    newName
    JUW96:00012:00129   PFAMA1MID01.00
    JUW96:00013:00133   PFAMA1MID01.01
    JUW96:00017:00147   PFAMA1MID01.02_Comp
    JUW96:00018:00120   PFAMA1MID01.03
    JUW96:00020:00136   PFAMA1MID01.04_Comp
    JUW96:00029:00127   PFAMA1MID01.05
    JUW96:00032:00129   PFAMA1MID01.06
    JUW96:00033:00147   PFAMA1MID01.07_Comp
    JUW96:00035:00137   PFAMA1MID01.08_Comp

    Changing out directory name

    To change the default directory name use the --dout flag. SeekDeep will never overwrite a directory if it already exists and will fail and quit if it tries to create a directory that exists.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir

    The --dout option also understands the keyword TODAY, which is replaced with the current date and time. This means an output directory name can never contain TODAY in all caps.

    Code
    SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir_TODAY
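
The keyword substitution can be pictured with a short Python sketch. The function expand_dout is hypothetical and the timestamp format used here is an assumption; the exact format SeekDeep writes is not stated on this page.

```python
from datetime import datetime


def expand_dout(name, now=None, fmt="%Y-%m-%d_%H.%M"):
    """Replace every all-caps TODAY in a directory name with a timestamp."""
    now = now or datetime.now()
    return name.replace("TODAY", now.strftime(fmt))


print(expand_dout("extractionDir_TODAY", datetime(2020, 1, 2, 3, 4)))
```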
    Source Code
    :::{.callout-note}
    # extractor's objectives    
      
    The main purpose of extractor is to take raw data and demultiplex by sample barcodes if present and by primer pairs.  It at most requires two arguments: an input file which can be fastq (`--fastq`) or fasta (`--fasta`) and an ID file (`--id`) that supplies primer pairs and barcodes if present. It will also do some quality control as well for length and quality scores. 
    
    :::
    
    
    # Paired End Extraction  
    There is a separate extractor command for extracting paired end data and is simply called `SeekDeep extractorPairedEnd` and shares most of the command line options with regular single end `SeekDeep extractor`
    
    # Getting usage command line  
      
    Calling `-help` will give a help message on running the program 
    ```{r, engine='bash',comment="",eval=FALSE}
    SeekDeep extractor --help 
    SeekDeep extractorPairedEnd --help 
    ```
    Also all flags in SeekDeep are case insensitive and so all the following would have the same results 
    ```{r, engine='bash',comment="",eval=FALSE}
    SeekDeep extractor --help 
    SeekDeep extractor --HELP
    SeekDeep extractor --HeLP
    SeekDeep extractor --HeLp
    SeekDeep extrActor --HeLP
    SeekDeep ExtraCtor --HeLp
    SeekDeep EXTRACTOR --HeLp
    ```
    ```{r, engine='bash',comment="",eval=FALSE, echo=FALSE}
    SeekDeep extractor --getFlags
    ```
    
    ```{r, engine='bash',comment="",highlight=TRUE, eval=FALSE, echo=FALSE}
    SeekDeep extractor --getFlags | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" | head -20
    echo ...
    echo ...
    
    ```
    
    # Format of id file  
      
    The id file is a tab delimited file that contains both primer pair info and optional barcode information  
    Primer information starts with a header line that must start with either `gene` or `target` and have two additional columns for forward and reverse primers, the line can start with either gene or target.   
    ```{r,engine='bash',comment="", echo=FALSE}
    head -3 ../extraFiles/pfama1_csp_ids.tab.txt 
    ```  
    ## Primers  
      
    The primers can contain [ambigious bases](http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html) (Y,R,etc.) and should be all upper case  
    
    ### Forward Primer  
      
    Forward primer should be in the 5\`-3\` direction.  
    
    ### Reverse Primer  
      
    The reverse primer should in in the 5\`-3\` direction as well.  
    
    ## MIDs
      
    If data is multiplexed the MID name and corresponding barcode information comes after the primer information.  Barcode information needs to start with a header `id  barcode` which the `id` is the important name for parsing.  Mids should be in the direction they would be found in if reads are found in the direction of the forward primer.  It is always assumed that the mid and primer set up is MID-PRIMER.  If this is not the case for your data contact Nick Hathaway (nicholas.hathaway@umassmed.edu) to see if he can add options for accommodating your data structure.  
    ```{r,engine='bash',comment="", echo=FALSE}
    cat ../extraFiles/pfama1_csp_ids.tab.txt
    ```  
    
    
    # Output files
      
    A directory will be created for the output of extractor, this defaults to the name of the file plus the work extractor and then the current date and time of the when the command was run.  This can be changed with the `--dout` flag. In this directory will be the sequences and several files reporting on how the extraction when.  
    
    *  **extractionProfile.tab.txt** - Reports stats on extraction per mid and primer pair  
    *  **extractionStats.tab.txt** - Reports stats on the extraction overall  
    *  **runLog_extractor.txt** - Contains the date and where the command was run on and what it was and total run time  
    *  **filteredOff** - A directory containing the reads that didn't meet filtering criteria or match the input primers/mids
    
    Example of extractionProfile.tab.txt
    ```{r engine='bash', echo=FALSE}
    cat extraFiles/exampleExtraction/extractionProfile.tab.txt | column -t 
    ```  
    Example of extractionStats.tab.txt  
    ```{r engine='bash', echo=FALSE}
    cat extraFiles/exampleExtraction/extractionStats.tab.txt | column -t 
    ```
    
    The output files will be named by the using the name given in the gene column and the name in id column if the data is multiplexed.  
    ```{r engine='bash', echo=TRUE, eval=FALSE}
    ls ./
    ```  
    ```{r engine='bash', echo=FALSE}
    ls ../extraFiles/exampleExtraction/
    ```
    
    # Examples  
    ## Required files 
    `SeekDeep extractor` and `SeekDeep extractorPairedEnd` require an input sequence file and an identifier file.  
    
    ### Different input formats  
      
    
    #### Fastq Input  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt
    SeekDeep extractor --fastqgz example.fastq.gz --id ids.txt
    ```
    #### Fasta Input  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fasta example.fasta --id ids.txt
    ```
    #### Fasta and Qual Input  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fasta example.fasta --qual example.fasta.qual --id ids.txt
    #or
    SeekDeep extractor --stub example --id ids.txt
    ```
    
    ### Paired End  
    For paired end the only options are fastq or fastqgz and both mates have to be indicated. This also assumes that R1 sequences are in the 5\`-3\` direction of the top strand, and R2 are in the 5\`-3\` direction of the bottom strand.   
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractorPairedEnd --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id ids.txt
    SeekDeep extractorPairedEnd --fastq1gz example_R1.fastq.gz --fastq2gz example_R2.fastq.gz --id ids.txt
    ```
    
    
    ### Multiplexed and multiple primer pairs
      
    
    #### Multiplex  
      
    Multiplexing simply requires having MID/barcodes in id file, a header line is required that starts with `id` to indicate to `SeekDeep extractor` that barcodes are starting.   
    ```{r engine='bash', eval=F}
    cat id_file.tab.txt
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    cat ../extraFiles/id_file.tab.txt
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id id_file.tab.txt --multiplex
    ```
    
    #### Multiple primer pairs  
      
    Multiple primer pairs is done simply by adding another primer info line, this can also be combined with multiplexing
    ```{r engine='bash', eval=F}
    cat multipleGenePairs.id.txt
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    cat ../extraFiles/multipleGenePairs.id.txt
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fasta example.fasta --id multipleGenePairs.id.txt 
    ```
    
    ### Paired End 
    For paired reads an additional file is required that indicates if there is overlap between the mates. There are three possible overlap scenarios: noOverLap, r1EndsInR2 (normal overlap goal), and r1BeginsInR2 (happens with read through). If there is overlap `SeekDeep extractorPairedEnd` will stitch together the mates and if there is no overlap it will extract and have a _R1 and _R2 files for the target. The different overlap scenarios are required as there are many ways to generate artifacts in PCR (unspecific amplification, primer dimers, etc) that could lead to the wrong overlap and therefore reads that overlap in the way that they are expected to are given in the final output files and the rest assumed to be artifact. See [Illumina Paired Info Page](illumina_paired_info.html) for a diagram on how sequences overlap.  
    
    The overlap status of the targets is given by a file with two columns, `target` and `status`. The order of columns does not matter and the case of the status column does not matter, the target column does. The name given in the target column must match with the names given in the id file.  
    
    ```{bash, eval = F}
    cat overlapStatuses.txt
    ```
    
    ```{bash, echo = F}
    cat ../extraFiles/refSeqs/forSeekDeep/overlapStatuses.txt
    ```
    
    ```{bash, eval = F}
    SeekDeep extractor --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id multipleGenePairs.id.txt --overlapStatusFnp overlapStatuses.txt 
    ```
    
    ## Filtering parameters
      
    
    ### Length
    Reads are filtered on expected length by choosing a minimum length (`--minLen`) and a maximum length (`--maxLen`), these default to be within 20% of the median input read lengths  
    
    #### Minimum Length
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150
    ```
    #### Maximum Length
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150 --maxLen 300
    ```
    You can also supply a file with len cutoffs which is especially useful for when you have different expected lengths for different targets, the file needs to contain at least three columns with one column named target which contain the name of the target region which must match the target name in the `--id` file and two columns `minlen` and `maxlen` which contain the min and max length for the target.  
    This is supplied with the `--lenCutOffs` flags.  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --lenCutOffs lengthCutOffsForTargets.tab.txt 
    ```
    ```{bash, eval = F}
    cat lengthCutOffsForTargets.tab.txt
    ```
    ```{bash, echo = F}
    cat ../extraFiles/refSeqs/forSeekDeep/lenCutOffs.txt
    ```
    
    ### Quality
    When quality values are supplies they can be used to filter reads as well, there are two options for quality filtering.  The default is to use the sliding quality window threshold method, this is good for Ion Torrent, 454, and PacBio.  The second quality option is good for Illumina
    
    #### Siding window average threshold
    By default a sliding window of size 50 and stepping by every 5 bases and checking to see if the average quality of those 50 bases are above a threshold of 25.  The default values can be changed by using the `--qualWindow` flag and supplying three values separated by commas which are the window size, window step, and window threshold.  So the below example would use a window size of 50, a step of 5, and a average threshold of 20
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20
    ```
    Also the default behavior is to throw out any read where this ever fails and this can be changed to trimming the read at the failed window instead using the `-qualWindowTrim` flag.  The minimum length filter is applied after this step so if the trim falls below the min len it will still be filtered out  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20 --qualWindowTrim
    ```
    
    #### Quality fraction above a threshold
    A second option for filtering using quality scores is using the distribution of a reads quality scores.  This is done by choosing a quality score cut off and checking to see if a certain fraction of quality scores are above a fraction cut off.  This is done by using the `-qualCheck` and the `qualCheckCutOff` flags.  So the below example would throw out any reads where less than 75% of their quality scores was above 25
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 25 --qualCheckCutOff .75
    ```
    ### Dry run quality testing
    Since extraction might take some time, it might be nice to see how many reads you would loose with different setting, so a command was added to SeekDeep to do a dry run to output the percentage of reads you would loose if certain quality parameters were used, this is called **SeekDeep dryRunQualityFiltering** and the arguments are similar to the arguments above.  
    
    #### Checking quality window 
    ```{r engine='bash', eval=FALSE}
    SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,25
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq  --qualWindow 50,5,25 | tail -13 | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g"
    
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,20
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq --qualWindow 50,5,20 | tail -13 | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g"
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,18
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq  --qualWindow 50,5,18 | tail -13 |gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g"
    
    ```
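
    Comparing several thresholds like this is easy to script. The loop below is a minimal sketch that only uses the flags shown above; `sweep_qual_window` is a hypothetical helper, and the `SEEKDEEP` variable is an assumption of this sketch (not a SeekDeep feature) that lets the loop be tried with a stand-in command.

    ```bash
    # Sketch: run the dry run once per window threshold.
    # SEEKDEEP is a hypothetical override of the program name, defaulting to SeekDeep.
    sweep_qual_window() {
      # $1 = fastq file, remaining arguments = window thresholds to try
      fastq="$1"
      shift
      for threshold in "$@"; do
        echo "== --qualWindow 50,5,${threshold} =="
        "${SEEKDEEP:-SeekDeep}" dryRunQualityFiltering --fastq "${fastq}" --qualWindow "50,5,${threshold}"
      done
    }

    # Usage: sweep_qual_window example.fastq 25 20 18
    ```

    Setting `SEEKDEEP=echo` prints the commands instead of running them, which is a quick way to check the loop before pointing it at a large file.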
    
    #### Checking qualCheck 
    ```{r engine='bash', eval=FALSE}
    SeekDeep dryRunQualityFiltering --fastq example.fastq --qualCheck 25 --qualCheckCutOff .75
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq  --qualCheck 25 --qualCheckCutOff .75 | tail -13 |gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g"
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep dryRunQualityFiltering --fastq example.fastq --qualCheck 20 --qualCheckCutOff .75
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq  --qualCheck 20 --qualCheckCutOff .75 | tail -13 |gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g"
    ```
    
    ### Primers  
      
    
    Various filtering parameters can be applied to the presence of primers.  These include the percent of the primer found, the number of mismatches to the primer, and not searching for the primer at all.  See `SeekDeep extractor -help` for more details and defaults.
    
    #### Turning off primer searching 
    Searching for primers can be turned off so that only simple filtering is done.  This requires the `--noPrimers` flag and an ID file with just one primer line, so that a target name can be given to the output files; the forward and reverse primer columns are ignored.  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --noPrimers 
    ```
    
    ## Additional Options
      
    
    ### Looking in the reverse complement direction
    Depending on the library setup, reads can be found in either direction.  Reads found in the reverse complement direction are reverse complemented and their names are marked with `_Comp`, so that all sequences end up in the same direction.  There are separate flags for searching for MIDs in both directions (`--checkRevComplementForMids`) and for searching for primers in both directions (`--checkRevComplementForPrimers`).  Depending on the library prep, you might have to combine the two.    
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForMids
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForPrimers
    ```
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForMids --checkRevComplementForPrimers
    ```
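
    For readers unfamiliar with the term, reverse complementing a sequence means complementing each base and reversing the order. The snippet below is a small illustration using standard tools; `revcomp` is a hypothetical helper, not part of SeekDeep, and it leaves any character other than A, C, G, and T unchanged.

    ```bash
    # Sketch: reverse complement a DNA string with tr and awk.
    revcomp() {
      # complement each base, then reverse the string
      printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
    }

    revcomp AACGT
    ```

    Here `AACGT` becomes `ACGTT`, which is the orientation a read found in the reverse complement direction is converted to.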
    
    ### Looking for barcodes/primers at various locations
      
    
    Barcodes and primers aren't always located at the very beginning of the sequences, so the search region sometimes has to be expanded.  This is done with the `--midWithinStart` flag for MIDs and the `--primerWithinStart` flag for primers.  The examples below check for barcodes and primers that start within the first 10 bases of the sequence.  The primer/MID only has to start within this many bases and does not have to be contained completely within this region, e.g. a primer found at base 10 that extends to base 30 would still be caught with this flag.  When data is multiplexed, the primer within start parameter is counted from the found MID and not from the start of the sequence.      
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 10
    ```
    
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --primerWithinStart 10
    ```
    
    Again, depending on the library prep, you might need to combine these two flags, for instance if you had a variable number of bases (up to 4) in front of your MIDs and an additional variable number of bases (let's say up to 10) between your MIDs and primers.
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 4 --primerWithinStart 10
    ```
    
    It is advised to keep these numbers low as a high number will greatly increase the search space and could lead to many false positives.  
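
    The "start within" rule can be illustrated with a short sketch. `starts_within` is a hypothetical helper that uses exact matching only, whereas extractor also tolerates mismatches, and it assumes that "within N" means the 1-based start position of the primer or MID is at most N.

    ```bash
    # Sketch only: exact matching of a primer/MID that must start within the first N bases.
    starts_within() {
      # $1 = read sequence, $2 = primer or MID, $3 = allowed start window
      awk -v seq="$1" -v target="$2" -v within="$3" 'BEGIN {
        pos = index(seq, target)   # 1-based position of the first exact match, 0 if absent
        if (pos >= 1 && pos <= within) print "found at " pos
        else print "not found"
      }'
    }

    starts_within "NNNNACGTTGCAGGATTTT" "ACGTTGCA" 10
    starts_within "NNNNACGTTGCAGGATTTT" "ACGTTGCA" 3
    ```

    The target starts at base 5 and runs past base 10, so it is found with a window of 10 but not with a window of 3.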
    
    ## Barcodes 
    ### Dual barcodes 
    Dual barcoding schemes are supported by adding another column to the barcode portion of the input ID file.  The first column is assumed to be associated with the forward target primer and the second column is assumed to be associated with the reverse target primer.  
    
    
    ![Dual barcoding scheme](../images/dual_barcoding_scheme.jpg)
    
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt
    ```
    
    
    
    ### Allowing errors in barcodes 
    You can allow mismatches in barcodes using the `--barcodeErrors` flag followed by the number of mismatches to allow.
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --barcodeErrors 2
    ```
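
    To get a feel for what a given number of barcode errors means, the sketch below counts position-by-position mismatches between an expected barcode and the bases observed in a read. `barcode_mismatches` is a hypothetical helper; whether extractor compares barcodes exactly this way (for example, how it treats insertions and deletions) is an assumption.

    ```bash
    # Sketch only: count mismatches between two strings of equal length.
    barcode_mismatches() {
      # $1 = expected barcode, $2 = bases observed at the barcode position
      awk -v a="$1" -v b="$2" 'BEGIN {
        n = 0
        for (i = 1; i <= length(a); i++) if (substr(a, i, 1) != substr(b, i, 1)) n++
        print n
      }'
    }

    barcode_mismatches ACGTAC ACCTAA
    ```

    These two strings differ at positions 3 and 6, so the sketch prints 2. Running the same comparison between the barcodes in your own ID file shows how different they are from each other, which is worth knowing before allowing many errors.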
    
    
    ### Renaming of sequence ids
      
    
    The original sequence names are sometimes unwieldy, so adding the `--rename` flag renames each sequence according to the primer/MID it was found with.
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt  --rename
    ```
    A file will be created called `renameKey.tab.txt` which contains the name conversion.
    ```{r engine='bash', eval=FALSE}
    cat renameKey.tab.txt
    ```
    ```{r engine='bash', eval=T, echo=FALSE}
    cat ../extraFiles/exampleRenameFile.txt
    ```
    
    
    ### Changing out directory name 
      
    
    To change the default directory name, use the `--dout` flag.  SeekDeep never overwrites an existing directory; it fails and quits if the directory it tries to create already exists.  
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir
    ```
    The `--dout` option also understands the keyword TODAY, which is replaced with the current date and time.  This means an output directory name can never contain TODAY in all caps.
    ```{r engine='bash', eval=FALSE}
    SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir_TODAY
    ```
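
    Because SeekDeep quits rather than overwrite an existing directory, it can be convenient to check the name before starting a long run. The guard below is a minimal sketch using only standard shell tests; `dout_is_free` is a hypothetical helper name.

    ```bash
    # Sketch: succeed only when the intended output directory does not exist yet.
    dout_is_free() {
      # $1 = directory name intended for --dout
      if [ -e "$1" ]; then
        echo "output directory $1 already exists, choose another --dout name" >&2
        return 1
      fi
      echo "ok to use $1"
    }

    # Usage: dout_is_free extractionDir && SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir
    ```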