SeekDeep
    makeSampleDirectories’s Purpose

    The intent of makeSampleDirectories is to set up the directory tree needed by SeekDeep processClusters and to make it easier to put output files into that tree. SeekDeep makeSampleDirectories needs two arguments, given by the --file and --dout flags. The --file flag supplies the set up file (explained below) and the --dout flag names the directory that will be created (it will not overwrite an existing directory). It has mostly been developed for use with multiplexed data.

    Getting usage command line

    Running the program with no arguments prints a help message on how to run it:

    Code
    SeekDeep makeSampleDirectories 
    makeSampleDirectoires 
    Set up a directory tree for processClusters 
    Commands, order not necessary, flags are case insensitive 
    Required commands 
    --file [option], name of the file of sample names to read in 
    --dout [option], name of the main directory to create 
    File should be tab delimited and a few examples are below 
    File should have at least three columns 
    Where first column is the name of the index or sff file used, second column is 
        the sample names, and all following columns are the MIDs for that samples 
        samples 
    Example with two replicates and two separate master indexes
    1   090-00  MID01   MID02
    1   090-24  MID03   MID04
    1   090-48  MID05   MID06
    ...
    ...

    Calling --help does the same:

    Code
    SeekDeep makeSampleDirectories --help 

    All flags in SeekDeep are case insensitive (as is the program name in the examples below), so all of the following give the same result:

    Code
    SeekDeep makeSampleDirectories --help 
    SeekDeep MakeSampleDirectories --HELP
    SeekDeep makeSampleDirectories --HeLP
    SeekDeep MAKESAMPLEDIRECTORIES --HeLp
    SeekDeep makesampledirectories --HeLp
    SeekDeep makesampleDirectories --HeLp

    Format of set up file

    The set up file contains at least three columns. The first column is an identifier for the sequence file that contains the sample's sequence data. The second column is the name of the sample. The third column is the name of the MID for that sample, and each additional column is the MID of another replicate for that sample in that file. Blank lines and lines that start with a # are ignored.
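As a minimal sketch of the layout described above, the following writes a small set up file and checks its column counts with awk. The file name `mySampleNames.tab.txt` and the second index are made up for illustration; the sample and MID names follow the pattern shown in the help message.

```shell
# Hypothetical set up file: index, sample name, then one MID per replicate (tab separated)
printf '#index\tsample\trep1\trep2\n'  > mySampleNames.tab.txt
printf '1\t090-00\tMID01\tMID02\n'    >> mySampleNames.tab.txt
printf '1\t090-24\tMID03\tMID04\n'    >> mySampleNames.tab.txt
printf '\n'                           >> mySampleNames.tab.txt
printf '2\t090-48\tMID01\tMID02\n'    >> mySampleNames.tab.txt

# Report any data line with fewer than three columns; comment and blank lines are skipped
awk -F'\t' '!/^#/ && NF > 0 && NF < 3 { print "line " NR ": too few columns"; bad = 1 }
            END { exit bad + 0 }' mySampleNames.tab.txt && echo "set up file looks OK"
```

This check only mirrors the rules stated on this page (at least three columns, # and blank lines ignored); it is not part of SeekDeep itself.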

    Dual replicates

    The example below is for a sequencing experiment with 20 samples, each with 2 PCR replicates, where the sequencing itself was done on two lanes or cells (e.g. two Illumina lanes or two different Ion Torrent runs).

    Code
    cat exampleSampleNames.tab.txt
    Code
    SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters

    Single replicates

    The example below is for a sequencing experiment with 20 samples, each with only 1 PCR replicate, where the sequencing itself was done on two lanes or cells (e.g. two Illumina lanes or two different Ion Torrent runs).

    Code
    cat exampleSampleNamesSingleReps.tab.txt
    Code
    SeekDeep makeSampleDirectories --file exampleSampleNamesSingleReps.tab.txt --dout filesForProcessClusters

    Mixed replicates

    You might have an experiment like the one with dual replicates but with only one replicate for a couple of samples, perhaps due to poor amplification or some other reason. This is handled simply by mixing the two formats above.

    Code
    cat exampleSampleNamesMixed.tab.txt
    Code
    SeekDeep makeSampleDirectories --file exampleSampleNamesMixed.tab.txt --dout filesForProcessClusters
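To illustrate what mixing the two formats can look like, here is a small hypothetical set up file in which one sample has a single replicate, followed by an awk one-liner that counts replicates per sample. The file name and its contents are assumptions made for this sketch, not the contents of exampleSampleNamesMixed.tab.txt.

```shell
# Hypothetical mixed set up file: 090-24 has a single replicate, the others have two
printf '1\t090-00\tMID01\tMID02\n'  > myMixedSampleNames.tab.txt
printf '1\t090-24\tMID03\n'        >> myMixedSampleNames.tab.txt
printf '1\t090-48\tMID05\tMID06\n' >> myMixedSampleNames.tab.txt

# Print each sample with its replicate count (the columns after the first two)
awk -F'\t' '!/^#/ && NF >= 3 { print $2 "\t" (NF - 2) }' myMixedSampleNames.tab.txt
```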

    Output directory name

    The output directory is the one named by the --dout flag. An existing directory will never be overwritten. If the name given to --dout contains the word TODAY in all caps, it is replaced with the current date and time.

    Code
    SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
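A rough shell equivalent of the TODAY substitution is sketched below. The timestamp format is an assumption inferred from the example directory name used on this page (filesForProcessClusters_2015-06-14_12.22); the exact format SeekDeep uses is not stated here.

```shell
# Assumed format: YYYY-MM-DD_HH.MM, inferred from the example directory name on this page
stamp="$(date +%Y-%m-%d_%H.%M)"
dout="filesForProcessClusters_TODAY"

# Replace the literal word TODAY with the timestamp
expanded="$(printf '%s\n' "$dout" | sed "s/TODAY/$stamp/")"
echo "$expanded"
```

Because the final name depends on when the command ran, something like `ls -d filesForProcessClusters_*` is a simple way to find the directory afterwards.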

    Output files

    A master directory is built using the name given by --dout. Inside it is the directory tree needed by SeekDeep processClusters: a top directory containing one directory per sample in the analysis, with each sample directory containing all replicate directories for that sample.

    Code
    SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
    cd filesForProcessClusters_2015-06-14_12.22
    tree

    Location file for qluster

    Another directory, called locationByIndex, will contain files intended for use with SeekDeep qluster to help direct its output into this directory tree. These files are given to qluster with the --additionalOut flag; see the qluster usage page for details on this flag. The idea is that each input sequence file contains several MIDs, and once it has been extracted by SeekDeep extractor the reads for each MID will be in separate files. The above example had two input sequence files. When running qluster on the files extracted from the first input file, give the location file for that input file.
    Here are the files from the above example:

    Code
    SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
    ls filesForProcessClusters_2015-06-14_12.22/locationByIndex

    Each file contains two columns: the first is the MID name and the second is the location where the qluster output for that MID should go.

    Code
    SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
    cat filesForProcessClusters_2015-06-14_12.22/locationByIndex/1.tab.txt
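To make the two-column layout concrete, the sketch below builds a hypothetical location file and looks up the output location for one MID. The file name `1.example.tab.txt` and the paths in it are made up for illustration; the real files are written by makeSampleDirectories and their paths will differ.

```shell
# Hypothetical location file: MID name, then the directory for that MID's qluster output
# (the paths are invented for this sketch)
printf 'MID01\t/home/user/filesForProcessClusters/090-00/MID01\n'  > 1.example.tab.txt
printf 'MID02\t/home/user/filesForProcessClusters/090-00/MID02\n' >> 1.example.tab.txt

# Look up where the clustering output for a given MID is expected to go
mid="MID02"
awk -F'\t' -v mid="$mid" '$1 == mid { print $2 }' 1.example.tab.txt
```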

    When running qluster on the files produced by extracting each input file, give the location file that matches that input file. The extraction in this example is done with the following id file:

    Code
    cat pfama1_ids.tab.txt
    gene    forward reverse
    PFAMA1  CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
    id  barcode
    MID01   ACGAGTGCGT
    MID02   ACGCTCGACA
    MID03   AGACGCACTC
    MID04   AGCACTGTAG
    MID05   ATCAGACACG
    MID06   ATATCGCGAG
    MID07   CGTGTCTCTA
    ...
    ...
    Code
    SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters
    SeekDeep extractor --fastq input1.fastq --id ids.txt --dout extraction1
    cd extraction1
    SeekDeep qluster --fastq PFAMA1MID01.fastq --par pars.tab.txt --additionalOut ../filesForProcessClusters/locationByIndex/1.tab.txt
    ...
    cd ..
    SeekDeep extractor --fastq input2.fastq --id ids.txt --dout extraction2
    cd extraction2
    SeekDeep qluster --fastq PFAMA1MID01.fastq --par pars.tab.txt --additionalOut ../filesForProcessClusters/locationByIndex/2.tab.txt
    ...
    # after running qluster on all the files from both extractions, everything is in place for processClusters
    cd ../filesForProcessClusters
    SeekDeep processClusters --fastq output.fastq --par pars.txt
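The elided steps above amount to repeating the qluster call for every extracted file with the location file of the matching index. A dry-run sketch of that loop follows; it only prints the commands. The function name `print_qluster_cmds` is hypothetical, and the extracted file names are assumed to follow the PFAMA1MID01.fastq pattern from the example.

```shell
# Dry run: print one qluster command per extracted file instead of running it.
# Usage: print_qluster_cmds INDEX FILE...   (function name is hypothetical)
print_qluster_cmds() {
  index="$1"
  shift
  # Location file for this index, relative to the extraction directory
  locFile="../filesForProcessClusters/locationByIndex/${index}.tab.txt"
  for fq in "$@"; do
    echo "SeekDeep qluster --fastq $fq --par pars.tab.txt --additionalOut $locFile"
  done
}

# Example: the first three MIDs extracted from input1.fastq (file names assumed)
print_qluster_cmds 1 PFAMA1MID01.fastq PFAMA1MID02.fastq PFAMA1MID03.fastq
```

Printing the commands first makes it easy to confirm that each extraction directory is paired with the right location file before anything is run.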