makeSampleDirectories’s Purpose

The intent of makeSampleDirectories is to set up the directory tree needed by SeekDeep processClusters and to make it easier to place output files into that tree. SeekDeep makeSampleDirectories requires two arguments, given by the --file and --dout flags. The --file flag supplies the set up file (explained below) and the --dout flag names the directory that will be created (it will not overwrite an existing directory). It has mostly been developed for use with multiplexed data.

Getting usage from the command line

Typing just the name of the program prints a help message on how to run it.

Code

SeekDeep makeSampleDirectories

makeSampleDirectoires 
Set up a directory tree for processClusters 
Commands, order not necessary, flags are case insensitive 
Required commands 
--file [option], name of the file of sample names to read in 
--dout [option], name of the main directory to create 
File should be tab delimited and a few examples are below 
File should have at least three columns 
Where first column is the name of the index or sff file used, second column is 
    the sample names, and all following columns are the MIDs for that samples 
    samples 
Example with two replicates and two separate master indexes
1   090-00  MID01   MID02
1   090-24  MID03   MID04
1   090-48  MID05   MID06
...
...

Calling the program with --help does the same.

Code

SeekDeep makeSampleDirectories --help

All flags in SeekDeep are case insensitive, so all of the following give the same result.

Code

SeekDeep makeSampleDirectories --help 
SeekDeep MakeSampleDirectories --HELP
SeekDeep makeSampleDirectories --HeLP
SeekDeep MAKESAMPLEDIRECTORIES --HeLp
SeekDeep makeSampleDirectories --HeLP
SeekDeep makesampledirectories --HeLp
SeekDeep makesampleDirectories --HeLp

Format of set up file

The set up file contains at least three tab-delimited columns. The first column is an identifier for the sequence file that contains the sample's sequence data. The second column is the name of the sample. The third column is the name of the MID for that sample, and each additional column is the MID of another replicate for that sample in that file. Any line that starts with a # or is blank is ignored.
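The column rules above can be checked before running makeSampleDirectories. The following is a minimal sketch, not part of SeekDeep: the function name check_setup_file and the demo file contents are made up for illustration, and it assumes a "blank line" means a completely empty line.

```shell
# Hypothetical helper (not a SeekDeep command): report lines of a set up file
# that have fewer than three tab-separated columns.
# Lines starting with # and empty lines are skipped, as described above.
check_setup_file() {
  awk -F '\t' '
    /^#/ || /^$/ { next }   # ignored lines
    NF < 3 { printf "line %d: only %d column(s)\n", NR, NF; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Small made-up set up file for illustration only.
demo_dir=$(mktemp -d)
printf '#index\tsample\trep1\trep2\n'  > "$demo_dir/samples.tab.txt"
printf '1\t090-00\tMID01\tMID02\n'    >> "$demo_dir/samples.tab.txt"
printf '\n'                           >> "$demo_dir/samples.tab.txt"
printf '1\t090-24\tMID03\n'           >> "$demo_dir/samples.tab.txt"
printf '2\t090-48\n'                  >> "$demo_dir/samples.tab.txt"

# The last line has only two columns, so the check fails and reports it.
check_setup_file "$demo_dir/samples.tab.txt" || echo "set up file needs fixing"
```

The check exits non-zero when any line is too short, so it can gate a later call to makeSampleDirectories in a script.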

Dual replicates

Below is an example for a sequencing experiment that included 20 samples, each with 2 PCR replicates, where the sequencing itself was done on two lanes or cells (e.g. two Illumina lanes or two different Ion Torrent runs).

Code

cat exampleSampleNames.tab.txt

Code

SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters

Single replicates

Below is an example for a sequencing experiment that included 20 samples, each with only 1 PCR replicate, where the sequencing itself was done on two lanes or cells (e.g. two Illumina lanes or two different Ion Torrent runs).

Code

cat exampleSampleNamesSingleReps.tab.txt

Code

SeekDeep makeSampleDirectories --file exampleSampleNamesSingleReps.tab.txt --dout filesForProcessClusters

Mixed replicates

You might have an experiment like the one with dual replicates but with only one replicate for a couple of samples, perhaps due to poor amplification or some other reason. This is handled simply by mixing the two formats above.

Code

cat exampleSampleNamesMixed.tab.txt

Code

SeekDeep makeSampleDirectories --file exampleSampleNamesMixed.tab.txt --dout filesForProcessClusters
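When replicate counts differ between samples, it can help to list how many replicate columns each sample has before building the tree. This is a rough sketch under stated assumptions: count_replicates is a hypothetical helper, not a SeekDeep command, and the two-line file is made up rather than taken from exampleSampleNamesMixed.tab.txt.

```shell
# Hypothetical helper: print each sample name with its number of replicates.
# Columns: 1 = sequence file identifier, 2 = sample name, 3 and up = MIDs.
count_replicates() {
  awk -F '\t' '
    /^#/ || /^$/ { next }                       # skip comment and empty lines
    NF >= 3 { printf "%s\t%d\n", $2, NF - 2 }   # replicates = columns after the second
  ' "$1"
}

# Made-up mixed set up file: one sample with two replicates, one with one.
mixed_dir=$(mktemp -d)
printf '1\t090-00\tMID01\tMID02\n'  > "$mixed_dir/mixed.tab.txt"
printf '1\t090-24\tMID03\n'        >> "$mixed_dir/mixed.tab.txt"

count_replicates "$mixed_dir/mixed.tab.txt"
```

For this made-up file the listing shows 2 replicates for 090-00 and 1 for 090-24.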

Output directory name

The output directory is named by the --dout flag. An existing directory is never overwritten. The --dout flag also interprets the word TODAY, in all caps, as a placeholder for the current date and time.

Code

SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
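To predict the resulting directory name in a script, the substitution can be imitated in the shell. This is only a sketch: expand_today is a hypothetical helper, and the date format is an assumption inferred from directory names such as filesForProcessClusters_2015-06-14_12.22 used on this page, not something read from the SeekDeep source.

```shell
# Hypothetical imitation of the TODAY substitution (not SeekDeep code).
# Assumed stamp format: year-month-day_hour.minute, e.g. 2015-06-14_12.22.
expand_today() {
  stamp=$(date +%Y-%m-%d_%H.%M)
  printf '%s\n' "$1" | sed "s/TODAY/$stamp/"
}

# Prints the name with TODAY replaced by the current date and time.
expand_today filesForProcessClusters_TODAY
```

A name without the word TODAY passes through unchanged.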

Output files

A master directory is built using the name given by --dout. Inside it is the directory tree needed by SeekDeep processClusters: a top directory containing one directory per sample in the analysis, with each sample directory containing all replicate directories for that sample.

Code

SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
cd filesForProcessClusters_2015-06-14_12.22
tree

Location file for qluster

Another directory, called locationByIndex, contains files intended for use with SeekDeep qluster to help direct its output into this directory tree. This is done by giving these files to qluster with the --additionalOut flag; see the qluster usage page for details on this flag. The idea is that each input sequence file contains several MIDs, and once SeekDeep extractor has been run, the reads for each MID are in separate files. The above example had two input sequence files. When running qluster on files extracted from the first input file, give it the location file for that input file.
Here are the files from the above example

Code

SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
ls filesForProcessClusters_2015-06-14_12.22/locationByIndex

Each file contains two columns: the first is the MID name and the second is the location where the qluster output for that MID should go.

Code

SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters_TODAY
cat filesForProcessClusters_2015-06-14_12.22/locationByIndex/1.tab.txt
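Because a location file is just a two-column, tab-delimited table, the destination for a single MID can be looked up with awk. The sketch below assumes exactly that layout; location_for_mid is a hypothetical helper, and the file contents use placeholder paths rather than real makeSampleDirectories output.

```shell
# Hypothetical lookup (not a SeekDeep command): print the output location
# recorded for one MID in a location file (column 1 = MID, column 2 = path).
location_for_mid() {
  awk -F '\t' -v mid="$2" '$1 == mid { print $2 }' "$1"
}

# Made-up location file; the paths are placeholders for illustration.
loc_dir=$(mktemp -d)
printf 'MID01\t/path/to/dirForMID01\n'  > "$loc_dir/1.tab.txt"
printf 'MID02\t/path/to/dirForMID02\n' >> "$loc_dir/1.tab.txt"

location_for_mid "$loc_dir/1.tab.txt" MID02
```

A MID that is not listed in the file produces no output, which is a quick way to spot a mismatch between the set up file and the extracted files.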

Now, when running qluster on the results of extracting the input files, give it the appropriate location file. The extraction here is done with the following id file.

Code

cat pfama1_ids.tab.txt

gene    forward reverse
PFAMA1  CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
id  barcode
MID01   ACGAGTGCGT
MID02   ACGCTCGACA
MID03   AGACGCACTC
MID04   AGCACTGTAG
MID05   ATCAGACACG
MID06   ATATCGCGAG
MID07   CGTGTCTCTA
...
...

Code

SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters
SeekDeep extractor --fastq input1.fastq --id ids.txt --dout extraction1
cd extraction1
SeekDeep qluster --fastq PFAMA1MID01.fastq --par pars.tab.txt --additionalOut ../filesForProcessClusters/locationByIndex/1.tab.txt
...
cd ..
SeekDeep extractor --fastq input2.fastq --id ids.txt --dout extraction2
cd extraction2
SeekDeep qluster --fastq PFAMA1MID01.fastq --par pars.tab.txt --additionalOut ../filesForProcessClusters/locationByIndex/2.tab.txt
...
# after running qluster on all the extracted files, everything is in place for processClusters
cd ../filesForProcessClusters
SeekDeep processClusters --fastq output.fastq --par pars.txt
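The "..." lines above stand for one qluster call per extracted file. One way to avoid typing them by hand is to generate the commands in a loop. The sketch below is a dry run that only prints the commands, using the same flags as the example above; print_qluster_cmds is a hypothetical helper and the fastq names passed to it are placeholders.

```shell
# Dry run (hypothetical helper): print one qluster command per extracted
# fastq file instead of running it. First argument is the location file,
# the remaining arguments are the extracted fastq files.
print_qluster_cmds() {
  loc_file=$1
  shift
  for fq in "$@"; do
    echo "SeekDeep qluster --fastq $fq --par pars.tab.txt --additionalOut $loc_file"
  done
}

# Placeholder file names for illustration.
print_qluster_cmds ../filesForProcessClusters/locationByIndex/1.tab.txt \
  PFAMA1MID01.fastq PFAMA1MID02.fastq
```

Once the printed commands look right, they can be piped to sh from inside the extraction directory, or the echo can be dropped so the loop runs qluster directly.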
