Downloading Example Data
wget http://baileylab.brown.edu/MIPWrangler/data/tutorial/tutorial.tar.gz
tar -zxvf tutorial.tar.gz
cd tutorial
Running MIPWrangler Analysis Pipeline
MIPWrangler mipSetupAndExtractByArm
First unzip the tutorial directory
tar -zxvf 171030_nextseq_controls
cd 171030_nextseq_controls
nohup MIPWrangler mipSetupAndExtractByArm --masterDir analysis --dir fastq --mipSampleFile ids/allMipsSamplesNames.tab.txt --mipArmsFilename ids/mipArms.txt --numThreads 20 --mipServerNumber 1 --refDir ../ideelExtract/fastas/byFamily/ --runRest --samplesMeta ids/metaData.tab.txt &
Options
Required
- --masterDir - A name of a directory in which all the downstream analysis will be conducted
- --dir - The directory of input raw paired end Illumina data
- --mipSampleFile - A file containing what mips and samples are in the current analysis
- --mipArmsFilename - A file describing the mip set used in this analysis. It needs to have arm information and barcode size; see below for a description
- --mipServerNumber - A number indicating what you would like to call the final mip server (1 = mip1, 2 = mip2). This controls the name of the directory in serverResources that will contain the final master data table and extraction information
Optional
- --refDir - A directory that contains expected sequences against which the final haplotypes are compared
- --numThreads - The number of CPUs to utilize
- --runRest - Run the rest of the downstream analysis pipeline. By default this command runs only the first step: setting up the analysis directory and extracting the mip sequences from the raw sample fastq files using the arms. The commands for the rest of the pipeline (barcode correction, haplotype clustering, and population clustering plus filtering) are written to a directory called scripts; they are run immediately if this flag is set, or can be run at a later time.
- --samplesMeta - A file that contains metadata about the samples. It needs at least 2 columns: one column named sample, with each additional column being a meta field. This information will be added to the final results data tables.
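A quick check of the metadata file before launching can catch a missing sample column early. The sketch below assumes the file is tab-delimited with its header on the first line (suggested by the .tab.txt name, but not stated above); check_samples_meta is a hypothetical helper and not part of MIPWrangler.

```shell
# Hypothetical helper, not part of MIPWrangler.
# Assumes a tab-delimited file whose first line is the header.
check_samples_meta() {
  awk -F'\t' '
    NR == 1 {
      # look for a column named exactly "sample" and remember the column count
      for (i = 1; i <= NF; i++) if ($i == "sample") found = 1
      ncol = NF
    }
    END {
      if (!found)   { print "missing column: sample"; exit 1 }
      if (ncol < 2) { print "need at least 2 columns"; exit 1 }
      print "ok"
    }
  ' "$1"
}
```

For the tutorial data this could be called as check_samples_meta ids/metaData.tab.txt; it prints ok and returns 0 when both conditions hold, and returns 1 otherwise.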
mipArmsFilename
This file describes MIP probes and requires the below columns
- mip_id - A unique identifier for this MIP probe
- mip_family - The family that a MIP probe belongs to; data for all MIP probes that belong to a single family are combined
- extension_arm - The extension arm sequence of the MIP probe (5'-3')
- ligation_arm - The ligation arm sequence of the MIP probe (5'-3')
- extension_barcode_length - The length of the molecular barcode associated with the extension arm (could be 0, indicating no molecular barcode associated with this arm)
- ligation_barcode_length - The length of the molecular barcode associated with the ligation arm (could be 0, indicating no molecular barcode associated with this arm)
- gene_name - Name for a group of mip_family values; could be used to indicate MIPs that capture a similar overlapping region
- mipset - A name to organize a series of MIPs that might be commonly used together in an analysis
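The required columns listed above can be checked against a file's header before a run. This is a minimal sketch that assumes the file is tab-delimited with the column names on its first line; missing_mip_arm_columns is a hypothetical helper, not a MIPWrangler command.

```shell
# Hypothetical helper, not part of MIPWrangler.
# Assumes a tab-delimited file with the column names on the first line.
# Prints each required column that is absent, one per line (nothing if all are present).
missing_mip_arm_columns() {
  required="mip_id mip_family extension_arm ligation_arm extension_barcode_length ligation_barcode_length gene_name mipset"
  header=$(head -n 1 "$1" | tr '\t' '\n')   # one column name per line
  for col in $required; do
    # -x: whole-line match, -F: literal string, -q: no output
    printf '%s\n' "$header" | grep -qxF -- "$col" || printf '%s\n' "$col"
  done
}
```

For the tutorial data this could be called as missing_mip_arm_columns ids/mipArms.txt; empty output means every required column was found.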
mipSampleFile
This file has two columns, mips and samples, which can appear in either order. The mips column lists all the mips used in this analysis run. For each mip named in this column, an entry describing it must be found in the --mipArmsFilename file; the name is matched against the mip_family column (ask Ozkan Aydemir why there is both a mip_target and a mip_family designation). The samples column names the samples used from the input raw data directory given by the --dir flag. The sample name for a file is everything before the first underscore in the filename (e.g. the sample name for D6-JJJ-1_S91_R1_001.fastq.gz is D6-JJJ-1).
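The naming rule can be reproduced in the shell to preview which sample names a set of fastq files will map to. The function name sample_name_from_fastq is hypothetical and only illustrates the rule; it is not how MIPWrangler itself is invoked.

```shell
# Hypothetical helper illustrating the sample naming rule described above.
sample_name_from_fastq() {
  file=$(basename "$1")          # drop any leading directory
  printf '%s\n' "${file%%_*}"    # keep everything before the first underscore
}

sample_name_from_fastq fastq/D6-JJJ-1_S91_R1_001.fastq.gz   # prints D6-JJJ-1
```

Because the name is cut at the first underscore, sample names that themselves contain an underscore would be truncated under this rule.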
Output
The analysis directory is large and currently contains redundant data. This will probably change in the future, but while the program is under development the directory structure will remain as is.
There will be a directory for each sample which contains results of extraction by arm sequence, barcode correction, and clustering.
- logs - A directory of logs of the running of the analysis
- populationClustering - A directory that contains the cross sample clustering for each mip target
- resources - Resources copied into the analysis directory from the input (--mipArmsFilename, --mipSampleFile, --samplesMeta, etc.)
- scripts - A directory of executable scripts that will run the rest of the analysis
- serverResources - A directory containing the final data; its subdirectory is named according to the mip server number given above (a directory named mip1 will exist if --mipServerNumber 1, mip2 for --mipServerNumber 2, etc.)
serverResources
There will be two directories in the mip server directory: 1) extractionInfo and 2) popClusInfo
extractionInfo
- allExtractInfoByTarget.tab.txt - The extraction information for all samples per target including filtering stats
- allExtractInfoSummary.tab.txt - A summary of the extraction for each sample
- allStitchInfoByTarget.tab.txt - A summary of the stitching of the extracted targets for all samples
allStitchInfoByTarget.tab.txt
This file reports how the paired-end read stitching went for each target per sample
- Sample - The name of the sample
- mipTarget - The mip target
- mipFamily - The mip family the target belongs to
- total - The total number of reads for which stitching was attempted
- r1EndsInR2 - The total number of reads that stitched where the r1 read ends in the r2 read (target length is less than the paired-end length x 2, so no read-through)
- r1BeginsInR2 - Read-through reads, most likely an artifact
- OverlapFail - Reads that had too much error in the overlap to stitch, or for which an overlap couldn’t be found
- PerfectOverlap - Reads that perfectly overlap, an unlikely scenario
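A common first look at this table is the fraction of reads per target that failed to stitch. The sketch below assumes the file is tab-delimited with a header line using the column names listed above; overlap_fail_fraction and the output column name overlapFailFrac are hypothetical.

```shell
# Hypothetical helper, not part of MIPWrangler.
# Assumes a tab-delimited table with a header line using the column names listed above.
overlap_fail_fraction() {
  # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale
  LC_ALL=C awk -F'\t' -v OFS='\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map column name -> position
      print "Sample", "mipTarget", "overlapFailFrac"
      next
    }
    $col["total"] > 0 {                        # skip rows with nothing to stitch
      printf "%s\t%s\t%.3f\n", $col["Sample"], $col["mipTarget"], $col["OverlapFail"] / $col["total"]
    }
  ' "$1"
}
```

Looking columns up by name rather than by position keeps the sketch working if the column order differs from the order documented here.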
popClusInfo
- allInfo.tab.txt.gz - A master table that contains virtually all of the final results of the pipeline
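Putting the directory descriptions above together, the location of the master table can be built from the values given to --masterDir and --mipServerNumber. The layout used here is inferred from this section rather than confirmed against a real run, and master_table_path is a hypothetical helper.

```shell
# Hypothetical helper; the path layout is inferred from the descriptions in this section.
master_table_path() {
  master_dir=$1      # value given to --masterDir
  server_number=$2   # value given to --mipServerNumber
  printf '%s/serverResources/mip%s/popClusInfo/allInfo.tab.txt.gz\n' "$master_dir" "$server_number"
}

master_table_path analysis 1
# prints analysis/serverResources/mip1/popClusInfo/allInfo.tab.txt.gz
```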
popClusInfo header information
Each line in this table represents a haplotype for a given target within a given sample, which is the smallest divisible data unit for the results of the MIPWrangler analysis pipeline. Due to this structure, a lot of the information in this table is redundant (for haplotypes coming from the same sample, the columns containing the data for the sample are repeated for each haplotype).
Some of these column names are left over from other projects and may change or might not make much sense.
The term “cluster” is often used to refer to haplotypes within a sample for a given target, with the philosophy that the pipeline clusters together reads for a sample into possible haplotypes, which do not become haplotypes until final filtering and population comparison.
The term “reads” is used to refer to the raw sequence reads from the sequencer.
The term “barcode” below is used to refer to molecular barcodes that are incorporated into the MIP targets that are used to tag each individual MIP capture.
The columns are named so that columns that begin with a certain prefix refer to the same category of data
Columns that begin with:
- p_ - Refer to a specific target
- h_ - Refer to a population haplotype for a specific target
- s_ - Refer to a sample
- c_ - Refer to information for a haplotype within a sample
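The prefix convention makes it easy to see how a table's columns split across the four categories. This sketch assumes a tab-delimited header line on standard input; count_columns_by_prefix is a hypothetical helper.

```shell
# Hypothetical helper, not part of MIPWrangler.
# Reads a tab-delimited header line on stdin and counts the columns per prefix
# (the text before the first underscore in each column name).
count_columns_by_prefix() {
  head -n 1 | tr '\t' '\n' | cut -d_ -f1 | sort | uniq -c | awk '{ print $2 "_", $1 }'
}

# Example using a few of the column names documented in this section:
printf 's_Sample\tp_geneName\tp_targetName\th_popUID\tc_seq\n' | count_columns_by_prefix
```

The output has one line per prefix followed by its column count. On a real table the header could be supplied with gzip -dc allInfo.tab.txt.gz | count_columns_by_prefix.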
Columns
- s_Sample - The name of the sample the haplotype is from
- p_geneName - The group a specific mip target belongs to; mips were originally designed just for genes, which is why the column has this name
- p_targetName - The name of the MIP target
- p_sampleTotal - The number of samples that have data for a specific MIP target
- p_totalInputClusters - The total number of input haplotypes from all samples
- p_readTotal - The total number of reads that contribute to this target
- p_barcodeTotal - The total number of barcodes that contribute to this target
- p_finalHaplotypeNumber - The final number of unique haplotypes found for this target
- h_popUID - The population identifier for this haplotype for this target, so that it can be compared across samples
- h_mipPopUID - Similar to the h_popUID but this column has no name restrictions (this was added due to downstream analysis pipelines)
- h_sampleCnt - The total number of samples this haplotype appears in
- h_sampleFrac - The fraction of samples out of total samples this haplotype appears in (1 would mean it appears in all samples)
- h_medianBarcodeFrac - The median number of barcodes this population haplotype had across samples
- h_meanBarcodeFrac - The mean number of barcodes this population haplotype had across samples
- h_readCnt - The number of reads across all samples this population haplotype has
- h_readFrac - The fraction of total reads this population haplotype has out of all reads found for this target
- h_barcodeCnt - The total number of barcodes that contribute to this population haplotype across all samples
- h_barcodeFrac - The fraction of the total barcodes for this target that this population haplotype has
- h_inputNames - The names of all haplotypes from the samples that match this population haplotype
- h_seq - The DNA sequence of this population haplotype
- h_qual - The per base quality scores for h_seq encoded in fastq quality scores
- s_sName - The name of the sample the haplotype is from
- s_usedTotalClusterCnt - The total number of clusters that this sample (s_sName) has for this target (p_targetName)
- s_usedTotalReadCnt - The total number of reads that this sample (s_sName) has for this target (p_targetName)
- s_usedTotalBarcodeCnt - The total number of barcodes that this sample (s_sName) has for this target (p_targetName)
- s_inputTotalClusterCnt - The total number of clusters that this sample (s_sName) had before final filtering for this target (p_targetName)
- s_inputTotalReadCnt - The total number of reads that this sample (s_sName) had before final filtering for this target (p_targetName)
- s_inputTotalBarcodeCnt - The total number of barcodes that this sample (s_sName) had before final filtering for this target (p_targetName)
- s_chiTotalClusterCnt - The total number of clusters lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
- s_chiTotalReadCnt - The total number of reads lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
- s_chiTotalBarcodeCnt - The total number of barcodes lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
- s_lowFreqTotalClusterCnt - The total number of clusters lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
- s_lowFreqTotalReadCnt - The total number of reads lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
- s_lowFreqTotalBarcodeCnt - The total number of barcodes lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
- c_clusterID - The identifier for this cluster in this sample (s_sName) in this target (p_targetName). This is always a number and starts at 0; the lower the number, the more abundant the cluster is in the sample (s_sName)
- c_name - A name for the cluster, this will have meta data incorporated into it for barcode and read counts etc.
- c_readCnt - The total number of reads for this cluster (c_name)
- c_readFrac - The fraction of reads for this cluster (c_name) out of the total reads for this sample (s_sName) for this target (p_targetName)
- c_barcodeCnt - The total number of barcodes for this cluster (c_name)
- c_barcodeFrac - The fraction of barcodes for this cluster (c_name) out of the total barcodes for this sample (s_sName) for this target (p_targetName)
- c_seq - The DNA sequence of this cluster
- c_qual - The per base quality scores for c_seq encoded in fastq quality scores
- c_length - The length of the DNA sequence c_seq
- c_bestExpected - If comparing to expected sequence, the best matching supplied sequence
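Since the master table is wide, pulling a few named columns for one target is often the first step in downstream work. The sketch below assumes allInfo.tab.txt.gz is a gzip-compressed, tab-delimited table with a header line; haplotypes_for_target is a hypothetical helper, and the four columns it prints are only one possible selection.

```shell
# Hypothetical helper, not part of MIPWrangler.
# Assumes allInfo.tab.txt.gz is a gzip-compressed, tab-delimited table with a header line.
haplotypes_for_target() {
  table=$1    # path to allInfo.tab.txt.gz
  target=$2   # value to match in the p_targetName column
  gzip -dc "$table" | awk -F'\t' -v OFS='\t' -v target="$target" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map column name -> position
      print "s_Sample", "h_popUID", "c_barcodeCnt", "c_barcodeFrac"
      next
    }
    $col["p_targetName"] == target {          # keep only rows for the requested target
      print $col["s_Sample"], $col["h_popUID"], $col["c_barcodeCnt"], $col["c_barcodeFrac"]
    }
  '
}
```

Each output row is one haplotype within one sample, matching the one-line-per-haplotype structure of the table.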