Downloading Example Data

wget http://baileylab.brown.edu/MIPWrangler/data/tutorial/tutorial.tar.gz
tar -zxvf tutorial.tar.gz 
cd tutorial

Extracting MIP extractions from Genome

Using the input MIP arm file and a directory of fasta files (normally genomes) you can extract expected sequence for the given set of mips. This is achieved by using MIPWrangler setUpViewMipsOnGenome. This program utilizes bowtie2 hits with the arms and extracts the sequence between the hits (So for the program to work bowtie2 needs to be installed).

First extract the genomes directory provided with this tutorial. This is a set of genomes from the MalariaGEN’s PF3k project and then the Pf3d7 genomes/gff procvided by PlasmoDB. The subset of Pf3k genomes used exclude three of the genomes due to errors in their assemblies.

tar -zxvf selected_pf3k_genomes.tar.gz

For the control data used for this tutorial uses a mixture of 4 lab strains (Pf3d7,PfDd2,PfHB3,Pf7G8) so we will only extract expected sequence for these strains.

MIPWrangler setUpViewMipsOnGenome

MIPWrangler setUpViewMipsOnGenome --inputDir selected_pf3k_genomes --masterDir ideelExtract --selectGenomes Pf3d7,PfDd2,PfHB3,Pf7G8 --mipArmsFnp 171030_nextseq_controls/ids/mipArms.txt --primaryGenome Pf3d7 --numThreads 6 --hqMismatches 6 &

Options

Required

--inputDir - The directory containing genomes and gff files, the directory needs to be set up with two different directories, 1) genomes and 2) info (which contains a directory called gff). Only the genomes directory is required but if the gff directory is present then this information will be added to the expected sequence extraction to see what genes are being hit.
--masterDir - A name of an output directory to create to hold the extractions
--mipArmsFnp - The mip file that contains the mip arm information, see below for an explanation of this file
--primaryGenome - The name of the genome that is the primary reference genome in the genomes directory, should be the name minus the .fasta extension (e.g. Pf3d7 for Pf3d7.fasta)

Optional

--selectGenomes - A comma separated list of genomes to extract from, the default is to extract from all files that end with “.fasta” in the genomes directory
--numThreads - The number of CPUs to utlize
--hqMismatches - The number of mismatches to allow in the arm sequence to allow for a “hit”

Ouput

MIPWrangler setUpViewMipsOnGenome will generate a large directory with several different files for information of extractions including expected sequence, number of hits, etc.

arms - A directory of fasta files of the arm sequences, this mostly used internally for the mapping
beds - A directory of bed files for location of hits for the mip targets, a file is created for possible extractions and then hits for both extension and ligation arms. There is a also a directory called perGenome which condenses hits for all extractions for each genome and then will also intersect with the GFF file if provided what genes these extractions hit.
fastas - A directory of fasta files of the expected sequence
logs - A directory of run logs
mapped - A directory of bams of the hits of the arms
tables - A directory of tables for information of the extractions, the file extractionCountsTable.tab.txt contain the number of times each target hit a genome and the file allTarInfo_Pf3d7.tab.txt contains information about how diverse the extractions are (number of unique sequences between the different genomes), whether there is possible length variation, if the extractions contain tandem repeats, etc.

Running MIPWrangler Analysis Pipeline

MIPWrangler mipSetupAndExtractByArm

First unzip the tutorial directory

tar -zxvf 171030_nextseq_controls

cd 171030_nextseq_controls 
nohup MIPWrangler mipSetupAndExtractByArm --masterDir analysis --dir fastq --mipSampleFile ids/allMipsSamplesNames.tab.txt --mipArmsFilename ids/mipArms.txt --numThreads 20 --mipServerNumber 1 --refDir ../ideelExtract/fastas/byFamily/ --runRest --samplesMeta ids/metaData.tab.txt &

Options

Required

--masterDir - A name of a directory in which all the downstream analysis will be conducted
--dir - The directory of input raw paired end Illumina data
--mipSampleFile - A file containing what mips and samples are in the current analysis
--mipArmsFilename - A file describing the mip set used in this analysis, needs to have arm information and barcode size, see below for description
--mipServerNumber - A number to indicate what you would like to call the final mip server (1 = mip1, 2 = mip2) this controls the naming of the directory in serverResources which will contain the final master data table and extraction information

Optional

--refDir - A directory that contains expected sequence to compare final results haplotypes to
--numThreads - The number of CPUs to utilize
--runRest - Run the rest of the downstream analysis pipeline, by default this command will just run the first step which is setting up the analysis directory and extracting from the raw sample fastq the mip sequence using the arms, the rest of the pipeline of barcode correcting, haplotype clustering, and population clustering plus filtering commands are put in a directory called scripts which will be run if this flag is set or can be ran at a later time.
--samplesMeta - A file that contains meta data about the samples, this needs to be a minimum of 2 columns, one column named sample and each additional column is a meta field, this information will be added to final results datatables.

mipArmsFilename

This file describes MIP probes and requires the below columns

mip_id - A unique identifier for this MIP probe
mip_family - A mip_family that a MIP probe could belong to, data for for each MIP probe that belong to a single family are combined
extension_arm - The extension arm sequence of the MIP probe (5-3)
ligation_arm - The ligation arm sequence of the MIP probe (5-3)
extension_barcode_length - The length of the molecular barcode associated with the extension arm (could be 0 indicating no molecular barcode assoicated with this arm)
ligation_barcode_length - The length of the molecular barcode associated with the ligation arm (could be 0 indicating no molecular barcode assoicated with this arm)
gene_name - Name for a group of mip_family, could be used to indicate MIPs that capture a similar overlaping region
mipset - A name to organize a series of MIPs that might be commonly used together in an analysis

mipSampleFile

This file will have two columns, 1) mips and 2) samples which can appear in either order. The mips columns will list all the mips that are used in this analysis run, for each mip named in this column an entry describing this mip must be found in --mipArmsFilename file, the name will match the column mip_family (ask Ozkan Aydemir why there is a mip_target and mip_family designation). The samples column names the samples used from the input raw data directory named by the --dir flag. The way this naming scheme works is the sample name for files is determined by taking everything before the first underscore in the filename (e.g. the sample name for D6-JJJ-1_S91_R1_001.fastq.gz is D6-JJJ-1)

Output

The analysis directory is large and is current redundant which will probably change in the future but while the program is under development the directory structure shall remain as is.

There will be a directory for each sample which contains results of extraction by arm sequence, barcode correction, and clustering.

logs - A directory of logs of the running of the analysis
populationClustering - A directory that contains the cross sample clustering for each mip target
resources - Resources copied into the analysis directory from the input, (–mipArmsFilename, –mipSampleFile, –samplesMeta, etc)
scripts - A directory of executable scripts that will run the rest of the analysis
serverResources - A directory containing final data, will be named depending on given mip server number given above (a directory named mip1 will exist if –mipServerNumber 1, mip2 for –mipServerNumber 2, etc )

serverResources

There will be two directories in the mip server directory, 1) extractionInfo 2) popClusInfo

extractionInfo

allExtractInfoByTarget.tab.txt - The extraction information for all samples per target including filtering stats
allExtractInfoSummary.tab.txt - A summary of the extraction for each sample
allStitchInfoByTarget.tab.txt - A summary of the stitching of the extracted targets for all samples

allExtractInfoByTarget.tab.txt

This table combines the extraction stats of per target per sample so the performance of each mip target across samples can be assessed. First the reads are matched by extension arm and then the paried reads are stitched together and filtered on several other filtering criteria.

Sample - The name of the sample
mipTarget - The mip target
mipFamily - The mip family the target belongs to
totalMatched - The total number of reads that matched the extension arm
goodReads - The total number of reads that past all filters
failedLigationArm - The number of reads that failed to match the ligation arm after matching the extension arm
failedMinLen(<150) - The number of reads that failed the min length filter
failed_q30<0.75 - The number of reads that failed the quality score filtering
containsNs - The number of reads that contained N’s
badStitch - The number of reads that failed to stitch

allExtractInfoSummary.tab.txt

This file contains similar columns to the table describe above but is summarized over all targets for each samples and then also contains several other values listed below.

Sample - Name of the sample
total - The total number of raw reads for a sample
unmatched - The total number of reads that failed to match any arms
indeterminate - The total number of reads that matched too closely several mips and couldn’t be determined to be either one
smallFragment - The total number of small fragment reads (this is mostly left over from when things were first designed on other platforms other than Illumina)
totalMatched - The total number out of the total raw reads that were matched to a mip target.

allStitchInfoByTarget.tab.txt

The file below contains how the paired end read stitching went for each target per sample

Sample - The name of the sample
mipTarget - The mip target
mipFamily - The mip family the target belongs to
total - The total number attempted to stitch
r1EndsInR2 - The total number of reads that stitched where the r1 reads ends in the r2 read (target length < than the paired end length x 2, so no read through)
r1BeginsInR2 - Read through reads, most likely artifact
OverlapFail - Reads that had too much error in overlap and couldn’t stitch or a overlap couldn’t be found
PerfectOverlap - Reads that perfectly overlap, an unlikely scernario

popClusInfo

allInfo.tab.txt.gz - A master table that contains virtually all of the final results of the pipeline

popClusInfo header information
Each line in this table represent a haplotype for a given target within a given sample which basically represents the smallest divisible data unit for the results of the MIPWrangler analysis pipleine. Due to this structure a lot of the information in this table is redundant (for haplotypes coming from the same sample, the columns containing the data for the sample will be repeated for each haplotype).

Some of these column names are left over from other projects and may change or might not make much sense.

The term “cluster” is used often to refer haplotypes within a sample for a given target with the philosphy that the pipeline clusters together reads for a sample into possible haplotypes and don’t become haplotypes until final filtering and population comparison.

The term “reads” is used to refer to the raw sequence reads from the sequencer.

The term “barcode” below is used to refer to molecular barcodes that are incorporated into the MIP targets that are used to tag each individual MIP capture.

The columns are named so that columns that begin with a certain prefix refer to the same category of data

Columns that begin with:

p_ - Refer to specific target
h_ - Refer to a population haplotype for a specific target
s_ - Refer to sample
c_ - Refer to information for a haplotype within a sample

Columns

s_Sample - The name of the sample the haplotype is from
p_geneName - The group a specific mip target belongs to, mips were originally designed just for genes which is why it is named this
p_targetName - The name of the MIP target
p_sampleTotal - The number of samples that have data for a specific MIP target
p_totalInputClusters - The total number of input haplotypes from all samples
p_readTotal - The total number of reads that contribute to this target
p_barcodeTotal - The total number of barcodes that contribute to this target
p_finalHaplotypeNumber - The final number of unique haplotypes found for this target
h_popUID - The population identifier for this haplotype for this target so it can be compared against samples
h_mipPopUID - Similar to the h_popUID but this column has no name restrictions (this was added due to downstream analysis pipelines)
h_sampleCnt - The total number of samples this haplotype appears in
h_sampleFrac - The fraction of samples out of total samples this haplotype appears in (1 would mean it appears in all samples)
h_medianBarcodeFrac - The median number of barcodes this population haplotype had across samples
h_meanBarcodeFrac - The median number of barcodes this population haplotype had across samples
h_readCnt - The number of reads across all samples this population haplotype has
h_readFrac - The fraction of total reads this population haplotype has out of all reads found for this target
h_barcodeCnt - The total number of barcodes that contribute to this population haplotype across all samples
h_barcodeFrac - The fraction of the total barcodes for this target that this population haplotype has
h_inputNames - The names of all haplotypes from the samples that match this population haplotype
h_seq - The DNA sequence of this population haplotype
h_qual - The per base quality scores for h_seq encoded in fastq quality scores
s_sName - The name of the sample the haplotype is from
s_usedTotalClusterCnt - The total number of clusters that this sample (s_sName) has for this target (p_targetName)
s_usedTotalReadCnt - The total number of reads that this sample (s_sName) has for this target (p_targetName)
s_usedTotalBarcodeCnt - The total number of barcodes that this sample (s_sName) has for this target (p_targetName)
s_inputTotalClusterCnt - The total number of clusters that this sample (s_sName) had before final filtering for this target (p_targetName)
s_inputTotalReadCnt - The total number of reads that this sample (s_sName) had before final filtering for this target (p_targetName)
s_inputTotalBarcodeCnt - The total number of barcodes that this sample (s_sName) had before final filtering for this target (p_targetName)
s_chiTotalClusterCnt - The total number of clusters lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
s_chiTotalReadCnt - The total number of reads lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
s_chiTotalBarcodeCnt - The total number of barcodes lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
s_lowFreqTotalClusterCnt - The total number of clusters lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
s_lowFreqTotalReadCnt - The total number of reads lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
s_lowFreqTotalBarcodeCnt - The total number of barcodes lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
c_clusterID - The identifier for this cluster in this sample (s_sName) in this target (p_targetName), this will always be a number and starts at 0, the lower the number the more abundant the cluster is in the sample (s_sName)
c_name - A name for the cluster, this will have meta data incorporated into it for barcode and read counts etc.
c_readCnt - The total number of reads for this cluster (c_name)
c_readFrac - The fraction of reads for this cluster (c_name) out of the total reads for this sample (s_sName) for this target (p_targetName)
c_barcodeCnt - The total number of barcodes for this cluster (c_name)
c_barcodeFrac - The fraction of barcodes for this cluster (c_name) out of the total barcodes for this sample (s_sName) for this target (p_targetName)
c_seq - The DNA sequence of this cluster
c_qual - The per base quality scores for c_seq encoded in fastq quality scores
c_length - The length of the DNA sequence c_seq
c_bestExpected - If comparing to expected sequence, the best matching supplied sequence

Running MIPWrangler

Nicholas Hathaway

Downloading Example Data

Extracting MIP extractions from Genome

MIPWrangler setUpViewMipsOnGenome

Options

Required

Optional

Ouput

Running MIPWrangler Analysis Pipeline

MIPWrangler mipSetupAndExtractByArm

Options

Required

Optional

mipArmsFilename

mipSampleFile

Output

serverResources

allExtractInfoByTarget.tab.txt

allExtractInfoSummary.tab.txt

allStitchInfoByTarget.tab.txt