Downloading Example Data

wget http://baileylab.brown.edu/MIPWrangler/data/tutorial/tutorial.tar.gz
tar -zxvf tutorial.tar.gz 
cd tutorial

Extracting MIP extractions from Genome

Using the input MIP arm file and a directory of fasta files (normally genomes) you can extract expected sequence for the given set of mips. This is achieved by using MIPWrangler setUpViewMipsOnGenome. This program utilizes bowtie2 hits with the arms and extracts the sequence between the hits (So for the program to work bowtie2 needs to be installed).

First extract the genomes directory provided with this tutorial. This is a set of genomes from the MalariaGEN’s PF3k project and then the Pf3d7 genomes/gff procvided by PlasmoDB. The subset of Pf3k genomes used exclude three of the genomes due to errors in their assemblies.

tar -zxvf selected_pf3k_genomes.tar.gz

For the control data used for this tutorial uses a mixture of 4 lab strains (Pf3d7,PfDd2,PfHB3,Pf7G8) so we will only extract expected sequence for these strains.

MIPWrangler setUpViewMipsOnGenome

MIPWrangler setUpViewMipsOnGenome --inputDir selected_pf3k_genomes --masterDir ideelExtract --selectGenomes Pf3d7,PfDd2,PfHB3,Pf7G8 --mipArmsFnp 171030_nextseq_controls/ids/mipArms.txt --primaryGenome Pf3d7 --numThreads 6 --hqMismatches 6 & 

Options

Required

  • --inputDir - The directory containing genomes and gff files, the directory needs to be set up with two different directories, 1) genomes and 2) info (which contains a directory called gff). Only the genomes directory is required but if the gff directory is present then this information will be added to the expected sequence extraction to see what genes are being hit.
  • --masterDir - A name of an output directory to create to hold the extractions
  • --mipArmsFnp - The mip file that contains the mip arm information, see below for an explanation of this file
  • --primaryGenome - The name of the genome that is the primary reference genome in the genomes directory, should be the name minus the .fasta extension (e.g. Pf3d7 for Pf3d7.fasta)

Optional

  • --selectGenomes - A comma separated list of genomes to extract from, the default is to extract from all files that end with “.fasta” in the genomes directory
  • --numThreads - The number of CPUs to utlize
  • --hqMismatches - The number of mismatches to allow in the arm sequence to allow for a “hit”

Ouput

MIPWrangler setUpViewMipsOnGenome will generate a large directory with several different files for information of extractions including expected sequence, number of hits, etc.

  • arms - A directory of fasta files of the arm sequences, this mostly used internally for the mapping
  • beds - A directory of bed files for location of hits for the mip targets, a file is created for possible extractions and then hits for both extension and ligation arms. There is a also a directory called perGenome which condenses hits for all extractions for each genome and then will also intersect with the GFF file if provided what genes these extractions hit.
  • fastas - A directory of fasta files of the expected sequence
  • logs - A directory of run logs
  • mapped - A directory of bams of the hits of the arms
  • tables - A directory of tables for information of the extractions, the file extractionCountsTable.tab.txt contain the number of times each target hit a genome and the file allTarInfo_Pf3d7.tab.txt contains information about how diverse the extractions are (number of unique sequences between the different genomes), whether there is possible length variation, if the extractions contain tandem repeats, etc.

Running MIPWrangler Analysis Pipeline

MIPWrangler mipSetupAndExtractByArm

First unzip the tutorial directory

tar -zxvf 171030_nextseq_controls  
cd 171030_nextseq_controls 
nohup MIPWrangler mipSetupAndExtractByArm --masterDir analysis --dir fastq --mipSampleFile ids/allMipsSamplesNames.tab.txt --mipArmsFilename ids/mipArms.txt --numThreads 20 --mipServerNumber 1 --refDir ../ideelExtract/fastas/byFamily/ --runRest --samplesMeta ids/metaData.tab.txt & 

Options

Required

  • --masterDir - A name of a directory in which all the downstream analysis will be conducted
  • --dir - The directory of input raw paired end Illumina data
  • --mipSampleFile - A file containing what mips and samples are in the current analysis
  • --mipArmsFilename - A file describing the mip set used in this analysis, needs to have arm information and barcode size, see below for description
  • --mipServerNumber - A number to indicate what you would like to call the final mip server (1 = mip1, 2 = mip2) this controls the naming of the directory in serverResources which will contain the final master data table and extraction information

Optional

  • --refDir - A directory that contains expected sequence to compare final results haplotypes to
  • --numThreads - The number of CPUs to utilize
  • --runRest - Run the rest of the downstream analysis pipeline, by default this command will just run the first step which is setting up the analysis directory and extracting from the raw sample fastq the mip sequence using the arms, the rest of the pipeline of barcode correcting, haplotype clustering, and population clustering plus filtering commands are put in a directory called scripts which will be run if this flag is set or can be ran at a later time.
  • --samplesMeta - A file that contains meta data about the samples, this needs to be a minimum of 2 columns, one column named sample and each additional column is a meta field, this information will be added to final results datatables.

mipArmsFilename

This file describes MIP probes and requires the below columns

  • mip_id - A unique identifier for this MIP probe
  • mip_family - A mip_family that a MIP probe could belong to, data for for each MIP probe that belong to a single family are combined
  • extension_arm - The extension arm sequence of the MIP probe (5-3)
  • ligation_arm - The ligation arm sequence of the MIP probe (5-3)
  • extension_barcode_length - The length of the molecular barcode associated with the extension arm (could be 0 indicating no molecular barcode assoicated with this arm)
  • ligation_barcode_length - The length of the molecular barcode associated with the ligation arm (could be 0 indicating no molecular barcode assoicated with this arm)
  • gene_name - Name for a group of mip_family, could be used to indicate MIPs that capture a similar overlaping region
  • mipset - A name to organize a series of MIPs that might be commonly used together in an analysis

mipSampleFile

This file will have two columns, 1) mips and 2) samples which can appear in either order. The mips columns will list all the mips that are used in this analysis run, for each mip named in this column an entry describing this mip must be found in --mipArmsFilename file, the name will match the column mip_family (ask Ozkan Aydemir why there is a mip_target and mip_family designation). The samples column names the samples used from the input raw data directory named by the --dir flag. The way this naming scheme works is the sample name for files is determined by taking everything before the first underscore in the filename (e.g. the sample name for D6-JJJ-1_S91_R1_001.fastq.gz is D6-JJJ-1)

Output

The analysis directory is large and is current redundant which will probably change in the future but while the program is under development the directory structure shall remain as is.

There will be a directory for each sample which contains results of extraction by arm sequence, barcode correction, and clustering.

  • logs - A directory of logs of the running of the analysis
  • populationClustering - A directory that contains the cross sample clustering for each mip target
  • resources - Resources copied into the analysis directory from the input, (–mipArmsFilename, –mipSampleFile, –samplesMeta, etc)
  • scripts - A directory of executable scripts that will run the rest of the analysis
  • serverResources - A directory containing final data, will be named depending on given mip server number given above (a directory named mip1 will exist if –mipServerNumber 1, mip2 for –mipServerNumber 2, etc )

serverResources

There will be two directories in the mip server directory, 1) extractionInfo 2) popClusInfo

extractionInfo

  • allExtractInfoByTarget.tab.txt - The extraction information for all samples per target including filtering stats
  • allExtractInfoSummary.tab.txt - A summary of the extraction for each sample
  • allStitchInfoByTarget.tab.txt - A summary of the stitching of the extracted targets for all samples
allExtractInfoByTarget.tab.txt

This table combines the extraction stats of per target per sample so the performance of each mip target across samples can be assessed. First the reads are matched by extension arm and then the paried reads are stitched together and filtered on several other filtering criteria.

  • Sample - The name of the sample
  • mipTarget - The mip target
  • mipFamily - The mip family the target belongs to
  • totalMatched - The total number of reads that matched the extension arm
  • goodReads - The total number of reads that past all filters
  • failedLigationArm - The number of reads that failed to match the ligation arm after matching the extension arm
  • failedMinLen(<150) - The number of reads that failed the min length filter
  • failed_q30<0.75 - The number of reads that failed the quality score filtering
  • containsNs - The number of reads that contained N’s
  • badStitch - The number of reads that failed to stitch
allExtractInfoSummary.tab.txt

This file contains similar columns to the table describe above but is summarized over all targets for each samples and then also contains several other values listed below.

  • Sample - Name of the sample
  • total - The total number of raw reads for a sample
  • unmatched - The total number of reads that failed to match any arms
  • indeterminate - The total number of reads that matched too closely several mips and couldn’t be determined to be either one
  • smallFragment - The total number of small fragment reads (this is mostly left over from when things were first designed on other platforms other than Illumina)
  • totalMatched - The total number out of the total raw reads that were matched to a mip target.
allStitchInfoByTarget.tab.txt

The file below contains how the paired end read stitching went for each target per sample

  • Sample - The name of the sample
  • mipTarget - The mip target
  • mipFamily - The mip family the target belongs to
  • total - The total number attempted to stitch
  • r1EndsInR2 - The total number of reads that stitched where the r1 reads ends in the r2 read (target length < than the paired end length x 2, so no read through)
  • r1BeginsInR2 - Read through reads, most likely artifact
  • OverlapFail - Reads that had too much error in overlap and couldn’t stitch or a overlap couldn’t be found
  • PerfectOverlap - Reads that perfectly overlap, an unlikely scernario

popClusInfo

  • allInfo.tab.txt.gz - A master table that contains virtually all of the final results of the pipeline

popClusInfo header information
Each line in this table represent a haplotype for a given target within a given sample which basically represents the smallest divisible data unit for the results of the MIPWrangler analysis pipleine. Due to this structure a lot of the information in this table is redundant (for haplotypes coming from the same sample, the columns containing the data for the sample will be repeated for each haplotype).

Some of these column names are left over from other projects and may change or might not make much sense.

The term “cluster” is used often to refer haplotypes within a sample for a given target with the philosphy that the pipeline clusters together reads for a sample into possible haplotypes and don’t become haplotypes until final filtering and population comparison.

The term “reads” is used to refer to the raw sequence reads from the sequencer.

The term “barcode” below is used to refer to molecular barcodes that are incorporated into the MIP targets that are used to tag each individual MIP capture.

The columns are named so that columns that begin with a certain prefix refer to the same category of data

Columns that begin with:

  • p_ - Refer to specific target
  • h_ - Refer to a population haplotype for a specific target
  • s_ - Refer to sample
  • c_ - Refer to information for a haplotype within a sample

Columns

  • s_Sample - The name of the sample the haplotype is from
  • p_geneName - The group a specific mip target belongs to, mips were originally designed just for genes which is why it is named this
  • p_targetName - The name of the MIP target
  • p_sampleTotal - The number of samples that have data for a specific MIP target
  • p_totalInputClusters - The total number of input haplotypes from all samples
  • p_readTotal - The total number of reads that contribute to this target
  • p_barcodeTotal - The total number of barcodes that contribute to this target
  • p_finalHaplotypeNumber - The final number of unique haplotypes found for this target
  • h_popUID - The population identifier for this haplotype for this target so it can be compared against samples
  • h_mipPopUID - Similar to the h_popUID but this column has no name restrictions (this was added due to downstream analysis pipelines)
  • h_sampleCnt - The total number of samples this haplotype appears in
  • h_sampleFrac - The fraction of samples out of total samples this haplotype appears in (1 would mean it appears in all samples)
  • h_medianBarcodeFrac - The median number of barcodes this population haplotype had across samples
  • h_meanBarcodeFrac - The median number of barcodes this population haplotype had across samples
  • h_readCnt - The number of reads across all samples this population haplotype has
  • h_readFrac - The fraction of total reads this population haplotype has out of all reads found for this target
  • h_barcodeCnt - The total number of barcodes that contribute to this population haplotype across all samples
  • h_barcodeFrac - The fraction of the total barcodes for this target that this population haplotype has
  • h_inputNames - The names of all haplotypes from the samples that match this population haplotype
  • h_seq - The DNA sequence of this population haplotype
  • h_qual - The per base quality scores for h_seq encoded in fastq quality scores
  • s_sName - The name of the sample the haplotype is from
  • s_usedTotalClusterCnt - The total number of clusters that this sample (s_sName) has for this target (p_targetName)
  • s_usedTotalReadCnt - The total number of reads that this sample (s_sName) has for this target (p_targetName)
  • s_usedTotalBarcodeCnt - The total number of barcodes that this sample (s_sName) has for this target (p_targetName)
  • s_inputTotalClusterCnt - The total number of clusters that this sample (s_sName) had before final filtering for this target (p_targetName)
  • s_inputTotalReadCnt - The total number of reads that this sample (s_sName) had before final filtering for this target (p_targetName)
  • s_inputTotalBarcodeCnt - The total number of barcodes that this sample (s_sName) had before final filtering for this target (p_targetName)
  • s_chiTotalClusterCnt - The total number of clusters lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
  • s_chiTotalReadCnt - The total number of reads lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
  • s_chiTotalBarcodeCnt - The total number of barcodes lost to chimera filtering for this sample (s_sName) for this target (p_targetName)
  • s_lowFreqTotalClusterCnt - The total number of clusters lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
  • s_lowFreqTotalReadCnt - The total number of reads lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
  • s_lowFreqTotalBarcodeCnt - The total number of barcodes lost to low frequency filtering for this sample (s_sName) for this target (p_targetName)
  • c_clusterID - The identifier for this cluster in this sample (s_sName) in this target (p_targetName), this will always be a number and starts at 0, the lower the number the more abundant the cluster is in the sample (s_sName)
  • c_name - A name for the cluster, this will have meta data incorporated into it for barcode and read counts etc.
  • c_readCnt - The total number of reads for this cluster (c_name)
  • c_readFrac - The fraction of reads for this cluster (c_name) out of the total reads for this sample (s_sName) for this target (p_targetName)
  • c_barcodeCnt - The total number of barcodes for this cluster (c_name)
  • c_barcodeFrac - The fraction of barcodes for this cluster (c_name) out of the total barcodes for this sample (s_sName) for this target (p_targetName)
  • c_seq - The DNA sequence of this cluster
  • c_qual - The per base quality scores for c_seq encoded in fastq quality scores
  • c_length - The length of the DNA sequence c_seq
  • c_bestExpected - If comparing to expected sequence, the best matching supplied sequence