Batch Paired-End File Interlacer
Posted on 2012-07-18 13:04:27
by Geert Vandeweyer
Creating Interlaced Paired-end data
Galaxy is a great platform to analyze next-generation sequencing data in non-cli manner. Currently, it has a few major drawbacks. One of those is the inabilility to start workflows on batches of paired end data. That's why I wrote this little script to interlace paired-end data in batch outside of galaxy. Galaxy itself can interlace paired end data sets, but again, not in batch mode.
For those that are interested, the script is shown below.
Notes:
- Script uses the qsubwrapper.sh command to submit jobs to a torque/pbs queue. This program is available on this webpage as well (under BASH category)
- Script can be run without HPC support, by using the -l (lowercase L) switch
#!/usr/bin/perl
$|++;
use Getopt::Std;
use Cwd;
# i : ids file
# f : forward
# r : reverse
# n : name for interlaced output.
# l : run local, without qsubwrapper.sh (torque)
getopts('i:f:r:n:l', \%opts); # option are in %opts
## open ids file
if (defined($opts{'i'})) {
if (!-d "Interlaced") {
print "Creating Output Directory: Interlaced\n";
mkdir("Interlaced");
}
open IN, $opts{'i'};
my $cwd = getcwd;
while (<IN>) {
chomp($_);
next if ($_ eq '');
my ($forward, $samplename) = split(/\t/,$_);
## check if forward exists
if (!-e "$forward") {
print "File '$forward' was not found. This file was skipped.";
next;
}
## try to create reverse
my $counter = 0;
$counter++ while ($forward =~ m/_R1_/g);
if ($counter != 1) {
print "Reverse name could not be derived from '$forward' due to zero or multiple occureces of '_R1_'. File was skipped\n";
next;
}
my $reverse = $forward;
$reverse =~ s/_R1_/_R2_/;
if (-e "$forward" && -e "$reverse") {
print "Sample $samplename found, starting\n";
if (!exists($opts{'l'})) {
my $cmd = "qsubwrapper.sh -q QueueName -d '$cwd' -N 'Interlace_$samplename' -m e -M 'Your.Mail\@domain.com' -o 'job_output/$samplename.o.txt' -l 'nodes=1:ppn=3,mem=1500mb' -e 'job_output/$samplename.e.txt' -- perl Interlace_Samples.pl -f '$forward' -r '$reverse' -n '$samplename'";
system($cmd);
}
else {
my $cmd = "cd '$cwd' && perl Interlace_Samples.pl -f '$forward' -r '$reverse' -n '$samplename'";
system($cmd);
}
}
else {
print "Data files for sample $samplename not found. skipping\n";
}
}
}
elsif (defined($opts{'f'}) && defined($opts{'r'}) && defined($opts{'n'}) ){
my $forward = $opts{'f'};
my $reverse = $opts{'r'};
my $output = "Interlaced/".$opts{'n'}.".interlaced.fastq";
my $oriout = $output;
$subfileidx = 0;
## checks
if (!-e "$forward" || !-e "$reverse") {
print "Forward or reverse file not found, exiting:\nForward: $forward\nReverse: $reverse\n";
exit();
}
while (-e "$output") {
$subfileidx++;
if ($subfileidx == 1) {
my $subname = substr($output,0,-6);
$output = $subname."_$subfileidx.fastq";
}
else {
$output =~ s/(.*interlaced_)(\d+)(\.fastq)/$1$subfileidx$3/;
}
}
if ($subfileidx > 0) {
# outfile was changed.
print "Output file ($oriout) was already found, Redirecting output to $output\n";
}
## launch interlacing.
$r1 = $forward; # rename (script copied from other script)
$r2 = $reverse;
## open IN files (gzipped, pass through zcat)
open R1, "zcat $r1 |";
open R2, "zcat $r2 |";
# counters and cache
my $printcounter = 0;
my $printstring = '';
while( my $readname = <R1>) {
# add /1 and /2 to readname for de-interlacing in galaxy
my $reverse = <R2> # discarded reverse readname
# compose interlaceable readname
chomp($readname);
$readname =~ s/(\S+)\s(.*)/$1/; # part before space holds important stuff according to wikipedia.
my $forwardname .= "$readname/1\n"; ## galaxy deinterlacer splits on /1 and /2
my $reversename = "$readname/2\n";
# entries forward
my $forward = $forwardname;
$forward .= <R1>;
$dump = <R1>;
$forward .= "+\n"; # start of qual, + is enough, might also be +readname
$forward .= <R1> ;
# remaining entries reverse
$reverse = $reversename;
$reverse .= <R2>;
$dump = <R2>;
$reverse .= "+\n"; # start of qual
$reverse .= <R2>;
$ok++;
## concat to printstring
$printstring .= $forward.$reverse;
$printcounter++;
if ($printcounter > 2000000) {
## print in batches to reduce network and disk load, also gzip the output file at highest compression.
print "printing 2M reads.\n";
# use this output for gzipped output.
#open OUT, "| gzip --best -c >> $output" or die("Could not open outfile '$output'");
open OUT, ">> $output" or die("Could not open outfile '$output'");
print OUT $printstring;
close OUT;
$printstring = '';
$printcounter = 0;
}
}
close R1;
close R2;
## print last ones
open OUT, ">>$output";
print OUT $printstring;
close OUT;
# stats
print "\nDONE\n####\n";
print "Total reads for ".$opts{'n'}." : $ok\n";
}
else {
print "No runtime options found. Exiting\n";
exit();
}
Using the script
todo (see comments)
galaxy, NGS, Paired-End, Perl, torque/pbs
Comments
Loading Comments