Scribbles & Snippets

... of low and no utility ...

Latest Posts

Running Cromwell on AWS/Batch
- Posted: 2018-11-21.

Parallel mysql myisam repair
- Posted: 2018-11-21.

Does a TKI like Crizotinib kill tumor cells ?
- Posted: 2018-10-28.

Save Spotify to flac or mp3
- Posted: 2018-07-14.

Replace all symlinks by the original file
- Posted: 2018-03-02.

Log in

Batch Paired-End File Interlacer

Posted on 2012-07-18 13:04:27
by Geert Vandeweyer

Loading Content

Comments

Loading Comments

Batch Paired-End File Interlacer

Posted on 2012-07-18 13:04:27
by Geert Vandeweyer

Creating Interlaced Paired-end data

Galaxy is a great platform to analyze next-generation sequencing data in non-cli manner. Currently, it has a few major drawbacks. One of those is the inabilility to start workflows on batches of paired end data. That's why I wrote this little script to interlace paired-end data in batch outside of galaxy. Galaxy itself can interlace paired end data sets, but again, not in batch mode.

For those that are interested, the script is shown below.

Notes:

Script uses the qsubwrapper.sh command to submit jobs to a torque/pbs queue. This program is available on this webpage as well (under BASH category)
Script can be run without HPC support, by using the -l (lowercase L) switch

#!/usr/bin/perl
$|++;

use Getopt::Std;
use Cwd;
# i : ids file
# f : forward
# r : reverse
# n : name for interlaced output.
# l : run local, without qsubwrapper.sh (torque) 
getopts('i:f:r:n:l', \%opts);  # option are in %opts

## open ids file
if (defined($opts{'i'})) {
	if (!-d "Interlaced") {
		print "Creating Output Directory: Interlaced\n";
		mkdir("Interlaced");
	}
	open IN, $opts{'i'};
	my $cwd = getcwd;
	while (<IN>) {
		chomp($_);
		next if ($_ eq '');
		my ($forward, $samplename) = split(/\t/,$_);
		## check if forward exists
		if (!-e "$forward") {
			print "File '$forward' was not found. This file was skipped.";
			next;
		}
		## try to create reverse
		my $counter = 0;
		$counter++ while ($forward =~ m/_R1_/g);
		if ($counter != 1) {
			print "Reverse name could not be derived from '$forward' due to zero or multiple occureces of '_R1_'. File was skipped\n";
			next;
		}
		my $reverse = $forward;
		$reverse =~ s/_R1_/_R2_/;
		if (-e "$forward" && -e "$reverse") {
			print "Sample $samplename found, starting\n";
			if (!exists($opts{'l'})) {
				my $cmd = "qsubwrapper.sh -q QueueName -d '$cwd' -N 'Interlace_$samplename' -m e -M 'Your.Mail\@domain.com' -o 'job_output/$samplename.o.txt' -l 'nodes=1:ppn=3,mem=1500mb' -e 'job_output/$samplename.e.txt' -- perl Interlace_Samples.pl -f '$forward' -r '$reverse' -n '$samplename'";
				system($cmd);
			}
			else {
				my $cmd = "cd '$cwd' && perl Interlace_Samples.pl -f '$forward' -r '$reverse' -n '$samplename'";
				system($cmd);

			}
		}
		else {
			print "Data files for sample $samplename not found. skipping\n";
		}
		
	}
}
elsif (defined($opts{'f'}) && defined($opts{'r'}) && defined($opts{'n'}) ){
	my $forward = $opts{'f'};
	my $reverse = $opts{'r'};
	my $output = "Interlaced/".$opts{'n'}.".interlaced.fastq";
	my $oriout = $output;
	$subfileidx = 0;
	## checks
	if (!-e "$forward" || !-e "$reverse") {
		print "Forward or reverse file not found, exiting:\nForward: $forward\nReverse: $reverse\n";
		exit();
	}
	while (-e "$output") {
		$subfileidx++;
		if ($subfileidx == 1) {
			my $subname = substr($output,0,-6);
			$output = $subname."_$subfileidx.fastq";
		}
		else {
			$output =~ s/(.*interlaced_)(\d+)(\.fastq)/$1$subfileidx$3/;
		}
	}
	if ($subfileidx > 0) {
		# outfile was changed.
		print "Output file ($oriout) was already found, Redirecting output to $output\n";
	}
	## launch interlacing.
	$r1 = $forward; # rename (script copied from other script)
	$r2 = $reverse;
	## open IN files (gzipped, pass through zcat)
	open R1, "zcat $r1 |";
	open R2, "zcat $r2 |";
	# counters and cache
	my $printcounter = 0;
	my $printstring = '';
	while( my $readname = <R1>) {
		# add /1 and /2 to readname for de-interlacing in galaxy
		my $reverse = <R2> # discarded reverse readname
		# compose interlaceable readname
		chomp($readname);
		$readname =~ s/(\S+)\s(.*)/$1/; # part before space holds important stuff according to wikipedia.
		my $forwardname .= "$readname/1\n"; ## galaxy deinterlacer splits on /1 and /2
		my $reversename = "$readname/2\n";
		#  entries forward
		my $forward = $forwardname;
		$forward .= <R1>;
		$dump = <R1>;
		$forward .= "+\n"; # start of qual, + is enough, might also be +readname
		$forward .= <R1> ;
		# remaining entries reverse
		$reverse = $reversename;
		$reverse .= <R2>;
		$dump = <R2>;
		$reverse .= "+\n"; # start of qual
		$reverse .= <R2>;
		$ok++;
		## concat to printstring
		$printstring .= $forward.$reverse;
		$printcounter++;
		if ($printcounter > 2000000) {
			## print in batches to reduce network and disk load, also gzip the output file at highest compression.
			print "printing 2M reads.\n";
 			# use this output for gzipped output.
			#open OUT, "| gzip --best -c >> $output" or die("Could not open outfile '$output'");
			open OUT, ">> $output" or die("Could not open outfile '$output'");
			print OUT $printstring;
			close OUT;
			$printstring = '';
			$printcounter = 0;
		}
	}
	close R1;
	close R2;

	## print last ones
        open OUT, ">>$output";
        print OUT $printstring;
        close OUT;
	# stats
	print "\nDONE\n####\n";
	print "Total reads for ".$opts{'n'}." : $ok\n";
}
else {
	print "No runtime options found. Exiting\n";
	exit();
}

Using the script

todo (see comments)

galaxy, NGS, Paired-End, Perl, torque/pbs

Comments

Geert VandeweyerLast Edited On 2012-07-18 15:04:12

Input file Format

The script should be run using

perl Interlace_Samples.pl -i FastQ.files

This FastQ.Filies is a file containing one sample per line, specifying the path to the forward read, and the samplename, seperated by a "TAB". The reverse reads are composed by replacing _R1_ by _R2_. The easiest way is to run a command similar to the follwoing in the directory containing the fastq files, and edititing the output by adding the sample names. The interlacer assumes fastq.gz (gzipped) files, and outputs regular, uncompressed fastq files, compatible with the galaxy built-in deinterlacer.

find -type f | grep 'fastq.gz' | grep '_R1_' > FastQ.files

After Interlacing, results will be placed in the "Interlaced" subdirectory. If using torque/pbs, make sure you create the job_output directory before starting the job.

A sample input file is shown below, where the fastq files are located in subdirectories per sample:

./2620/index1_ATCACG_L006_R1_001.fastq.gz	Sample_93875	
./2621/index13_AGTCAA_L006_R1_001.fastq.gz	Sample_97459
./2622/index17_GTAGAG_L006_R1_001.fastq.gz	Sample_63182

Categories

Latest Posts

Tag Collection

Log in

Batch Paired-End File Interlacer

Posted on 2012-07-18 13:04:27
by Geert Vandeweyer

Comments

Batch Paired-End File Interlacer

Posted on 2012-07-18 13:04:27
by Geert Vandeweyer

Creating Interlaced Paired-end data

Using the script

galaxy, NGS, Paired-End, Perl, torque/pbs

Comments

Input file Format

Leave Comment

Loading...

Categories

Latest Posts

Tag Collection

Log in

Batch Paired-End File Interlacer

Posted on 2012-07-18 13:04:27by Geert Vandeweyer

Comments

Batch Paired-End File Interlacer

Posted on 2012-07-18 13:04:27by Geert Vandeweyer

Creating Interlaced Paired-end data

Using the script

galaxy, NGS, Paired-End, Perl, torque/pbs

Comments

Input file Format

Leave Comment

Loading...

Posted on 2012-07-18 13:04:27
by Geert Vandeweyer

Posted on 2012-07-18 13:04:27
by Geert Vandeweyer