Perl to remove new lines from the input file and write to the output file

advertisements

My input is a file with a series of strings, something like this:
ABCDEFGHIJKL
XKGKASKGKD
SGJKSKGS

I want to remove the new lines and write the output to a new file: ABCDEFGHIJKLXKGKASKGKDSGJKSKGS

So far I have this:

#! /usr/bin/perl
use strict;
use warnings;
my $input = $ARGV[0];
my $output = "concatenated.txt"; #Output sequence
#Open output file
open >"$output"; or die "unable to open $output";
#Open input file
open "<$input" or die "unable to open $input";

while (<INPUT>) {
   if( /^[AGCT]/)
   chomp;

print $output;
}

close $input;

close $output;

print "Done!\n";

But it doesn't work yet.
Is chomp enough to concatenate in this instance?
How do I write what I have created to the output file?


If I am guessing right :-) If your file looks like this with more than 1 sequence, you would need to keep newlines after the id line and the last line of the sequence.

A perl 1 liner could be this:

perl -0777 -pe 's/^[TAGC]+\K\n(?!>)//gm' fasta.txt > concatenated.txt

The -0777 says to slurp the whole file into one string.

This substitution says to match all [TAGC] starting at the beginning of the line, (with \K, keep everything before, don't erase). Then a newline \n that is not followed by a >, (beginning of the following line is id).

This erases the new line provided it is a sequence line and not followed by a new id line. The g switch says to do this globally and the m switch allows the caret, ^, to match at the beginning of a line rather than its usual behavior, matching at the beginning of the string.

>NR_037701 1
AGGAGCTATGAATATTAATGAAAGTGGTCCTGATGCATGCATATTAAACA
TGCATCTTACATATGACACATGTTCACCTTGGGGTGGAGACTTAATATTT
AAATATTGCAATCAGGCCCTATACATCAAAAGGTCTATTCAGGACATGAA
GGCACTCAAGTATGCAATCTCTGTAAACCCGCTAGAACCAGTCATGGTCG
GTGGGCTCCTTACCAGGAGAAAATTACCGAAATCACTCTTGTCCAATCAA
AGCTGTAGTTATGGCTGGTGGAGTTCAGTTAGTCAGCATCTGGTGGAGCT
GCAAGTGTTTTAGTATTGTTTATTTAGAGGCCAGTGCTTATTTAGCTGCT
AGAGAAAAGGAAAACTTGTGGCAGTTAGAACATAGTTTATTCTTTTAAGT
GTAGGGCTGCATGACTTAACCCTTGTTTGGCATGGCCTTAGGTCCTGTTT
GTAATTTGGTATCTTGTTGCCACAAAGAGTGTGTTTGGTCAGTCTTATGA
CCTCTATTTTGACATTAATGCTGGTTGGTTGTGTCTAAACCATAAAAGGG
AGGGGAGTATAATGAGGTGTGTCTGACCTCTTGTCCTGTCATGGCTGGGA
ACTCAGTTTCTAAGGTTTTTCTGGGGTCCTCTTTGCCAAGAGCGTTTCTA
TTCAGTTGGTGGAGGGGACTTAGGATTTTATTTTTAGTTTGCAGCCAGGG
TCAGTACATTTCAGTCACCCCCGCCCAGCCCTCCTGATCCTCCTGTCATT
CCTCACATCCTGTCATTGTCAGAGATTTTACAGATATAGAGCTGAATCAT
TTCCTGCCATCTCTTTTAACACACAGGCCTCCCAGATCTTTCTAACCCAG
GACCTACTTGGAAAGGCATGCTGGGTCTCTTCCACAGACTTTAAGCTCTC
CCTACACCAGAATTTAGGTGAGTGCTTTGAGGACATGAAGCTATTCCTCC
CACCACCAGTAGCCTTGGGCTGGCCCACGCCAACTGTGGAGCTGGAGCGG
GAGGGAGGAGTACAGACATGGAATTTTAATTCTGTAATCCAGGGCTTCAG
TTATGTACAACATCCATGCCATTTGATGATTCCACCACTCCTTTTCCATC
TCCCAGAAGCCTGCTTTTTAATGCCCGCTTAATATTATCAGAGCCGAGCC
TGGAATCAAACTGCCTCTTTCAAAACCTGCCACTATATCCTGGCTTTGTG
ACCTCAGCCAAGTTGCTTGACTATTCTCAGTCTCAGTTTCTGCACCTGTC
AAATAGGGTTTATGTTAACCTAACTTTCAGGGCTGTCAGGATTAAATGAG
CATGAACCACATAAAATGTTTGGTGTATAGTAAGTGTACAGTAAATACTT
CCATTATCAGTCCCTGCAATTCTATTTTTCTTCCTTCTCTACACAGCCCC
TGTCTGGCTTTAAAATGTCCTGCCCTGCTTTTTATGAGTGGATACCCCCA
GCCCTATGTGGATTAGCAAGTTAAGTAATGACACTCAGAGACAGTTCCAT
CTTTGTCCATAACTTGCTCTGTGATCCAGTGTGCATCACTCAAACAGACT
ATCTCTTTTCTCCTACAAAACAGACAGCTGCCTCTCAGATAATGTTGGGG
GCATAGGAGGAATGGGAAGCCCGCTAAGAGAACAGAAGTCAAAAACAGTT
GGGTTCTAGATGGGAGGAGGTGTGCGTGCACATGTATGTTTGTGTTTCAG
GTCTTGGAATCTCAGCAGGTCAGTCACATTGCAGTGTGTCGCTTCACCTG
GCTCCCTCTTTTAAAGATTTTCCTTCCCTCTTTCCAACTCCCTGGGTCCT
GGATCCTCCAACAGTGTCAGGGTTAGATGCCTTTTATGGGCCACTTGCAT
TAGTGTCCTGATAGAGGCTTAATCACTGCTCAGAAACTGCCTTCTGCCCA
CTGGCAAAGGGAGGCAGGGGAAATACATGATTCTAATTAATGGTCCAGGC
AGAGAGGACACTCAGAATTTCAGGACTGAAGAGTATACATGTGTGTGATG
GTAAATGGGCAAAAATCATCCCTTGGCTTCTCATGCATAATGCATGGGCA
CACAGACTCAAACCCTCTCTCACACACATACACATATACATTGTTATTCC
ACACACAAGGCATAATCCCAGTGTCCAGTGCACATGCATACACGCACACA
TTCCCTTCCTAGGCCACTGTATTGCTTTCCTAGGGCATCTTCTTATAAGA
CACCAGTCGTATAAGGAGCCCACCCCACTCATCTGAGCTTATCAACCAAT
TACATTAGGAAAGACTGTATTTCCTAGTAAGGTCACATTCAGTAGTACTG
AGGGTTGGGACTTCAACACAGCTTTTTGGGGGATCATAATTCAACCCATG
ACAGCCACTGAGATTATTATATCTCCAGAGAATAAATGTGTGGAGTTAAA
AGGAAGATACATGTGGTACAAGGGGTGGTAAGGCAAGGGTAAAAGGGGAG
GGAGGGGATTGAACTAGACACAGACACATGAGCAGGACTTTGGGGAGTGT
GTTTTATATCTGTCAGATGCCTAGAACAGCACCTGAAATATGGGACTCAA
TCATTTTAGTCCCCTTCTTTCTATAAGTGTGTGTGTGCGGATATGTGTGC
TAGATGTTCTTGCTGTGTTAGGAGGTGATAAACATTTGTCCATGTTATAT
AGGTGGAAAGGGTCAGACTACTAAATTGTGAAGACATCATCTGTCTGCAT
TTATTGAGAATGTGAATATGAAACAAGCTGCAAGTATTCTATAAATGTTC
ACTGTTATTAGATATTGTATGTCTTTGTGTCCTTTTATTCATGAATTCTT
GCACATTATGAAGAAAGAGTCCATGTGGTCAGTGTCTTACCCGGTGTAGG
GTAAATGCACCTGATAGCAATAACTTAAGCACACCTTTATAATGACCCTA
TATGGCAGATGCTCCTGAATGTGTGTTTCGAGCTAGAAAATCCGGGAGTG
GCCAATCGGAGATTCGTTTCTTATCTATAATAGACATCTGAGCCCCTGGC
CCATCCCATGAAACCCAGGCTGTAGAGAGGATTGAGGCCTTAAGTTTTGG
GTTAAATGACAGTTGCCAGGTGTCGCTCATTAGGGAAAGGGGTTAAGTGA
AAATGCTGTATAAACTGCATGATGTTTGCAGGCAGTTGTGGTTTTCCTGC
CCAGCCTGCCACCACCGGGCCATGCGGATATGTTGTCCAGCCCAACACCA
CAGGACCATTTCTGTATGTAAGACAATTCTATCCAGCCCGCCACCTCTGG
ACTCCCTCCCCTGTATGTAAGCCCTCAATAAAACCCCACGTCTCTTTTGC
TGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAA
>NM_198399 1
AACAGATTTTAACTCTGAAAAGCCATTTCCAGTGTCTATAGACTATTGTG
AGCCTGGAGAAGTAGCATTTAGTTGGGATAGCTTCACTAGAGCTGCCTGC
CAAAGACTTCCTTCCACAGGATCTTGTCGCACCAGCAACTGACAGGAGCT
TGGGAGCTCGGGAGCTTGGGAGAGGGCTTATGTTTTTAATAATGTAGCTG
TCAGTTCGAAGCCTGGAAATGTTGACCCTCAAAGGGCATAAAATCTTGTT
ATTTTAATTTGCATCTGGGAGAATGTCTGAGCAAGGAGACCTGAATCAGG
CAATAGCAGAGGAAGGAGGGACTGAGCAGGAGACGGCCACTCCAGAGAAC
GGCATTGTTAAATCAGAAAGTCTGGATGAAGAGGAGAAACTGGAACTGCA
GAGGCGGCTGGAGGCTCAGAATCAAGAAAGAAGAAAATCCAAGTCAGGAG
CAGGAAAAGGTAAACTGACTCGCAGCCTTGCTGTCTGTGAGGAATCTTCT
GCCAGACCAGGAGGTGAAAGTCTTCAGGATCAGACTCTCTGAAAACTGCA
AATGGAAAGGAATTCAAAAGAATTTAGATTAAAAGTTAAATAAAAAGTAG
GCACAGTAGTGCTGAATTTTCCTCAAAGGCTCTCTTTTGATAAGGCTGAA
CCAAATATAATCCCAAGTATCCTCTCTCCTTCCTTGTTGGAGATGTCTTA
CCTCTCAGCTCCCCAAAATGCACTTGCCTATAAGAAACACAATTGCTGGT
TCATATGAAACTTAGGAAATAGTGAATAAGGTGCATTTAACTTTGGAGAA
ATACTTTTATGGCTTTGGTGGAGATTTCTCAATACTGCAAAAGTTGTCCA
GAAATGAATCTGAGCTGATGGTGACTTTAAGTTAATATTATTAATATATC
ACTGCATATTTTTACCCTTATTTTTGCTCCTTACAGCAAGATTAGTAGGT
TATAAAAATTTAAATTTAAACAAAATTATTTCATGACAAAATGGGAAACT
TCACATCATACTTATTTTTGTTTGCCTTTCAGGCATCATATTAGCTTTTA
TAAAAAATGGTCTTGCTGCTGAAATTGTACTTATTTTATCAGAGGCTGGG
TGCAGTCAAGACAAAAGTAAAATGGTTTACCTGAGCCCAGGGGAGGGAAA
ATTGATTAAGATATCATTATTTTTGTTTGGTTTGGTTTTGCTTTTTTCCT
CTTACTTTAATTGAAATACTCTGAATTCCCCTCATGGAAACAGAGAGCAT
TGAGAGCACTTTCTTTAAAAGGACCAAAAATAAATTCCTAATAGATTTTG

Update If you need the solution in a script, then the following would produce the same results as the command line.

The command line would be perl yourscript.pl fasta.txt Note that instead of explicitly opening 'fasta.txt', I used the empty filehandle, <>. That reads in the fasta file specified on the command line.

#!/usr/bin/perl
use strict;
use warnings;

# Output sequence
my $output = "concatenated.txt";

open my $handle, '>', $output or die "unable to open $output";

my $current = <>;

while (my $next = <>) {

    # if current line is seq characters (not a header)
    # AND the next line isn't a header
    if (substr($current, 0, 1) ne '>' && substr($next, 0, 1) ne '>') {
        chomp($current)
    }

    print $handle $current;
    $current = $next;

    # print last line if at the end of file
    print $handle $current if eof;
}