Sed command to find several patterns and print the matching line and the lines after it


I have a tab-delimited file with 43,075 lines and 7 columns. I sorted the file by column 4 from the highest to the lowest value. Now I need to find 342 genes whose IDs are in column 2. See the example below:

miR Target  Transcript  Score   Energy  Length miR  Length target
aae-bantam-3p   AAEL007110  AAEL007110-RA   28404   -565.77 22  1776
aae-let-7   AAEL007110  AAEL007110-RA   28404   -568.77 21  1776
aae-miR-1   AAEL007110  AAEL007110-RA   28404   -567.77 22  1776
aae-miR-100 AAEL007110  AAEL007110-RA   28404   -567.08 22  1776
aae-miR-11-3p   AAEL007110  AAEL007110-RA   28404   -564.03 22  1776
.
.
.
aae-bantam-3p   AAEL018149  AAEL018149-RA   28292   -569.7  22  1769
aae-bantam-5p   AAEL018149  AAEL018149-RA   28292   -570.93 23  1769
aae-let-7   AAEL018149  AAEL018149-RA   28292   -574.26 21  1769
aae-miR-1   AAEL018149  AAEL018149-RA   28292   -568.34 22  1769
aae-miR-10  AAEL018149  AAEL018149-RA   28292   -570.08 22  1769

There are 124 lines for each gene. However, I want to extract only the top hits for each gene, for example the top 5, since the file is sorted. I can do it for one gene with the following command:

sed -n '/AAEL018149/ {p;q}' myfile.csv > top-hits.csv

However, it prints only the matching line. I was wondering if I could use a script to get all 342 genes at once. It would be great if I could get the matching line and the next 4, so that I would have the top 5 hits for each gene. Any suggestion will be welcome. Thanks


You can also use awk for this:

awk '++a[$2]<=5' myfile.csv

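Here, $2 is the 2nd column, and a[$2] counts how many times each gene ID has been seen so far; a line is printed only while that count is at most 5. Since the file is already sorted on the 4th column, this prints the top 5 lines for each gene in the 2nd column. All 342 genes will be covered. The header line is kept as well, because its second field (Target) occurs only once.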
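If the file also contains genes outside the 342 of interest, the same counter can be combined with a lookup list. The following is only a sketch under assumptions: genes.txt is a hypothetical file holding one gene ID per line, the toy table stands in for myfile.csv, and n=2 is used so the demo stays small (set n=5 for the top 5 hits).

```shell
# Hypothetical demo data: a tiny tab-delimited table and a gene list.
tmp=$(mktemp -d)
printf 'miR\tTarget\tScore\n'              >  "$tmp/myfile.csv"
printf 'm1\tG1\t9\nm2\tG1\t8\nm3\tG1\t7\n' >> "$tmp/myfile.csv"
printf 'm1\tG2\t6\nm2\tG2\t5\n'            >> "$tmp/myfile.csv"
printf 'm1\tG3\t4\n'                       >> "$tmp/myfile.csv"
printf 'G1\nG3\n' > "$tmp/genes.txt"       # assumed: one gene ID per line

# First file: remember the wanted IDs.
# Second file: print the header, then at most n rows per wanted gene.
awk -F'\t' -v n=2 '
    NR == FNR { want[$1]; next }
    FNR == 1  { print; next }
    ($2 in want) && ++seen[$2] <= n
' "$tmp/genes.txt" "$tmp/myfile.csv" > "$tmp/top-hits.csv"
```

With the real data, the two file arguments would be the gene list and myfile.csv. Rows for genes that are not in the list (G2 in the toy table) are skipped entirely.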