I have a tab-delimited file with 43,075 lines and 7 columns. I sorted the file by column 4 from the highest to the lowest value. Now I need to find 342 genes whose IDs are in column 2. See the example below:
miR Target Transcript Score Energy Length miR Length target
aae-bantam-3p AAEL007110 AAEL007110-RA 28404 -565.77 22 1776
aae-let-7 AAEL007110 AAEL007110-RA 28404 -568.77 21 1776
aae-miR-1 AAEL007110 AAEL007110-RA 28404 -567.77 22 1776
aae-miR-100 AAEL007110 AAEL007110-RA 28404 -567.08 22 1776
aae-miR-11-3p AAEL007110 AAEL007110-RA 28404 -564.03 22 1776
.
.
.
aae-bantam-3p AAEL018149 AAEL018149-RA 28292 -569.7 22 1769
aae-bantam-5p AAEL018149 AAEL018149-RA 28292 -570.93 23 1769
aae-let-7 AAEL018149 AAEL018149-RA 28292 -574.26 21 1769
aae-miR-1 AAEL018149 AAEL018149-RA 28292 -568.34 22 1769
aae-miR-10 AAEL018149 AAEL018149-RA 28292 -570.08 22 1769
There are 124 lines for each gene. However, I want to extract only the top hits for each gene, for example the top 5 lines per gene, since the file is already sorted. I can do it for one gene with the following script:
sed -n '/AAEL018149/ {p;q}' myfile.csv > top-hits.csv
However, it prints only the first matching line. I was wondering if I could use a script to get all 342 genes at once. It would be great if I could get the matching line and the next 4, so that I would have the top 5 hits for each gene. Any suggestion will be welcome. Thanks
You can also use awk for this:
awk '++a[$2]<=5' myfile.csv
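Here, $2 is the 2nd column, and a[$2] counts how many times each gene ID has been seen so far; a line is printed only while that count is at most 5. Since the file is already sorted by the 4th column, this prints the top 5 lines for each gene (2nd column). All 342 genes will be covered. The header line is also kept, because its 2nd field ("Target") occurs only once.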
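If it helps, here is a slightly more explicit variant of the same idea, as a sketch rather than a drop-in solution. It assumes the file really is tab-delimited (hence -F'\t'), passes the number of hits per gene in through a variable n, and keeps the header via NR == 1 instead of relying on "Target" being unique. The file names sample.tsv and top-hits.tsv, the gene IDs GENE_A and GENE_B, and the scores are all made up for the demonstration.

```shell
# Build a small hypothetical tab-delimited sample: a header plus 7 rows per gene,
# with rows for each gene already in descending score order.
printf 'miR\tTarget\tTranscript\tScore\n' > sample.tsv
for gene in GENE_A GENE_B; do
  for i in 1 2 3 4 5 6 7; do
    printf 'miR-%s\t%s\t%s-RA\t%s\n' "$i" "$gene" "$gene" "$((100 - i))" >> sample.tsv
  done
done

# Print the header (NR == 1), then at most n lines for each distinct value of column 2.
# seen[$2] is a per-gene counter; the header is not counted because of the short-circuit.
awk -F'\t' -v n=5 'NR == 1 || ++seen[$2] <= n' sample.tsv > top-hits.tsv
```

On the real file you would replace sample.tsv with myfile.csv and change n if you want more or fewer hits per gene. The result is only the "top" hits because the input is already sorted by score; awk just keeps the first n lines it meets for each gene.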