Stripping Log Files

I've got some log files looking something like this:

INFO 2012-08-09 20:23:58: >>>> Parsing the annotations for comparison
INFO 2012-08-09 20:23:58: Parsing sequence 'Alfie' (10)
INFO 2012-08-09 20:24:00: Parse results
INFO 2012-08-09 20:24:00: All parses succeeded!
INFO 2012-08-09 20:24:06: >>> Parsing iteration: 1
INFO 2012-08-09 20:24:06: Completed parsing up to node 1 / 67 (0.00 secs)
INFO 2012-08-09 20:24:06: Completed parsing up to node 2 / 67 (0.00 secs)
INFO 2012-08-09 20:24:06: Completed parsing up to node 3 / 67 (0.01 secs)
INFO 2012-08-09 20:24:06: Completed parsing up to node 4 / 67 (0.08 secs)

Not all lines have a timestamp - there are some multiline logs.

I want to strip the INFO prefix and timestamp from the beginning of each line. Additionally, I want to get rid of those (x.xx secs) bits from the ends of the lines. The output should look like this:

>>>> Parsing the annotations for comparison
Parsing sequence 'Alfie' (10)
Parse results
All parses succeeded!
>>> Parsing iteration: 1
Completed parsing up to node 1 / 67
Completed parsing up to node 2 / 67
Completed parsing up to node 3 / 67
Completed parsing up to node 4 / 67

The reason is that I want to compare the log output from two experiments: obviously the timestamps and precise timings vary, but that's not important.

The following sequence of commands does it. The input is in the file cat1 and the output goes to cat1b:

awk '!/^INFO/ { print } /^INFO/ { $1="";$2="";$3="";print }' cat1 | sed -e 's/^[ ]*//' | sed -e 's/([0-9]*\.[0-9]* secs)//' > cat1b
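
Since the point is to compare two experiments, the same pipeline could be wrapped in a small shell function and run over both logs before diffing them. This is only a sketch: the function name strip_log and the second file name cat2 are made up for illustration and are not part of the recipe above.

```shell
#!/bin/sh
# strip_log: hypothetical helper wrapping the pipeline above.
# Reads a log on stdin and writes the stripped log to stdout.
strip_log() {
    awk '!/^INFO/ { print } /^INFO/ { $1="";$2="";$3="";print }' |
        sed -e 's/^[ ]*//' |
        sed -e 's/([0-9]*\.[0-9]* secs)//'
}

# Possible usage (cat2 is an assumed name for the second experiment's log):
#   strip_log <cat1 >cat1b
#   strip_log <cat2 >cat2b
#   diff cat1b cat2b
```

One thing to be aware of: the first sed strips leading spaces from every line, so indented continuation lines without a timestamp lose their indentation too. That should not affect the diff as long as both logs go through the same function.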

Explanation

First comes an awk command to remove the log timestamps. (It only handles INFO logs.)

awk '!/^INFO/ { print } /^INFO/ { $1="";$2="";$3="";print }' cat1

Lines without INFO at the beginning are printed in full. Lines with INFO at the beginning have their first three whitespace-separated fields (INFO, date, time) set to the empty string before being printed.
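
To see what this step does on its own, the awk command can be fed a couple of sample lines. In this sketch the function name strip_prefix is hypothetical, and the extra sed only adds brackets around each line so that the leftover spaces become visible.

```shell
#!/bin/sh
# strip_prefix: hypothetical wrapper around the awk step only.
strip_prefix() {
    awk '!/^INFO/ { print } /^INFO/ { $1="";$2="";$3="";print }'
}

# Feed in one INFO line and one line without a timestamp,
# then bracket each output line to make leading spaces visible.
printf '%s\n' \
    "INFO 2012-08-09 20:24:00: Parse results" \
    "a continuation line without a timestamp" |
    strip_prefix |
    sed -e 's/^/[/' -e 's/$/]/'
# Prints:
# [   Parse results]
# [a continuation line without a timestamp]
```

The three emptied fields are still joined by awk's output field separator, which is where the three spaces come from. Assigning to a field also makes awk rebuild the whole line with single spaces between fields, so any runs of spaces later in an INFO line are collapsed.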

This leaves spaces at the start of the line. These are easily removed by a sed command, which strips all initial spaces from each line:

sed -e 's/^[ ]*//'

Finally, I get rid of those parse times, the (x.xx secs) bits, with another sed command:

sed -e 's/([0-9]*\.[0-9]* secs)//'
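
As an alternative sketch (not the method used above), the three steps could probably be folded into a single sed call. It assumes every timestamped line starts with exactly "INFO", one space, a date made of digits and dashes, one space, and a time made of digits and colons, as in the sample log. The function name strip_log_sed is hypothetical.

```shell
#!/bin/sh
# strip_log_sed: hypothetical single-sed variant of the recipe.
# First expression: drop "INFO <date> <time>: " from the start of the line.
# Second expression: drop a trailing "(x.xx secs)" plus the spaces before it.
strip_log_sed() {
    sed -e 's/^INFO [0-9-]* [0-9:]* //' \
        -e 's/ *([0-9]*\.[0-9]* secs)$//'
}

printf '%s\n' \
    "INFO 2012-08-09 20:24:06: Completed parsing up to node 4 / 67 (0.08 secs)" \
    "INFO 2012-08-09 20:23:58: Parsing sequence 'Alfie' (10)" |
    strip_log_sed
# Prints:
# Completed parsing up to node 4 / 67
# Parsing sequence 'Alfie' (10)
```

Unlike the awk version, this leaves lines without a timestamp completely untouched, keeps any runs of spaces inside a line, and does not leave a trailing space where the timing used to be. The trade-off is that it is tied more tightly to the exact timestamp layout.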



Mark: BashRecipes (last edited 2013-04-21 19:50:37 by MarkGranrothWilding)