Regex expression to parse an interesting CSV? -

i need parse csv file using awk. line in csv this:

"hello, world?",1 thousand,"oneword",,,"last one"

some important observations: -field inside quoted string can contain commas , multiple words -unquoted field can multiple worlds -field can empty having 2 commas in row

any clues on writing regex expression split line properly?

thanks!

as many have observed, csv harder format first appears. there many edge cases , ambiguities. example ambiguity, in example, ',,,' field comma or 2 blank fields?

perl, python, java, etc better equipped deal csv because have tested libraries same. regex more fragile.

with awk, have had success this awk function. works under awk, gawk , nawk.

#!/usr/bin/awk -f #************************************************************************** # # file in public domain. # # more information email lorancestinson+csv@gmail.com. # or see http://lorance.freeshell.org/csv/ # # parse csv string array. # number of fields found returned. # in event of error negative value returned , csverr set # error. see below error values. # # parameters: # string  = string parse. # csv     = array parse fields into. # sep     = field separator character. , # quote   = string quote character. " # escape  = quote escape character. " # newline = handle embedded newlines. provide either newline or #           string use in place of newline. if left empty embedded #           newlines cause error. # trim    = when true spaces around separator removed. #           affects parsing. without space between #           separator , quote result in quote being ignored. # # these variables private: # fields  = number of fields found far. # pos     = pull field string. # strtrim = true when string found know remove quotes. # # error conditions: # -1  = unable read next line. # -2  = missing end quote. # -3  = missing separator. # # notes: # code assumes every field preceded separator, # first field. makes logic simpler, requires # separator prepended string before parsing. #************************************************************************** function parse_csv(string,csv,sep,quote,escape,newline,trim, fields,pos,strtrim) {     # make sure there parse.     if (length(string) == 0) return 0;     string = sep string; # code below assumes ,field.     fields = 0; # number of fields found far.     while (length(string) > 0) {         # remove spaces after separator if requested.         if (trim && substr(string, 2, 1) == " ") {             if (length(string) == 1) return fields;             string = substr(string, 2);             continue;         }         strtrim = 0; # used trim quotes off strings.         # handle quoted field.         if (substr(string, 2, 1) == quote) {             pos = 2;             {                 pos++                 if (pos != length(string) &&                     substr(string, pos, 1) == escape &&                     (substr(string, pos + 1, 1) == quote ||                      substr(string, pos + 1, 1) == escape)) {                     # remove escaped quote characters.                     string = substr(string, 1, pos - 1) substr(string, pos + 1);                 } else if (substr(string, pos, 1) == quote) {                     # found end of string.                     strtrim = 1;                 } else if (newline && pos >= length(string)) {                     # handle embedded newlines if requested.                     if (getline == -1) {                         csverr = "unable read next line.";                         return -1;                     }                     string = string newline $0;                 }             } while (pos < length(string) && strtrim == 0)             if (strtrim == 0) {                 csverr = "missing end quote.";                 return -2;             }         } else {             # handle empty field.             if (length(string) == 1 || substr(string, 2, 1) == sep) {                 csv[fields] = "";                 fields++;                 if (length(string) == 1)                     return fields;                 string = substr(string, 2);                 continue;             }             # search separator.             pos = index(substr(string, 2), sep);             # if there no separator rest of string field.             if (pos == 0) {                 csv[fields] = substr(string, 2);                 fields++;                 return fields;             }         }         # remove spaces after separator if requested.         if (trim && pos != length(string) && substr(string, pos + strtrim, 1) == " ") {             trim = strtrim             # count number fo spaces found.             while (pos < length(string) && substr(string, pos + trim, 1) == " ") {                 trim++             }             # remove them string.             string = substr(string, 1, pos + strtrim - 1) substr(string,  pos + trim);             # adjust pos trimmed spaces if quotes string not found.             if (!strtrim) {                 pos -= trim;             }         }         # make sure @ end of string or there separator.         if ((pos != length(string) && substr(string, pos + 1, 1) != sep)) {             csverr = "missing separator.";             return -3;         }         # gather field.         csv[fields] = substr(string, 2 + strtrim, pos - (1 + strtrim * 2));         fields++;         # remove field string next pass.         string = substr(string, pos + 1);     }     return fields; }  {     num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);     if (num_fields < 0) {         printf "error: %s (%d) -> %s\n", csverr, num_fields, $0;     } else {         printf "%s -> \n", $0;         printf "%s fields\n", num_fields;         (i = 0;i < num_fields;i++) {             printf "%s\n", csv[i];         }         printf "|\n";     } }

running on example data produces:

"hello, world?",1 thousand,"oneword",,,"last one" ->  6 fields hello, world? 1 thousand oneword   last 1 |

an example perl solution:

$ echo '"hello, world?",1 thousand,"oneword",,,"last one"' |  perl -lne 'for(/(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g) { s/"$//; s/""/"/g if (s/^"//); say}'

Search This Blog

shell

Regex expression to parse an interesting CSV? -

Comments

Post a Comment

Popular posts from this blog

Add email recipient to all new Trac tickets -

400 Bad Request on Apache/PHP AddHandler wrapper -

asp.net - repeatedly call AddImageUrl(url) to assemble pdf document -