Regex expression to parse an interesting CSV? -
i need parse csv file using awk. line in csv this:
"hello, world?",1 thousand,"oneword",,,"last one"
some important observations: -field inside quoted string can contain commas , multiple words -unquoted field can multiple worlds -field can empty having 2 commas in row
any clues on writing regex expression split line properly?
thanks!
as many have observed, csv harder format first appears. there many edge cases , ambiguities. example ambiguity, in example, ',,,' field comma or 2 blank fields?
perl, python, java, etc better equipped deal csv because have tested libraries same. regex more fragile.
with awk, have had success this awk function. works under awk, gawk , nawk.
#!/usr/bin/awk -f #************************************************************************** # # file in public domain. # # more information email lorancestinson+csv@gmail.com. # or see http://lorance.freeshell.org/csv/ # # parse csv string array. # number of fields found returned. # in event of error negative value returned , csverr set # error. see below error values. # # parameters: # string = string parse. # csv = array parse fields into. # sep = field separator character. , # quote = string quote character. " # escape = quote escape character. " # newline = handle embedded newlines. provide either newline or # string use in place of newline. if left empty embedded # newlines cause error. # trim = when true spaces around separator removed. # affects parsing. without space between # separator , quote result in quote being ignored. # # these variables private: # fields = number of fields found far. # pos = pull field string. # strtrim = true when string found know remove quotes. # # error conditions: # -1 = unable read next line. # -2 = missing end quote. # -3 = missing separator. # # notes: # code assumes every field preceded separator, # first field. makes logic simpler, requires # separator prepended string before parsing. #************************************************************************** function parse_csv(string,csv,sep,quote,escape,newline,trim, fields,pos,strtrim) { # make sure there parse. if (length(string) == 0) return 0; string = sep string; # code below assumes ,field. fields = 0; # number of fields found far. while (length(string) > 0) { # remove spaces after separator if requested. if (trim && substr(string, 2, 1) == " ") { if (length(string) == 1) return fields; string = substr(string, 2); continue; } strtrim = 0; # used trim quotes off strings. # handle quoted field. if (substr(string, 2, 1) == quote) { pos = 2; { pos++ if (pos != length(string) && substr(string, pos, 1) == escape && (substr(string, pos + 1, 1) == quote || substr(string, pos + 1, 1) == escape)) { # remove escaped quote characters. string = substr(string, 1, pos - 1) substr(string, pos + 1); } else if (substr(string, pos, 1) == quote) { # found end of string. strtrim = 1; } else if (newline && pos >= length(string)) { # handle embedded newlines if requested. if (getline == -1) { csverr = "unable read next line."; return -1; } string = string newline $0; } } while (pos < length(string) && strtrim == 0) if (strtrim == 0) { csverr = "missing end quote."; return -2; } } else { # handle empty field. if (length(string) == 1 || substr(string, 2, 1) == sep) { csv[fields] = ""; fields++; if (length(string) == 1) return fields; string = substr(string, 2); continue; } # search separator. pos = index(substr(string, 2), sep); # if there no separator rest of string field. if (pos == 0) { csv[fields] = substr(string, 2); fields++; return fields; } } # remove spaces after separator if requested. if (trim && pos != length(string) && substr(string, pos + strtrim, 1) == " ") { trim = strtrim # count number fo spaces found. while (pos < length(string) && substr(string, pos + trim, 1) == " ") { trim++ } # remove them string. string = substr(string, 1, pos + strtrim - 1) substr(string, pos + trim); # adjust pos trimmed spaces if quotes string not found. if (!strtrim) { pos -= trim; } } # make sure @ end of string or there separator. if ((pos != length(string) && substr(string, pos + 1, 1) != sep)) { csverr = "missing separator."; return -3; } # gather field. csv[fields] = substr(string, 2 + strtrim, pos - (1 + strtrim * 2)); fields++; # remove field string next pass. string = substr(string, pos + 1); } return fields; } { num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1); if (num_fields < 0) { printf "error: %s (%d) -> %s\n", csverr, num_fields, $0; } else { printf "%s -> \n", $0; printf "%s fields\n", num_fields; (i = 0;i < num_fields;i++) { printf "%s\n", csv[i]; } printf "|\n"; } }
running on example data produces:
"hello, world?",1 thousand,"oneword",,,"last one" -> 6 fields hello, world? 1 thousand oneword last 1 |
an example perl solution:
$ echo '"hello, world?",1 thousand,"oneword",,,"last one"' | perl -lne 'for(/(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g) { s/"$//; s/""/"/g if (s/^"//); say}'
Comments
Post a Comment