R as a general-purpose programming language
I have liked Python because it has rich built-in types: sets, dicts, lists, tuples. These structures make it easy to write short scripts that process data.
On the other side, R, like MATLAB, has scalar, vector, data frame, array, and list data types. It lacks sets, dicts, tuples, etc. I know the list type is powerful, and a lot of operations can be thought of as list processing, but the idea of using R as a general-purpose language is still vague to me.
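(The closest thing I have found in base R is an environment used as a hash table; a rough sketch of the kind of structure I mean:)

# base R: an environment with hashing behaves roughly like a Python dict
d <- new.env(hash = TRUE)
assign("apple", 1, envir = d)
assign("banana", 2, envir = d)
get("apple", envir = d)                        # [1] 1
exists("cherry", envir = d, inherits = FALSE)  # [1] FALSE
ls(d)                                          # the keys: "apple" "banana"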
(The following is just an example; I don't mean to focus on text processing/mining.)
For example, suppose I need to compute TF-IDF counts over a set of news articles (say 200,000 articles in a folder and its subfolders).
After reading the files, I need to do word-to-ID mapping and other counting tasks. These tasks involve string manipulation and need containers such as set or map.
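To make the counting part concrete, here is a rough sketch of the word-to-ID idea in base R, with a toy vector standing in for the tokenized articles:

# toy word-to-ID mapping: each distinct word gets an integer ID
words <- c("news", "market", "news", "election", "market", "news")
vocab <- unique(words)        # "news" "market" "election"
ids   <- match(words, vocab)  # 1 2 1 3 2 1
tf    <- table(words)         # term frequencies: election 1, market 2, news 3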
I know I could use another language for this processing and then load the data into R. But maybe (for small things) putting all the preprocessing into a single R script would be better.
So my question is: does R have enough capability for this kind of rich data structure at the language level? Or, if not, do packages provide an extension to the R language?
I think R's data pre-processing capability (i.e., extracting data from a source before the analytics steps) has improved substantially in the past three years (the length of time I have been using R). I use Python daily, and have for the past seven years or so; its text-processing capabilities are superb. Even so, I wouldn't hesitate for a moment to use R for the type of task you mention.
A couple of provisos, though. First, I would suggest looking closely at a couple of external packages for the set of tasks in your question: in particular, hash (a Python-like key-value data structure) and stringr (a set of wrappers over the less user-friendly string-manipulation functions in the base library).
Both stringr and hash are available on CRAN.
> library(hash)
> dx = hash(k1=453, k2=67, k3=913)
> dx$k1
[1] 453
> dx = hash(keys=letters[1:5], values=1:5)
> dx
<hash> containing 5 key-value pair(s).
  a : 1
  b : 2
  c : 3
  d : 4
  e : 5
> dx["a"]
<hash> containing 1 key-value pair(s).
  a : 1

> library(stringr)
> astring = 'onetwothree456seveneight'
> ptn = '[0-9]{3,}'
> a = str_extract_all(astring, ptn)
> a
[[1]]
[1] "456"
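To connect these two packages to the counting tasks in your question, here is a minimal sketch of my own (not from either package's documentation) that tokenizes a string with stringr and accumulates term frequencies in a hash:

library(hash)
library(stringr)

txt   <- "the cat sat on the mat the end"
words <- str_split(txt, " ")[[1]]   # tokenize on spaces

counts <- hash()
for (w in words) {
  # initialize a key on first sight, otherwise increment it
  counts[[w]] <- if (has.key(w, counts)) counts[[w]] + 1 else 1
}
counts[["the"]]   # [1] 3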
It seems there is a large subset of R users for whom text processing and text analytics comprise a significant portion of their day-to-day work, as evidenced by CRAN's Natural Language Processing Task View (one of about 20 such informal, domain-oriented package collections). Within that Task View is the package tm, a package dedicated to text-mining functions. Included in tm are optimized functions for processing tasks such as the one mentioned in your question.
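For instance, building a TF-IDF-weighted document-term matrix over a directory tree of plain-text articles takes only a few lines with tm; a minimal sketch, with 'articles/' as a placeholder path:

library(tm)

# build a corpus from every file under articles/ and its subfolders
corp <- VCorpus(DirSource("articles/", recursive = TRUE))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# document-term matrix with TF-IDF weighting
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
inspect(dtm)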
In addition, R has an excellent selection of packages for working interactively on reasonably large datasets (e.g., > 1 GB) without the need to set up a parallel-processing infrastructure (though they can exploit a cluster if one is available). The most impressive of these, in my opinion, is the set of packages under the rubric "The Bigmemory Project" (on CRAN) by Michael Kane and John Emerson at Yale; this project subsumes bigmemory, biganalytics, synchronicity, bigtabulate, and bigalgebra. In sum, the techniques behind these packages include: (i) allocating the data to shared memory, which enables coordination of shared access by separate concurrent processes to a single copy of the data; and (ii) file-backed data structures (which I believe, though I am not certain, are synonymous with memory-mapped file structures, and which work by enabling fast access from disk using pointers, avoiding the RAM limit on available file size).
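To illustrate the file-backed idea, here is a minimal sketch (the file names are placeholders); the matrix lives on disk, and a second R process can attach to the same data through the descriptor file:

library(bigmemory)

# create a file-backed matrix; the data live on disk, not in RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile    = "bigmat.bin",
                           descriptorfile = "bigmat.desc")
x[1, ] <- rnorm(10)   # read/write with ordinary matrix syntax

# a separate process (or a later session) can attach to the same data
y <- attach.big.matrix("bigmat.desc")
y[1, 1:3]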
Still, quite a few functions and data structures in R's standard library make it easier to work interactively with data approaching ordinary RAM limits. For instance, .RData, the native binary format, is about as simple as possible to use (the commands are save and load) and has excellent compression:
> library(ElemStatLearn)
> data(spam)
> format(object.size(spam), big.mark=',')
[1] "2,344,384"    # a 2.34 MB object
> save(spam, file='test.RData')
This file, 'test.RData', is about 176 KB, greater than 10-fold compression.
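The round trip back is just as simple; load restores the object under its original name:

> rm(spam)                                 # drop the in-memory copy
> load('test.RData')                       # restores the object 'spam'
> format(object.size(spam), big.mark=',')  # same size as before the save
[1] "2,344,384"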