R as a general-purpose programming language
I have liked Python because it has rich built-in types: sets, dicts, lists, tuples. These structures make it easy to write short scripts that process data.
On the other side, R, like MATLAB, has scalar, vector, data frame, array, and list data types. It lacks sets, dicts, tuples, etc. I know the list type is powerful, and a lot of operations can be thought of as list processing, but the idea of using R as a general-purpose language is still vague to me.
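(The closest thing I have found in base R is an environment used as a hash table; a rough sketch of the kind of structure I mean:)

# base R: an environment with hashing behaves roughly like a Python dict
d <- new.env(hash = TRUE)
assign("apple", 1, envir = d)
assign("banana", 2, envir = d)
get("apple", envir = d)                        # [1] 1
exists("cherry", envir = d, inherits = FALSE)  # [1] FALSE
ls(d)                                          # the keys: "apple" "banana"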
(The following is just an example; I don't mean to focus on text processing/mining.)
For example, suppose I need to compute TF-IDF counts over a set of news articles (say 200,000 articles in a folder and its subfolders).
After reading the files, I need to do word-to-ID mapping and other counting tasks. These tasks involve string manipulation and need containers such as set or map.
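To make the counting part concrete, here is a rough sketch of the word-to-ID idea in base R, with a toy vector standing in for the tokenized articles:

# toy word-to-ID mapping: each distinct word gets an integer ID
words <- c("news", "market", "news", "election", "market", "news")
vocab <- unique(words)        # "news" "market" "election"
ids   <- match(words, vocab)  # 1 2 1 3 2 1
tf    <- table(words)         # term frequencies: election 1, market 2, news 3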
I know I could use another language for this processing and then load the data into R. But maybe (for small things) putting all the preprocessing into a single R script would be better.
So my question is: does R have enough capability for this kind of rich data structure at the language level? Or, if not, do packages provide an extension to the R language?
I think R's data pre-processing capability (i.e., extracting data from a source before the analytics steps) has improved substantially in the past three years (the length of time I have been using R). I use Python daily, and have for the past seven years or so; its text-processing capabilities are superb. Even so, I wouldn't hesitate for a moment to use R for the type of task you mention.
A couple of provisos, though. First, I would suggest looking closely at a couple of external packages for the set of tasks in your question: in particular, hash (a Python-like key-value data structure) and stringr (a set of wrappers over the less user-friendly string-manipulation functions in the base library).
Both stringr and hash are available on CRAN.
> library(hash)
> dx = hash(k1=453, k2=67, k3=913)
> dx$k1
[1] 453
> dx = hash(keys=letters[1:5], values=1:5)
> dx
<hash> containing 5 key-value pair(s).
  a : 1
  b : 2
  c : 3
  d : 4
  e : 5
> dx["a"]
<hash> containing 1 key-value pair(s).
  a : 1

> library(stringr)
> astring = 'onetwothree456seveneight'
> ptn = '[0-9]{3,}'
> a = str_extract_all(astring, ptn)
> a
[[1]]
[1] "456"
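To connect these two packages to the counting tasks in your question, here is a minimal sketch of my own (not from either package's documentation) that tokenizes a string with stringr and accumulates term frequencies in a hash:

library(hash)
library(stringr)

txt   <- "the cat sat on the mat the end"
words <- str_split(txt, " ")[[1]]   # tokenize on spaces

counts <- hash()
for (w in words) {
  # initialize a key on first sight, otherwise increment it
  counts[[w]] <- if (has.key(w, counts)) counts[[w]] + 1 else 1
}
counts[["the"]]   # [1] 3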
It seems there is a large subset of R users for whom text processing and text analytics comprise a significant portion of their day-to-day work, as evidenced by CRAN's Natural Language Processing Task View (one of about 20 such informal, domain-oriented package collections). Within that Task View is the package tm, a package dedicated to text-mining functions. Included in tm are optimized functions for processing tasks such as the one mentioned in your question.
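For instance, building a TF-IDF-weighted document-term matrix over a directory tree of plain-text articles takes only a few lines with tm; a minimal sketch, with 'articles/' as a placeholder path:

library(tm)

# build a corpus from every file under articles/ and its subfolders
corp <- VCorpus(DirSource("articles/", recursive = TRUE))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# document-term matrix with TF-IDF weighting
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
inspect(dtm)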
In addition, R has an excellent selection of packages for working interactively on reasonably large datasets (e.g., > 1 GB) without the need to set up a parallel-processing infrastructure (though they can exploit a cluster if one is available). The most impressive of these, in my opinion, is the set of packages under the rubric "The Bigmemory Project" (on CRAN) by Michael Kane and John Emerson at Yale; this project subsumes bigmemory, biganalytics, synchronicity, bigtabulate, and bigalgebra. In sum, the techniques behind these packages include: (i) allocating the data to shared memory, which enables coordination of shared access by separate concurrent processes to a single copy of the data; and (ii) file-backed data structures (which I believe, though I am not certain, are synonymous with memory-mapped file structures, and which work by enabling fast access from disk using pointers, avoiding the RAM limit on available file size).
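To illustrate the file-backed idea, here is a minimal sketch (the file names are placeholders); the matrix lives on disk, and a second R process can attach to the same data through the descriptor file:

library(bigmemory)

# create a file-backed matrix; the data live on disk, not in RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile    = "bigmat.bin",
                           descriptorfile = "bigmat.desc")
x[1, ] <- rnorm(10)   # read/write with ordinary matrix syntax

# a separate process (or a later session) can attach to the same data
y <- attach.big.matrix("bigmat.desc")
y[1, 1:3]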
Still, quite a few functions and data structures in R's standard library make it easier to work interactively with data approaching ordinary RAM limits. For instance, .RData, the native binary format, is about as simple as possible to use (the commands are save and load) and has excellent compression:
> library(ElemStatLearn)
> data(spam)
> format(object.size(spam), big.mark=',')
[1] "2,344,384"    # a 2.34 MB object
> save(spam, file='test.RData')
This file, 'test.RData', is about 176 KB, greater than 10-fold compression.
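The round trip back is just as simple; load restores the object under its original name:

> rm(spam)                                 # drop the in-memory copy
> load('test.RData')                       # restores the object 'spam'
> format(object.size(spam), big.mark=',')  # same size as before the save
[1] "2,344,384"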