Is R being replaced by Python at quant desks?


69

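I know the title sounds a bit extreme, but I wonder whether R is being phased out in favor of Python by a lot of quant desks at sell-side banks and hedge funds. My impression is that, with the improvements in Pandas, Numpy and other Python packages, Python's ability to meaningfully mine data and model time series has improved drastically. I have also seen quite impressive Python implementations that parallelize code and fan out computations to several servers/machines. I know some packages in R are capable of that too, but I sense that the current momentum favors Python.

I need to decide on the architecture of a subset of my own modeling framework and would like some input on what the current sentiment is among other quants.

I also have to admit that my initial reservations about Python's performance are mostly outdated, because some of the packages make heavy use of C implementations under the hood, and I have seen implementations that clearly outperform even efficiently written code in compiled OOP languages.

Can you please comment on what you are using? I am not asking whether you think one is better or worse for the tasks below, but specifically why you use R or Python, and whether you even place them in the same category for accomplishing, among others, the following tasks: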

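  • acquire, store, maintain, read, clean time series
  • perform basic statistics on time series, advanced statistical models such as multivariate regression analyses, ...
  • perform mathematical computations (Fourier transforms, PDE solvers, PCA, etc.)
  • visualization of data (static and dynamic)
  • pricing derivatives (application of pricing models such as interest rate models)
  • interconnectivity (with Excel, servers, UI, etc.)
  • (Added Jan 2016): ability to design, implement, and train deep learning networks.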

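Edit: I thought the following link might add some value, though it is slightly dated [2013] (for some obscure reason that discussion was also closed...): https://softwareengineering.stackexchange.com/questions/181342/r-vs-python-for-data-analysis

You can also search the r-bloggers website for posts that compare the computational efficiency of R and Python packages. As some of the answers point out, one aspect is data pruning, that is, the preparation and setup of input data. The other part of the equation is computational efficiency when actually performing statistical and mathematical computations.

Update (Jan 2016)

I wanted to update this question now that AI/deep learning networks are being very actively pursued at banks and hedge funds. I have spent a good amount of time delving into deep learning, running experiments and working with libraries such as Theano, Torch, and Caffe. What stood out from my own work and from conversations with others is that a lot of these libraries are used via Python, and that most researchers in this space do not use R for this particular field. This still constitutes only a small part of the quant work performed in financial services, but I wanted to point it out because it directly touches on the question I asked. I added this aspect of quant research to reflect current trends.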

25

This is interesting because I see another trend: Matlab is being replaced by R, but I guess this is another story.

I use R for my academic work (I also teach this material) as well as my consulting work (I mainly work in the $\mathbb{P}$ area, with some excursions into $\mathbb{Q}$). I tried Python but it didn't work for me. I think the main reasons I will stick with R are:

  • especially in statistics and analytics, there is a huge number of high-quality packages, sometimes implementing very recent methods, which is unrivalled by any other language out there
  • for me R has the right mixture of low-level capabilities, e.g. for (re-)organizing data, and high-level commands (e.g. even k-means in the core package)
  • the speed is OK for me because I am not working in HFT, and there are many ways to speed up code (vectorization, parallelization, good connectivity with C, and so forth)
  • the community is really very much into the kind of stuff I am interested in, whereas with Python it is really everybody and his dog doing all kinds of stuff I am not interested in... I guess this is also about the mindset of how to approach some problems, I don't know.

I think in general one should focus: I wouldn't try to build a webpage or a game with R but when it comes to statistics and analytics I think Python is no real competitor and I would strongly recommend R as your future setup.

Edit
I also wrote a blog post with additional points about why R is better suited for data science than Python: http://blog.ephorie.de/why-r-for-data-science-and-not-python


23

I've used both R and Python with Pandas in professional quantitative finance work, on both large and small scale projects. I would strongly recommend Python with Pandas over R for most new projects in the field, especially in time series analysis.

While I don't dispute vonjd in that you will find more libraries in R with algorithms on the bleeding edge of statistical research, the libraries in Python are very robust and fleshed out in that area. Also, I find in my work and the work of my colleagues that we are grabbing libraries from electrical engineering, computer vision, big data and more. People in these fields mostly have libraries in Python, not R.

However, the main advantage of Python over R in this field is workflow. The workflow with R tended to be that you used Perl/Python for data cleaning, preparation, and database work, because R was too slow and awkward for large, complicated datasets (though this is getting better). You then built the statistical model in R, taking advantage of its libraries. Afterwards, the R model was rewritten in C for speed, control, interfacing, parallelization and error handling in production.

Python can handle this full workflow from start to finish. All the inter-connectivity steps surrounding the main research projects are much more robust, and a lot of development time is saved by using the same language throughout. Also, with Pandas, even the core research portion and data handling are now easier and cleaner, in my opinion.

In general, if you are focusing only on advanced statistics/data-mining time series research, then R and Python with Pandas are interchangeable, at least for now. However, it sounds from your question like you are also worried about inter-connectivity and architecture, and for that Python is far superior.

Edit for 2018: It's amazing how much easier it is to get into data munging in Python these days compared to when I first wrote this. For those who would like to check out Python/Pandas without any fuss, try Anaconda.


13

For data analysis, particularly for large data analysis projects, most of the top quant hedge funds and a lot of the banks are using Python (over R) for a couple of reasons, but many still keep bits and pieces of R for specific packages or functions (I work at a bank and interface with quite a few quant hedge funds on data analysis):

  1. Python 2 used to have a lot of backward compatibility issues, but Python 3 is more stable between versions. Pandas versions since 0.13 have also been very stable between versions. No one wants to use a language in which they will have to revisit and rewrite significant amounts of code sometime in the future.

  2. People needed the same code to run on both Linux and Windows. Installing and compiling packages in Python can be a super pain, whether on Linux or Windows. A lot of people did not want to start any new project in Python 2, since sometime in the future they would need to move to Python 3, so they stuck with R for quite a while. Also, for a while, Python 3 was available only with the WinPython distro, and WinPython used to work only on Windows. Anaconda, which is the leading Python distro for Linux (& Mac), came out with Python 3 support sometime in 2014, which then caused a huge migration.

Advantages of Python (vs R):

(i) Raw speed is the biggest motive (allowing you to do way more statistical data analysis in the same time)

(ii) Pandas can read csv files very fast (one of the reasons why many folks moved from Matlab to R at some point)

(iii) Cython is more flexible than Rcpp (at least in my experience)

(iv) You can organize code files neatly into logical directories and classes within files (classes in R are an afterthought), and the project looks much better

(v) As of 2015, PyCharm is a significantly better IDE than RStudio (although RStudio is better than Spyder). Tools matter

Disadvantages of Python (vs R):

(i) The big issue with Pandas used to be that it didn't have its own binary data format. R's RData format is a huge edge. PyData's HDF5-based storage is not easily compressible and throws a lot of errors every now and then, and for big data it was a hindrance. Pickle and other formats just didn't cut it. After years of Python-vs-R exploration, most ended up writing their own custom binary data format (to store Pandas data frames) or using significant modifications of PostgreSQL for big data storage.

Statistical packages are generally great with both languages.

I have projects in R that took 4 hours to run every day (overnight). Now, in Python, they take a total of 20 minutes, which is roughly a 12x speedup (with much less use of Cython code than of Rcpp code in R). That's the speed difference for you.

To answer your question:

  • acquire, store, maintain, read, clean time series: Python is better

  • perform basic statistics on time series, advanced statistical models such as multivariate regression analyses, etc.: both Python and R

  • performing mathematical computations (Fourier transforms, PDE solver, PCA): both Python and R

  • visualization of data (static and dynamic): both Python and R

  • pricing derivatives (application of pricing models such as interest rate models): both Python and R

  • interconnectivity (with Excel, servers, UI): Python is better


6

For the tasks listed, both Python and R perform very well. There are some packages in Python not in R and vice versa. My solution for this is to simply call R from Python. This allows for the best of both worlds.

It is also important to note I do not write any R code other than calling an R library from Python.

Note also that going the other way, calling Python from R, does not work equally well across all major OSes.


47

My deal is HFT, so what I care about is:

  1. read/load data from file or DB into memory quickly
  2. perform very efficient data-munging operations (group, transform)
  3. visualize the data easily

I think it is pretty clear that 3. goes to R: graphics, ggplot2 and others allow you to plot anything from scratch with little effort.

About 1. and 2., I am amazed, reading the previous posts, to see that people are advocating Python based on pandas and that no one cites data.table. data.table is a fantastic package that allows blazing fast grouping/transforming of tables with tens of millions of rows. From this benchmark you can see that data.table is multiple times faster than pandas and much more stable (pandas tends to crash on massive tables).

Example

R) library(data.table)
R) DT = data.table(x=rnorm(2e7),y=rnorm(2e7),z=sample(letters,2e7,replace=T))
R) tables()
     NAME       NROW NCOL  MB COLS  KEY
[1,] DT   20,000,000    3 458 x,y,z    
Total: 458MB
R) system.time(DT[,.(sum(x),mean(y)),.(z)])
   user  system elapsed 
  0.226   0.037   0.264 

R) setkey(DT,z)
R) system.time(DT[,.(sum(x),mean(y)),.(z)])
  user  system elapsed 
  0.118   0.022   0.140 
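
For readers who want to try the same comparison on the pandas side, a rough equivalent of the grouped aggregation above might look like the sketch below. It assumes NumPy and pandas 0.25 or later (for named aggregation). The column names mirror the R example, while the function name `grouped_summary`, the output column names and the reduced row count are illustrative choices, not part of the original answer. No timing result is claimed here; run it on your own machine to compare.

```python
import time

import numpy as np
import pandas as pd


def grouped_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Sum of x and mean of y per group z, mirroring DT[,.(sum(x),mean(y)),.(z)]."""
    return df.groupby("z").agg(sum_x=("x", "sum"), mean_y=("y", "mean"))


rng = np.random.default_rng(0)
n = 200_000  # assumption: far fewer rows than the 2e7 used in the R example
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))

df = pd.DataFrame(
    {
        "x": rng.standard_normal(n),
        "y": rng.standard_normal(n),
        "z": rng.choice(letters, size=n),
    }
)

start = time.perf_counter()
result = grouped_summary(df)
elapsed = time.perf_counter() - start

print(result.head())
print(f"elapsed: {elapsed:.3f}s")
```

Raising `n` to `20_000_000` reproduces the table size of the R example, memory permitting.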

Then there is speed: as I work in HFT, neither R nor Python can be used in production. But the Rcpp package allows you to write efficient C++ code and integrate it into R trivially (literally adding 2 lines). I doubt R is fading, given the number of new packages created every day and the momentum the language has...

EDIT 2018-07

A few years later I am amazed by how the R ecosystem has evolved. For in-memory computation you get unmatched tools, from fst for blazing fast binary read/write to fork or cluster parallelism in one-liners. C++ integration is incredibly easy with Rcpp. You get interactive graphics with classics like plotly, and crazy features like ggplotly (which just makes your ggplot2 plot interactive). Having tried Python with pandas, I honestly do not understand how there could even be a match. The syntax is clunky and performance is poor; I must be too used to R, I guess. Another thing that is really missing in Python is literate programming: nothing comes close to rmarkdown (the best I could find in Python was Jupyter, but that does not even come close). With all the fuss surrounding the R vs Python language war, I realize that the vast majority of people are simply uninformed: they do not know what data.table is or that it has nothing to do with a data.frame, and they do not know that R fully supports tensorflow and keras.... To conclude, I think both tools can do everything, and it seems that the Python language has very good PR...


3

The major advantage of Python (w/ pandas) over R is that Python supports OOP (object-oriented programming). It makes sense to organize a large code base using a hierarchy of classes. Python also supports the notion of polymorphism so that we can use well-known design patterns (e.g., Strategy, Observer, etc.) in our code.
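
As a minimal sketch of what this can look like, the snippet below applies the Strategy pattern to position sizing. All class and method names (`PositionSizer`, `FixedSizer`, `ProportionalSizer`, `Trader`) are hypothetical and chosen only for illustration; the point is that `Trader` depends on the abstract interface, so sizing rules can be swapped without touching it.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class PositionSizer(ABC):
    """Strategy interface: maps a trading signal to a position size."""

    @abstractmethod
    def size(self, signal: float) -> float:
        ...


class FixedSizer(PositionSizer):
    """Trades a fixed number of units in the direction of the signal."""

    def __init__(self, units: float) -> None:
        self.units = units

    def size(self, signal: float) -> float:
        if signal > 0:
            return self.units
        if signal < 0:
            return -self.units
        return 0.0


class ProportionalSizer(PositionSizer):
    """Scales the position linearly with the signal strength."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def size(self, signal: float) -> float:
        return self.scale * signal


class Trader:
    """Context object: delegates sizing to whichever strategy it was given."""

    def __init__(self, sizer: PositionSizer) -> None:
        self.sizer = sizer

    def positions(self, signals: Sequence[float]) -> List[float]:
        return [self.sizer.size(s) for s in signals]


signals = [0.5, -2.0, 0.0]
fixed = Trader(FixedSizer(100.0)).positions(signals)
proportional = Trader(ProportionalSizer(10.0)).positions(signals)

print(fixed)         # [100.0, -100.0, 0.0]
print(proportional)  # [5.0, -20.0, 0.0]
```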


28

Instead of wild guesses about R's/Python's future in the community, here are some facts:

The following query on StackExchange Data Explorer counts the number of questions that have <r> or <python> tags. If you scroll down on one of the three webpages provided below, you can see a graph with data on a monthly basis. You can easily run this query on databases for other sites as well (just go to "Switch sites" right below the query).

stats http://data.stackexchange.com/stats/query/350129/r-versus-python-tags#graph

stack http://data.stackexchange.com/stackoverflow/query/350129/r-versus-python-tags#graph

quant http://data.stackexchange.com/quant/query/350129/r-versus-python-tags#graph

The results:

  • In absolute terms, R has more hits for both stats.stackexchange.com and quant.stackexchange.com (the latter having very few data points). Python has more hits for stackoverflow.com.

  • In relative terms, the gap between R and python is closing for stackoverflow.com (ratio approx 1 to 3 at the moment). The ratio between R and python tags on stats.stackexchange.com is more or less stable since mid/end 2013 (roughly a factor 10 or a little above).

I really do think that the tag statistics in the stackexchange universe are a good indicator of the current interest in a particular programming language - probably even more so for its future popularity.

All-in-all, I am confident that the present data makes a strong case against Matt Wolf's hypothesis that "R might be obsolete in 3-4 years". ;)


Update: So now it's been 6 months since my initial answer. We still have to wait another 2.5-3.5 years to definitely see whether R has become obsolete. :) In the meantime, a quick addition due to Matt Wolf's comment. Here are variations of the above queries that give you the tag ratios (that's what I have been referring to in the second point of my answer). All ratios are python tags divided by R tags.
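
The quotient itself is just the monthly Python tag count divided by the monthly R tag count. The sketch below spells that out; the counts in it are made up for illustration only and are not taken from the Data Explorer queries, and the function name `tag_ratio` is hypothetical.

```python
def tag_ratio(python_count: int, r_count: int) -> float:
    """Py/R quotient: number of python-tagged questions per r-tagged question."""
    if r_count <= 0:
        raise ValueError("need at least one R-tagged question to form the ratio")
    return python_count / r_count


# Assumption: illustrative (python_count, r_count) pairs, NOT real query output.
monthly_counts = {
    "month 1": (28, 400),
    "month 2": (38, 400),
}

ratios = {month: tag_ratio(py, r) for month, (py, r) in monthly_counts.items()}

for month, ratio in ratios.items():
    print(f"{month}: Py/R = {ratio:.3f}")
```

Read this way, a Py/R ratio of about 3 on Stack Overflow is the same statement as an R-to-Python ratio of roughly 1 to 3.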

stats

http://data.stackexchange.com/stats/query/421036/r-versus-python-tags-quotient-py-r#graph

I do not see a clear trend here. The Py/R ratio is around 0.07 (there was a spike to 0.095 in November though). Since mid 2013, the ratio varies between 0.04 and 0.11. So I would call it relatively stable.

SO

http://data.stackexchange.com/stackoverflow/query/421032/r-versus-python-tags-quotient-py-r#graph

There was indeed a short term trend in favor of Python since Jul 15 (Py/R ratio went from 3.1 to 3.5). So the statement that "R is closing the gap wrt the Py/R ratio" could be called obsolete at the moment.

quant

http://data.stackexchange.com/quant/query/421042/r-versus-python-tags-quotient-py-r#graph

Still very noisy. Python did seem to catch up a little bit the last few months. But hard to tell with that little data.


6

Also in the high frequency / medium frequency field here.

I received a "mixed" consensus regarding the use of R and its prevalence in the field (specifically HFT). Speaking with someone who works in the equity option industry at a relatively small proprietary firm in San Francisco, I was told, "R is a legacy language".

However, speaking with someone who formerly led an HFT team at Goldman Sachs, I was told it is still the best language for time series analysis, statistics and especially latency-sensitive projects. For libraries, the following were mentioned:

  1. Quantmod
  2. Caret
  3. Zoo
  4. XTS
  5. highfrequency (tools for high frequency data analysis)
  6. The popular open source QuantLib library also has an R version, which can be found here.

And to reiterate what other answers to this question have said: given how heavily the HFT field depends on speed, R cannot be integrated into production HFT systems. However, the Rcpp package (R/C++ integration) is a popular tool that makes integration with an HFT system both practical and easy.

I would not say R is dying, but it also does not have a monopoly for data analysis in the field of quantitative finance in general. Python and matlab are of great use in this field as well (I seem to be a minority in my use of matlab but it is great).