Python as a statistics workbench


363

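Lots of people use a main tool such as Excel or another spreadsheet, SPSS, Stata, or R for their statistics needs. They might turn to a specific package for very special needs, but a lot can be done with a simple spreadsheet, a general statistics package, or a statistics programming environment.

I have always liked Python as a programming language, and for simple needs it is easy to write a short program that calculates what I need. Matplotlib allows me to plot it.

Has anyone switched completely from, say, R to Python? R (or any other statistics package) has a lot of functionality specific to statistics, and it has data structures that let you think about the statistics you want to perform rather than the internal representation of your data. Python (or some other dynamic language) has the benefit of letting me program in a familiar, high-level language, and it lets me interact programmatically with the real-world systems in which the data resides or from which I can take measurements. But I have not found any Python package that lets me express things in "statistical terminology", from simple descriptive statistics to more complicated multivariate methods.

What can you recommend if I wanted to use Python as a "statistics workbench" to replace R, SPSS, etc.?

What would I gain and lose, based on your experience?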

40

The following StackOverflow discussions might be useful


61

I don't think there's any argument that the range of statistical packages in CRAN and Bioconductor far exceeds anything on offer from other languages. However, that isn't the only thing to consider.

In my research, I use R when I can, but sometimes R is just too slow, for example for a large MCMC run.

Recently, I combined Python and C to tackle this problem. Brief summary: fitting a large stochastic population model with ~60 parameters and inferring around 150 latent states using MCMC.

  1. Read in the data in Python.
  2. Construct the C data structures in Python using ctypes.
  3. Using a Python for loop, call the C functions that update the parameters and calculate the likelihood.

A quick calculation showed that the programme spent 95% of its time in C functions. However, I didn't have to write painful C code to read in data or construct C data structures.
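To make the three steps above concrete, here is a minimal sketch of the same pattern. It is not the model described in this answer: it assumes a POSIX system where the C math library can be loaded, and it uses that library's `log` and `lgamma` as stand-ins for hand-written C functions. The Poisson model, the counts, and the candidate rates are all hypothetical.

```python
import ctypes
import ctypes.util
import math

# Assumption: a POSIX system where the C math library can be located and loaded.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signatures: double log(double) and double lgamma(double).
libm.log.argtypes = [ctypes.c_double]
libm.log.restype = ctypes.c_double
libm.lgamma.argtypes = [ctypes.c_double]
libm.lgamma.restype = ctypes.c_double

# Step 1: "read in" the data in Python (hypothetical counts, hard-coded here).
counts = [2, 0, 3, 1, 4]

# Step 2: construct a C data structure (an array of doubles) from Python.
CountArray = ctypes.c_double * len(counts)
c_counts = CountArray(*counts)


def poisson_loglik(rate):
    """Poisson log-likelihood; the numerical work is done by C functions."""
    total = 0.0
    for k in c_counts:
        total += k * libm.log(rate) - rate - libm.lgamma(k + 1.0)
    return total


# Step 3: a plain Python loop drives the C calls.
candidate_rates = [1.0, 2.0, 3.0]
logliks = [poisson_loglik(rate) for rate in candidate_rates]
best_rate = candidate_rates[logliks.index(max(logliks))]
print(best_rate)
```

In a real MCMC run the C side would be your own compiled shared library, and the Python loop would only propose updates and record the results.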


I know there's also rpy, where Python can call R functions. This can be useful, but if you're "just" doing statistics then I would use R.


29

One benefit of moving to Python is the possibility to do more work in one language. Python is a reasonable choice for number crunching, writing web sites, administrative scripting, etc. So if you do your statistics in Python, you wouldn't have to switch languages to do other programming tasks.

Update: On January 26, 2011 Microsoft Research announced Sho, a new Python-based environment for data analysis. I haven't had a chance to try it yet, but it sounds like an interesting possibility if you want to run Python and also interact with .NET libraries.


312

It's hard to ignore the wealth of statistical packages available in R/CRAN. That said, I spend a lot of time in Python land and would never dissuade anyone from having as much fun as I do. :) Here are some libraries/links you might find useful for statistical work.

  • NumPy/Scipy You probably know about these already. But let me point out the Cookbook where you can read about many statistical facilities already available and the Example List which is a great reference for functions (including data manipulation and other operations). Another handy reference is John Cook's Distributions in Scipy.

  • pandas This is a really nice library for working with statistical data -- tabular data, time series, panel data. Includes many built-in functions for data summaries, grouping/aggregation, pivoting. Also has a statistics/econometrics library.

  • larry Labeled array that plays nice with NumPy. Provides statistical functions not present in NumPy and good for data manipulation.

  • python-statlib A fairly recent effort which combined a number of scattered statistics libraries. Useful for basic and descriptive statistics if you're not using NumPy or pandas.

  • statsmodels Statistical modeling: Linear models, GLMs, among others.

  • scikits Statistical and scientific computing packages -- notably smoothing, optimization and machine learning.

  • PyMC For your Bayesian/MCMC/hierarchical modeling needs. Highly recommended.

  • PyMix Mixture models.

  • Biopython Useful for loading your biological data into Python, and provides some rudimentary statistical/machine learning tools for analysis.

If speed becomes a problem, consider Theano -- used with good success by the deep learning people.

There's plenty of other stuff out there, but this is what I find the most useful along the lines you mentioned.
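For the basic descriptive statistics mentioned in the list, here is a small sketch that needs no third-party package at all. It uses the standard library's `statistics` module (available since Python 3.4, with `quantiles` added in 3.8), which none of the libraries above provide; the sample values are made up for illustration.

```python
import statistics

# Hypothetical sample: reaction times in milliseconds.
sample = [312, 287, 305, 298, 340, 276, 301, 295]

summary = {
    "n": len(sample),
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "stdev": statistics.stdev(sample),  # sample standard deviation (n - 1)
    "quartiles": statistics.quantiles(sample, n=4),  # three cut points
}

for name, value in summary.items():
    print(name, value)
```

For anything beyond summaries of a single list of numbers, NumPy and pandas remain the more natural tools.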


95

First, let me say I agree with John D Cook's answer: Python is not a Domain Specific Language like R, and accordingly, there is a lot more you'll be able to do with it further down the road. Of course, R being a DSL means that the latest algorithms published in JASA will almost certainly be in R. If you are doing mostly ad hoc work and want to experiment with the latest lasso regression technique, say, R is hard to beat. If you are doing more production analytical work, integrating with existing software and environments, and concerned about speed, extensibility and maintainability, Python will serve you much better.

Second, ars gave a great answer with good links. Here are a few more packages that I view as essential to analytical work in Python:

  • matplotlib for beautiful, publication quality graphics.
  • IPython for an enhanced, interactive Python console. Importantly, IPython provides a powerful framework for interactive, parallel computing in Python.
  • Cython for easily writing C extensions in Python. This package lets you take a chunk of computationally intensive Python code and easily convert it to a C extension. You'll then be able to load the C extension like any other Python module but the code will run very fast since it is in C.
  • PyIMSL Studio for a collection of hundreds of mathematical and statistical algorithms that are thoroughly documented and supported. You can call the exact same algorithms from Python and C, with nearly the same API and you'll get the same results. Full disclosure: I work on this product, but I also use it a lot.
  • xlrd for reading in Excel files easily.

If you want a more MATLAB-like interactive IDE/console, check out Spyder, or the PyDev plugin for Eclipse.


17

I use Python for statistical analysis and forecasting. As mentioned by others above, NumPy and Matplotlib are good workhorses. I also use ReportLab for producing PDF output.

I'm currently looking at both Resolver and Pyspread, which are Excel-like spreadsheet applications based on Python. Resolver is a commercial product, but Pyspread is open source. (Apologies, I'm limited to only one link.)


36

I haven't seen scikit-learn explicitly mentioned in the answers above. It's a Python package for machine learning. It's fairly young but growing extremely rapidly (disclaimer: I am a scikit-learn developer). Its goals are to provide standard machine learning algorithmic tools in a unified interface, with a focus on speed and usability. As far as I know, you cannot find anything similar in Matlab. Its strong points are:

  • A detailed documentation, with many examples

  • High quality standard supervised learning (regression/classification) tools. Specifically:

  • The ability to perform model selection by cross-validation using multiple CPUs

  • Unsupervised learning to explore the data or do a first dimensionality reduction, that can easily be chained to supervised learning.

  • Open source, BSD licensed. If you are not in a purely academic environment (I am in what would be a national lab in the state) this matters a lot as Matlab costs are then very high, and you might be thinking of deriving products from your work.

Matlab is a great tool, but in my own work, scipy+scikit-learn is starting to give me an edge over Matlab because Python does a better job with memory due to its view mechanism (and I have big data), and because scikit-learn lets me compare different approaches very easily.
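To illustrate what "model selection by cross-validation" means, here is a minimal pure-Python sketch. It is not scikit-learn's API and does not use multiple CPUs; the two candidate models, the helper names, and the synthetic data are all hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training responses."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares straight line y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        errors = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Hypothetical data: a noisy linear trend.
rng = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

scores = {
    "mean": cross_val_mse(fit_mean, xs, ys),
    "line": cross_val_mse(fit_line, xs, ys),
}
best = min(scores, key=scores.get)
print(best)
```

Both models are scored on the same folds, so the comparison between approaches is like for like.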


15

Great overview so far. I have been using Python (specifically SciPy + matplotlib) as a Matlab replacement for 3 years while working at a university. I sometimes still go back because I'm familiar with specific libraries; e.g. the Matlab wavelet package is purely awesome.

I like the http://enthought.com/ Python distribution. It's commercial, yet free for academic purposes and, as far as I know, completely open-source. As I'm working with a lot of students, before using Enthought it was sometimes troublesome for them to install NumPy, SciPy, IPython etc. Enthought provides an installer for Windows, Linux and Mac.

Two other packages worth mentioning:

  1. IPython (already included with Enthought): a great advanced shell. A good intro is on ShowMeDo: http://showmedo.com/videotutorials/series?name=PythonIPythonSeries

  2. NLTK, the Natural Language Toolkit (http://www.nltk.org/): a great package in case you want to do some statistics or machine learning on any corpus.


10

Perhaps not directly related, but R has a nice GUI environment for interactive sessions (edit: on Mac/Windows). IPython is very good but for an environment closer to Matlab's you might try Spyder or IEP. I've had better luck of late using IEP, but Spyder looks more promising.

IEP: http://code.google.com/p/iep/

Spyder: http://packages.python.org/spyder/

And the IEP site includes a brief comparison of related software: http://code.google.com/p/iep/wiki/Alternatives


26

I would like to say that, as someone who relies heavily on linear models for statistical work and loves Python for other aspects of my job, I have been highly disappointed in Python as a platform for doing anything but fairly basic statistics.

I find R has much better support from the statistical community, much better implementation of linear models, and to be frank from the statistics side of things, even with excellent distributions like Enthought, Python feels a bit like the Wild West.

And unless you're working solo, the odds of you having collaborators who use Python for statistics, at this point, are pretty slim.


28

Perhaps this answer is cheating, but it seems strange no one has mentioned the rpy project, which provides an interface between R and Python. You get a Pythonic API to most of R's functionality while retaining the (I would argue nicer) syntax, data processing, and in some cases speed of Python. It's unlikely that Python will ever have as many bleeding-edge stats tools as R, just because R is a DSL and the stats community is more invested in R than possibly any other language.

I see this as analogous to using an ORM to leverage the advantages of SQL, while letting Python be Python and SQL be SQL.

Other useful packages specifically for data structures include:

  • pydataframe replicates a data.frame and can be used with rpy. Allows you to use R-like filtering and operations.
  • PyTables Uses the fast HDF5 format underneath; it has been around for ages
  • h5py Also HDF5, but specifically aimed at interoperating with NumPy
  • pandas Another project that manages data.frame-like data; works with rpy, PyTables and NumPy

19

What you are looking for is called Sage: http://www.sagemath.org/

It is an excellent online interface to a well-built combination of Python tools for mathematics.


8

I should add a shout-out for Sho, the numerical computing environment built on IronPython. I'm using it right now for the Stanford machine learning class and it's been really helpful. It has built-in linear algebra packages and charting capabilities. Being .NET, it's easy to extend with C# or any other .NET language. As a Windows user, I've found it much easier to get started with than straight Python and NumPy.


143

As a numerical platform and a substitute for MATLAB, Python reached maturity at least 2-3 years ago, and is now much better than MATLAB in many respects. I tried to switch to Python from R around that time, and failed miserably. There are just too many R packages I use on a daily basis that have no Python equivalent. The absence of ggplot2 is enough to be a showstopper, but there are many more. In addition to this, R has a better syntax for data analysis. Consider the following basic example:

Python:

results = sm.OLS(y, X).fit()

R:

results <- lm(y ~ x1 + x2 + x3, data=A)

What do you consider more expressive? In R, you can think in terms of variables, and can easily extend a model, to, say,

lm(y ~ x1 + x2 + x3 + x2:x3, data=A)

Compared to R, Python is a low-level language for model building.
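To show what the R formula is doing for you, here is a pure-Python sketch that expands term names such as `x2:x3` into design-matrix columns and then fits by ordinary least squares. It is a hypothetical illustration, not how `lm` or statsmodels is implemented: `design_matrix`, `solve`, and `ols` are made-up helpers, and the data set `A` is synthetic and noise-free so that the fit is exact. (statsmodels also has a formula interface, `statsmodels.formula.api`, that accepts R-style formula strings.)

```python
def design_matrix(terms, data):
    """Expand R-style term names into rows; 'a:b' means the product a * b."""
    n = len(next(iter(data.values())))
    columns = [[1.0] * n]  # intercept column, which R adds by default
    for term in terms:
        column = [1.0] * n
        for name in term.split(":"):
            column = [c * v for c, v in zip(column, data[name])]
        columns.append(column)
    return [list(row) for row in zip(*columns)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= factor * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(response, terms, data):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    X = design_matrix(terms, data)
    y = data[response]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data set "A", with y built from known coefficients.
A = {
    "x1": [0, 1, 0, 1, 0, 1, 2, 2],
    "x2": [0, 0, 1, 1, 2, 2, 0, 1],
    "x3": [1, 0, 0, 1, 1, 2, 2, 0],
}
A["y"] = [
    1 + 2 * a - b + 0.5 * c + 3 * b * c
    for a, b, c in zip(A["x1"], A["x2"], A["x3"])
]

# Counterpart of lm(y ~ x1 + x2 + x3 + x2:x3, data=A)
coefficients = ols("y", ["x1", "x2", "x3", "x2:x3"], A)
print([round(c, 6) for c in coefficients])
```

Adding the interaction only means adding one more term name, which is the convenience the formula syntax provides.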

If I had fewer requirements for advanced statistical functions and were already coding Python on a larger project, I would consider Python as a good candidate. I would consider it also when a bare-bone approach is needed, either because of speed limitations, or because R packages don't provide an edge.

For those doing relatively advanced Statistics right now, the answer is a no-brainer, and is no. In fact, I believe Python will limit the way you think about data analysis. It will take a few years and many man-years of effort to produce the module replacements for the 100 essential R packages, and even then, Python will feel like a language on which data analysis capabilities have been bolted on. Since R has already captured the largest relative share of applied statisticians across several fields, I don't see this happening any time soon. Having said that, it's a free country, and I know people doing Statistics in APL and C.


26

I am a biostatistician in what is essentially an R shop (~80 of folks use R as their primary tool). Still, I spend approximately 3/4 of my time working in Python. I attribute this primarily to the fact that my work involves Bayesian and machine learning approaches to statistical modeling. Python hits much closer to the performance/productivity sweet spot than does R, at least for statistical methods that are iterative or simulation-based. If I were performing ANOVAs, regressions and statistical tests, I'm sure I would primarily use R. Most of what I need, however, is not available as a canned R package.
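As an example of the kind of iterative, simulation-based method meant here, below is a minimal random-walk Metropolis sampler in plain Python. It is a sketch under stated assumptions, not any particular package's implementation: the data are hypothetical, the likelihood is N(mu, 1) with a flat prior on mu, and the step size, seed, and burn-in length are arbitrary choices.

```python
import math
import random

# Hypothetical observations, assumed to be N(mu, 1) with a flat prior on mu.
data = [4.8, 5.1, 5.3, 4.9, 5.4]


def log_posterior(mu):
    """Log posterior density of mu, up to an additive constant."""
    return -0.5 * sum((y - mu) ** 2 for y in data)


def metropolis(n_samples, start=0.0, step=0.5, seed=1):
    """Random-walk Metropolis sampler with Gaussian proposals."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(20000)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

With a flat prior the posterior mean should sit close to the sample mean of the data, which gives a quick sanity check on the sampler.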


26

There's really no need to give up R for Python anyway. If you use IPython with a full stack, you have R, Octave and Cython extensions, so you can easily and cleanly use those languages within your IPython notebooks. You also have support for passing values between them and your Python namespace. You can output your data as plots, using matplotlib, and as properly rendered mathematical expressions. There are tons of other features, and you can do all this in your browser.

IPython has come a long way :)


7

Note that SPSS Statistics has an integrated Python interface (also R). So you can write Python programs that use Statistics procedures and produce either the usual nicely formatted Statistics output or return results to your program for further processing. Or you can run Python programs in the Statistics command stream. You do still have to know the Statistics command language, but you can take advantage of all the data management, presentation output etc that Statistics provides as well as the procedures.


12

This is an interesting question, with some great answers.

You might find some useful discussion in a paper that I wrote with Roseline Bilina. The final version is here: http://www.enac.fr/recherche/leea/Steve%20Lawford/papers/python_paper_revised.pdf (it has since appeared, in almost this form, as "Python for Unified Research in Econometrics and Statistics", in Econometric Reviews (2012), 31(5), 558-591).


9

I found a great intro to pandas here that I suggest checking out. Pandas is an amazing toolset and provides the high level data analysis capabilities of R with the extensive libraries and production quality of Python.

This blog post gives a great intro to Pandas from the perspective of a complete beginner:

http://manishamde.github.com/blog/2013/03/07/pandas-and-python-top-10/


18

Rpy2 - play with R stay in Python...

Further elaboration per Gung's request:

Rpy2 documentation can be found at http://rpy.sourceforge.net/rpy2/doc-dev/html/introduction.html

From the documentation, The high-level interface in rpy2 is designed to facilitate the use of R by Python programmers. R objects are exposed as instances of Python-implemented classes, with R functions as bound methods to those objects in a number of cases. This section also contains an introduction to graphics with R: trellis (lattice) plots as well as the grammar of graphics implemented in ggplot2 let one make complex and informative plots with little code written, while the underlying grid graphics allow all possible customization is outlined.

Why I like it:

I can process my data using the flexibility of Python, turn it into a matrix using NumPy or pandas, do the computation in R, and get back R objects for post-processing. I work in econometrics, and Python simply will not have the bleeding-edge stats tools of R. And R is unlikely ever to be as flexible as Python. This does require you to understand R. Fortunately, it has a nice developer community.

Rpy2 itself is well supported and the gentleman supporting it frequents the SO forums. Windows installation may be a slight pain - https://stackoverflow.com/questions/5068760/bizzarre-issue-trying-to-make-rpy2-2-1-9-work-with-r-2-12-1-using-python-2-6-un?rq=1 might help.


9

No one has mentioned Orange before:

Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.

I don't use it on a daily basis, but it's a must-see for anyone who prefers a GUI to a command-line interface.

Even if you prefer the latter, Orange is a good thing to be familiar with, since you can easily import pieces of Orange into your Python scripts in case you need some of its functionality.


5

For those who have to work under Windows, Anaconda (https://store.continuum.io/cshop/anaconda/) really helps a lot. Installing packages under Windows was a headache. With Anaconda installed, you can set up a ready-to-use development environment with a one-liner.

For example, with

conda create -n stats_env python pip numpy scipy matplotlib pandas

all these packages will be fetched and installed automatically.


7

A recent comparison from DataCamp gives a clear picture of R and Python.

It covers how the two languages are used in data analysis. Python is generally used when the data analysis tasks need to be integrated with web apps or when statistics code needs to be incorporated into a production database. R is mainly used when the data analysis tasks require standalone computing or analysis on individual servers.

I found that blog post very useful and hope it helps others understand recent trends in both languages. Julia is also an up-and-coming option in this area. Hope this helps!


6

Python has a long way to go before it can be compared to R. It has significantly fewer packages than R and of lower quality. People who stick to the basics or rely only on their custom libraries could probably do their job exclusively in Python but if you're someone who needs more advanced quantitative solutions, I dare to say that nothing comes close to R out there.

It should be also noted that, to date, Python has no proper scientific Matlab-style IDE comparable to R-Studio (please don't say Spyder) and you need to work out everything on the console. Generally speaking, the whole Python experience requires a good amount of "geekness" that most people lack and don't care about.

Don't get me wrong, I love Python; it's actually my favourite language, which, unlike R, is a real programming language. Still, when it comes to pure data analysis I am dependent on R, which is by far the most specialised and developed solution to date. I use Python when I need to combine data analysis with software engineering, e.g. to create a tool that automates the methods I first programmed in a dirty R script. On many occasions I use rpy2 to call R from Python because in the vast majority of cases R packages are so much better (or don't exist in Python at all). This way I try to get the best of both worlds.

I still use some Matlab for pure algorithm development since I love its mathematical-style syntax and speed.


6

I believe Python is a superior workbench in my field. I do a lot of scraping, data wrangling, large data work, network analysis, Bayesian modeling, and simulations. All of these things typically need speed and flexibility so I find Python to work better than R in these cases. Here are a few things about Python that I like (some are mentioned above, other points are not):

-Cleaner syntax; more readable code. I believe Python to be a more modern and syntactically consistent language.

-Python has Notebook, IPython, and other amazing tools for code sharing, collaboration, and publishing.

-IPython's notebook enables one to use R in one's Python code, so it is always possible to go back to R.

-Substantially faster without recourse to C. Using Cython, Numba, and other methods of C integration will bring your code to speeds comparable to pure C. This, as far as I am aware, cannot be achieved in R.

-pandas, NumPy, and SciPy blow standard R out of the water. Yes, there are a few things that R can do in a single line but that take pandas 3 or 4. In general, however, pandas can handle larger data sets, is easier to use, and provides incredible flexibility in regard to integration with other Python packages and methods.

-Python is more stable. Try loading a 2 GB dataset into RStudio.

-One neat package that doesn't seem mentioned above is PyMC3 - great general package for most of your Bayesian modeling.

-Some answers above mention ggplot2 and grumble about its absence from Python. If you have ever used Matlab's graphing functionality and/or matplotlib in Python, then you'll know that the latter options are generally much more capable than ggplot2.

However, perhaps R is easier to learn, and I do frequently use it in cases where I am not yet too familiar with the modeling procedures. In that case, the depth of R's off-the-shelf statistical libraries is unbeatable. Ideally, I would know both well enough to use either as needed.


0

I thought I'd add a more up-to-date answer than those given. I'm a Python guy, through-and-through, and here's why:

  1. Python is easily the most intuitive syntax of any programming language I've ever used, except possibly LabVIEW. I can't count the number of times I've simply tried 20-30 lines of code in Python, and they've worked. That's certainly more than can be said for any other language, even LabVIEW. This makes for extremely fast development time.

  2. Python is performant. This has been mentioned in other answers, but it bears repeating. I find Python opens large datasets reliably.

  3. The packages in Python are fast catching up to R's packages. Certainly, Python use has considerably outstripped R use in the last few years, although technically this argument is, of course, an ad populum.

  4. More and more, I find readability to be among the most important qualities good code can possess, and Python is the most readable language ever (assuming you follow reasonably good coding practices, of course). Some of the previous answers have tried to argue that R is more readable, but the examples they've shown all prove the opposite for me: Python is more readable than R, and it's also much quicker to learn. I learned basic Python in one week!

  5. The Lambda Labs Stack is a newer tool than Anaconda, and one-ups it, in my opinion. The downside: you can only install it in Ubuntu 16.04 and 18.04, and the Ubuntu derivatives in those versions. The upside: you get all the standard GPU-accelerated packages managed for you, all the way to the hardware drivers. Anaconda doesn't do that. The Lambda Labs stack maintains compatible version numbers all the way from your Theano or Keras version to the NVIDIA GPU driver version. As you are probably aware, this is no trivial task. When it comes to machine learning, Python is king, hands-down. And GPU acceleration is something most data professionals find they can't do without.

  6. Python has an extremely well-thought-out IDE now: PyCharm. In my opinion, this is what serious Python developers should be using - definitely NOT Jupyter notebooks. While many people use Visual Studio Code, I find PyCharm to be the best IDE for Python. You get everything you could practically want - IPython, terminal, advanced debugging tools including an in-memory calculator, and source code control integration.

  7. Many people have said that Python's stats packages aren't as complete as R's. No doubt that's still somewhat true (although see 3 above). On the other hand, I, for one, have not needed those incredibly advanced stats packages. I prefer waiting on advanced statistical analysis until I fully understand the business question being asked. Oftentimes, a relatively straightforward algorithm for computing a metric is the solution to the problem, in which case Python makes a superb tool for calculating metrics.

  8. A lot of people like to tout R's ability to do powerful things in only one line of code. In my opinion, that isn't a terribly great argument. Is that one line of code readable? Typical code is written one time, and read ten times! Readability counts, as the Zen of Python puts it. I'd rather have several lines of readable Python code than one cryptic line of R code (not that those are the only choices, of course; I just want to make the point that fewer lines of code does not equal greater readability).

  9. Incidentally, I just can't resist making a comment about SAS, even though it's slightly off-topic. As I said in an earlier point, I learned basic Python in one week. Then I tried SAS. I worked my way through about three chapters of an 11-chapter book on SAS, and it took me two months! Moreover, whenever I tried something, it never ever worked the first time. I would STRONGLY urge you to abandon SAS as soon as you can. It has an extremely convoluted syntax, is extraordinarily unforgiving, etc. About the only good thing that can be said about it is that its statistical capabilities are as complete as R's. Whoopty do.

So, there you have it. Python all the way!