
Getting Started with Python for Data Scientists

 dinghj 2014-04-24

With the R Users DC Meetup broadening its topic base to include other statistical programming tools, it seemed only reasonable to write a meta post highlighting some of the best Python tutorials and resources available for data science and statistics. What you don’t know is often the hardest part of picking up a new skill, so hopefully these resources will help make learning Python a little easier. Prepare yourself for code indentation heaven.

Python is such an incredible language because it can do practically anything, from high performance scientific computing to web frameworks such as Django or Flask.  Python is heavily used at Google so the language must be doing something right. And, similar to R, Python has a fantastic community around it and, luckily for you, this community can write. Don’t just take my word for it, watch the following video to fully understand.

 


Distributions

Python is available for free from http://www./ and there are two popular versions, 2.7 and 3.x. Which should you choose? I would go with either whatever is currently installed on your system or 2.7. For a better discussion, check out this site.
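To find out which version is "currently installed on your system", you can ask the interpreter itself. The snippet below is a minimal sketch; the helper name describe_interpreter is hypothetical and only used for illustration.

```python
import sys


def describe_interpreter(version_info=sys.version_info):
    """Return a short label such as 'Python 2.7' built from a version tuple."""
    return "Python {}.{}".format(version_info[0], version_info[1])


# Report the interpreter that is running this script.
print(describe_interpreter())

# The major version number separates the 2.x line from the 3.x line.
if sys.version_info[0] < 3:
    print("This interpreter belongs to the 2.x line.")
else:
    print("This interpreter belongs to the 3.x line.")
```

The same information is available from the command line with python --version.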

Commercial distributions that bundle and test various useful packages are also available, such as the Enthought Python Distribution. In Enthought’s words: “This distribution provides a comprehensive, cross-platform environment for scientific computing with the Python programming language. A single-click installer allows immediate access to over 100 libraries and tools. Our open source initiatives include SciPy, NumPy, and the Enthought Tool Suite.”

Python Developer Tools

Getting started with a new programming language often requires getting started with a new tool to use the language, unless you are a hardcore VI, VIM, or EMACS person. Python is no exception and there are a great number of editors or full-blown IDEs to try out:

Sublime Text 2 - If you have never used it, you should try this editor. “Sublime Text is a sophisticated text editor for code, markup and prose. You’ll love the slick user interface, extraordinary features and amazing performance.”

IPython provides a rich architecture for interactive computing with:

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

NINJA-IDE  (free) (from the recursive acronym: “Ninja-IDE Is Not Just Another IDE”), “is a cross-platform integrated development environment (IDE). NINJA-IDE runs on Linux/X11, Mac OS X and Windows desktop operating systems, and allows developers to create applications for several purposes using all the tools and utilities of NINJA-IDE, making the task of writing software easier and more enjoyable.”

PyCharm by JetBrains (not free) – the folks at JetBrains make great tools and PyCharm is no exception.

 

Learning Python

Learn about Packages

Python is known for its “batteries included” philosophy and has a rich standard library. However, because the language is so popular, the number of third-party packages is much larger than the number of standard library packages. So it eventually becomes necessary to discover how packages are used, found and created in Python.
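As a small illustration of "batteries included", the sketch below uses two standard library modules, json and collections, without installing anything. The sample sentence and variable names are made up for the example.

```python
import json                      # import a whole module
from collections import Counter  # import one name from a module

# Count word frequencies with a standard library container.
words = "the quick brown fox jumps over the lazy dog the end".split()
word_counts = Counter(words)
top_word, top_count = word_counts.most_common(1)[0]

# Serialize the result with another standard library module.
encoded = json.dumps({"word": top_word, "count": top_count}, sort_keys=True)
print(encoded)
```

Third-party packages are imported with exactly the same syntax once they are installed.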

 

Package Management and Installation

Once you know a bit about packages, you will start installing them. There is no better way to get this done than with either the EasyInstall or pip package managers. It is recommended that you use pip, as it is newer and seems to have broader support.

For Windows users, it is sometimes helpful to use the pre-built binaries maintained here: http://www.lfd./~gohlke/pythonlibs/

You will notice that not all packages have been ported to 3.x. This is true of many popular libraries and it is why 2.6 or 2.7 is recommended.
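Installation itself happens on the command line (for example, pip install followed by a package name). From inside Python you can check whether a module is already importable before relying on it. This is a minimal sketch that assumes a Python 3.4+ interpreter, where importlib.util.find_spec is available; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(module_name) is not None


# Check one standard library module and a few common third-party ones.
for name in ["json", "numpy", "pandas"]:
    status = "available" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```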

Virtualenv – learn it early and use it

Package management can be a pain point when working across systems or when deploying larger applications in production environments. For this reason it is HIGHLY RECOMMENDED that you get comfortable with the wonderful virtualenv package. Here is a good intro to virtualenv for Ubuntu (for the Windows users… well, just go install Ubuntu). The basic idea is that each of your projects gets a self-contained Python environment which can be shipped to a new machine and carry its Gordian knot of dependencies with it.
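The idea of one environment per project can be sketched with the standard library's venv module, which is a swapped-in stand-in here: it ships with Python 3.3+ and follows the same concept as the third-party virtualenv package described above. The directory layout and the helper name create_project_env are assumptions made for illustration.

```python
import os
import tempfile
import venv


def create_project_env(project_dir):
    """Create an isolated environment in <project_dir>/env and return its path."""
    env_dir = os.path.join(project_dir, "env")
    # with_pip=False keeps the sketch fast and avoids needing ensurepip.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Use a throwaway directory as a stand-in for a real project folder.
project_dir = tempfile.mkdtemp()
env_dir = create_project_env(project_dir)

# Every environment created this way gets its own pyvenv.cfg marker file.
print(os.path.isfile(os.path.join(env_dir, "pyvenv.cfg")))
```

In day-to-day use the environment is usually created and activated from the shell rather than from a script.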

Python Koans – the zen of python

This project is great for those who want to dive right in. It is based on a Ruby project which presents the language as a series of failing unit tests. You must edit the source until each unit test passes. It is wonderful and doubles as an introduction to TDD (Test Driven Development) while you learn Python.

https://github.com/gregmalcolm/python_koans/wiki
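To give a feel for the format, here is a minimal sketch of a koan-style exercise written with the standard unittest module. It is not taken from the python_koans repository; the class, test names, and data are hypothetical. In a real koan the expected values start out as blanks that make the test fail, and here they are already filled in.

```python
import unittest


class AboutLists(unittest.TestCase):
    """A koan-style test case: each test teaches one small fact."""

    def test_slicing_returns_part_of_a_list(self):
        fruits = ["apple", "banana", "cherry", "date"]
        # The learner would replace a blank with the expected value below.
        self.assertEqual(["banana", "cherry"], fruits[1:3])

    def test_negative_index_counts_from_the_end(self):
        fruits = ["apple", "banana", "cherry", "date"]
        self.assertEqual("date", fruits[-1])


# Run the tests programmatically and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AboutLists)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```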

 

Learn Python the Hard Way

Yes, here is an entire book on Python, free online, and you can upgrade for even more content and videos. And yes, the book is pretty good.

“Welcome to the 3rd Edition of Learn Python the Hard Way. You can visit the companion site to the book at http:/// where you can purchase digital downloads and paper versions of the book. The free HTML version of the book is available at http:///book/.”

 

Python’s Execution Model

If you want to dive deeper into the underlying execution model of Python, there is no better place to start than this fantastic post:

Those new to Python are often surprised by the behavior of their own code. They expect A but, seemingly for no reason, B happens instead. The root cause of many of these “surprises” is confusion about the Python execution model. It’s the sort of thing that, if it’s explained to you once, a number of Python concepts that seemed hazy before become crystal clear. It’s also really difficult to just “figure out” on your own, as it requires a fundamental shift in thinking about core language concepts like variables, objects, and functions.

In this post, I’ll help you understand what’s happening behind the scenes when you do common things like creating a variable or calling a function. As a result, you’ll write cleaner, more comprehensible code. You’ll also become a better (and faster) code reader. All that’s necessary is to forget everything you know about programming…
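One classic "expected A, got B" surprise comes from the fact that Python variables are names bound to objects rather than boxes holding values. The following is a minimal sketch of that behavior; it is not taken from the quoted post, and the function names are hypothetical.

```python
# Two names, one list object.
a = [1, 2, 3]
b = a            # b is a second name for the same list, not a copy
b.append(4)      # the change is visible through both names
print(a is b)    # → True


def add_item(items):
    """Mutate the object the caller passed in."""
    items.append("x")


def rebind(items):
    """Rebind the local name only; the caller's object is untouched."""
    items = ["new"]
    return items


data = []
add_item(data)   # data is now ["x"]
rebind(data)     # data is still ["x"]
print(data)      # → ['x']
```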

Python for Numerical and Scientific Computing

NumPy, SciPy, and matplotlib form the basis for scientific computing in Python.

NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
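The array object, broadcasting, and custom data types mentioned above can be sketched in a few lines. This assumes NumPy is installed; the variable names and sample records are made up for the example.

```python
import numpy as np

# A 2 x 3 array: [[0, 1, 2], [3, 4, 5]].
matrix = np.arange(6).reshape(2, 3)

# Broadcasting: the length-3 vector of column means is subtracted from
# every row without writing an explicit loop.
column_means = matrix.mean(axis=0)
centered = matrix - column_means

# A structured dtype lets one array hold named, mixed-type fields.
record_type = np.dtype([("name", "U10"), ("score", "f8")])
records = np.array([("ann", 3.5), ("bob", 4.0)], dtype=record_type)
mean_score = records["score"].mean()

print(centered)
print(mean_score)
```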

 

SciPy

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world’s leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!
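The integration and optimization routines mentioned above look roughly like this in practice. This is a minimal sketch assuming SciPy is installed; the integrand and the quadratic objective are arbitrary examples.

```python
import math

from scipy import integrate, optimize

# Numerical integration: the area under sin(x) from 0 to pi is 2.
area, error_estimate = integrate.quad(math.sin, 0.0, math.pi)

# Optimization: find the x that minimizes a simple quadratic with its
# minimum at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area)
print(result.x)
```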

 

Matplotlib

matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB or Mathematica), web application servers, and six graphical user interface toolkits.
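A minimal sketch of producing a figure in a script follows, assuming matplotlib is installed. It selects the non-interactive Agg backend so that it also runs on machines without a display, and writes the PNG into memory instead of a file; both choices are assumptions made to keep the example self-contained.

```python
import io
import math

import matplotlib

matplotlib.use("Agg")  # render without needing a GUI toolkit
import matplotlib.pyplot as plt

# Sample one period of a sine wave.
xs = [i * 0.1 for i in range(63)]
ys = [math.sin(x) for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Save the figure as PNG bytes; pass a filename instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```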

 

Python for Data

Pandas

Pandas is really the Python approximation to R, although most would argue that it isn’t yet as full featured as R. Or, in the words of the website, “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.
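For R users, the DataFrame is the most familiar entry point. The sketch below assumes pandas is installed and uses a tiny made-up table to show a split-apply-combine step similar to what aggregate would do in R.

```python
import pandas as pd

# A small table with one categorical and one numeric column.
df = pd.DataFrame(
    {
        "group": ["a", "b", "a", "b"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Group rows by the categorical column and average the numeric one.
group_means = df.groupby("group")["value"].mean()
print(group_means)
```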

 

Statsmodels

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python.

 


Sean Murphy

Senior Scientist and Data Science Consultant at JHU
Sean Patrick Murphy, with degrees in math, electrical engineering, and biomedical engineering and an MBA from Oxford, has served as a senior scientist at Johns Hopkins University for over a decade, advises several startups, and provides learning analytics consulting for EverFi. Previously, he served as the Chief Data Scientist at a Series A funded health care analytics firm, and the Director of Research at a boutique graduate educational company. He has also cofounded a big data startup and Data Community DC, a 2,000 member organization of data professionals. Find him on LinkedIn and Twitter.
