Installation of Python and Package Management

Chapter cheat sheet

command                          function
python                           launch the Python interpreter
conda create --name pyearthsci   make an Anaconda environment called "pyearthsci"
conda activate pyearthsci        activate the Anaconda environment called "pyearthsci"
conda install spyder             install the package called "spyder"
import numpy as np               import the package numpy with the alias "np"
python PythonHello.py            run the Python script called "PythonHello.py"
ipython                          launch an IPython console
jupyter notebook                 launch a Jupyter notebook

Introduction

If you are currently using a recent Mac or Linux operating system, open a terminal and type,

python

and you should see something like,

Python 2.7.12
Type "help", "copyright", "credits" or "license" for more information.

You have just entered the native installation of Python on your computer, no extra steps needed. This is because, though it is a great tool for earth science and data analytics, Python is a general-purpose language that is used by all sorts of programs and utilities. While it is nice that Python is such an open and widely used tool, you should take care not to modify this native installation to the point that the other useful and essential utilities that depend on it are disrupted. For instance, a package or command may no longer be installed where the operating system originally put it. For this reason, this chapter will outline how to install a modern version of Python, as well as many packages useful for data science, in a tidy environment all its own.

Python, a brief history

As the story goes, in 1989 Guido van Rossum decided he needed something to do over the Christmas holidays, and instead of reading a nice book or learning to brew his own beer, he decided to develop a scripting language called Python, named after Monty Python’s Flying Circus. Since then, Python has come to be known by several core principles, the most notable of which are a focus on readability and requiring fewer lines of code. Because of this, lines of code can almost be read as a sentence in plain English. For example, if I would like to add one to every number in my list, I would write,

[number+1 for number in my_list]

Though this may look daunting if you are new to coding, if you read it out loud you can almost hear what it does. Along the same lines, the Python philosophy tends not to be that there are many clever ways to do one thing, but one very clear way. Because of these ideals, Python can be a very useful and rewarding coding language to learn, which is reflected in its popularity.
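As a minimal sketch of the list example, assuming a made-up my_list of three integers (the values are purely illustrative), the one-line form can be placed next to the equivalent explicit loop:

```python
# Hypothetical example data; any list of numbers would do.
my_list = [1, 2, 3]

# The one-line form from the text:
# "number plus one, for each number in my_list".
plus_one = [number + 1 for number in my_list]

# The same result written as an explicit loop, for comparison.
plus_one_loop = []
for number in my_list:
    plus_one_loop.append(number + 1)

print(plus_one)       # [2, 3, 4]
print(plus_one_loop)  # [2, 3, 4]
```

Both forms build a new list and leave my_list itself unchanged.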

Python 2 vs Python 3

As you start in Python, you will quickly find yourself wondering why there are two different versions being used. Python 2 was released in 2000 as the first major update, and many programs have been written using this flavor. However, in 2010 with the release of version 2.7, it was announced that Python 2 would be phased out in favor of the new Python 3, so there is no plan for a version 2.8. This major update from 2 to 3 was made to change some small yet significant things in the language, such as how it handles text data and iterates through lists and dictionaries. The idea is that it is better to update a language to fix things than to keep dealing with small bugs because of a refusal to change. As Python 2 is scheduled to be retired in the next few years, this manual will focus on using Python 3. This does mean that your Python 3 code may not work with your native Python 2 installation, but in the realm of data science, as you will be using so many specialized packages of code, this would be the case anyway. In the end, you will be using a self-contained Python environment that contains your Python installation, as well as all the code you will be using, in one neat little box.
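To make the differences mentioned above a little more concrete, here is a small sketch of how Python 3 treats text and iteration. The station name and the dictionary values are made up for illustration.

```python
# In Python 3, text is Unicode by default and bytes are a separate type.
station = "Z\u00fcrich"              # "Zürich", written with an escape for safety
encoded = station.encode("utf-8")    # explicit conversion from text to bytes
print(type(station).__name__, type(encoded).__name__)  # str bytes
print(len(station), len(encoded))    # 6 7 (the "ü" takes two bytes in UTF-8)

# Dictionary methods return lightweight views rather than full lists.
elevations = {"site_a": 540, "site_b": 1560}  # hypothetical values
keys = elevations.keys()
print(type(keys).__name__)           # dict_keys
for name, height in elevations.items():
    print(name, height)

# range produces its numbers on demand instead of building a list up front.
numbers = range(3)
print(type(numbers).__name__)        # range
print(list(numbers))                 # [0, 1, 2]
```

In Python 2 the same calls behave differently (for example, keys() returns a list), which is one reason code does not always move between the two versions unchanged.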

Environments and packages

Using other people’s code

As Python is a general-purpose language, the basic functionality out of the box is also very general: things such as basic math, file manipulation, and printing output. So if you want to do anything beyond what is defined in the core language, you need to write your own little bit of code to do it. However, the first time you need a bit of code that core Python doesn’t have built in, something like calculating the standard deviation of a set of numbers, you can assume that someone else has probably run into the same issue before you. Luckily, the Python community is very active in writing these bits of code and sharing them so that you don’t have to write every function from scratch. Not only that, many of these little bits of code have been bundled into large collections of code called packages. For example, the mean, median, standard deviation, percentile, and other statistical functions are already built into a package called NumPy (Numerical Python), which gives you access to a whole bunch of bits of code. Better still, there are dedicated package managers that will take care of downloading and installing the package, as well as making sure it plays nicely with all the other packages you are using. All you have to do is tell it which package!
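As a rough illustration of the point above, the sketch below first writes a standard deviation by hand and then uses the NumPy functions named in the text. It assumes numpy is already installed, and the sample values and the helper name std_by_hand are made up for this example.

```python
import numpy as np

# Made-up sample values, purely for illustration.
values = [12.0, 15.0, 11.0, 18.0, 14.0]

def std_by_hand(numbers):
    """Population standard deviation written from scratch (hypothetical helper)."""
    mean = sum(numbers) / len(numbers)
    squared_deviations = [(number - mean) ** 2 for number in numbers]
    return (sum(squared_deviations) / len(numbers)) ** 0.5

print(std_by_hand(values))

# The same statistics, already written and tested for you in NumPy.
data = np.array(values)
print(np.mean(data))            # 14.0
print(np.median(data))          # 14.0
print(np.std(data))             # population standard deviation, as above
print(np.percentile(data, 90))  # value below which 90% of the data fall
```

The hand-written function and np.std agree here because np.std computes the population standard deviation by default.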

Which package manager to use?

Probably the most common package manager is called pip. pip is a wonderfully useful tool that is widely supported, which you will not use. Instead you will use Anaconda, whose package manager is the conda command, for the following reasons:

  1. Anaconda is designed for data science.

  2. Anaconda will handle not only the Python packages, but non-Python libraries such as HDF5 (which allows us to read some data files) and the Math Kernel Library. It will even manage an R installation.

  3. Anaconda also manages environments which are important because they:

    • Keep your Python installations working together.

    • Keep separate collections of packages in case some don’t work well together.

    • Are duplicatable and exportable, so your work can be replicated.

Versions, packages, environments, why so complicated?

Though this all may seem a bit complicated to just make a plot or do some math, it becomes necessary because of two main issues: the computer needs to know where to look for things, and what to call them.

Just like when you go back to look at the wonderful photos you took on vacation 3 years ago, only to find a giant mess of folders and sub-folders to go through, your computer also has to look through all its storage to find where a bit of code might be located. When properly managed, all the files are put in the appropriate place, where the computer can easily find them. Similarly, if I have a file in the folder Photos/ called MyBestPicture.jpg, and I have a different file in the folder Photos2/ called MyBestPicture.jpg, when I tell my computer I want MyBestPicture.jpg, it has no idea which one I mean. By using these tools, you keep everything nice and tidy.

Installing Anaconda

Anaconda is a commercially maintained but free and open-source package manager designed for data science. As such, the developers have made it quite easy to install on Windows, Mac, and Linux. Simply go to https://www.anaconda.com/download/, find your operating system, and download the appropriate Python 3.7 installer. Again, you want to use version 3.7, but if you end up mixing up versions or already have another version installed, don’t panic: you can create a Python 3.7 environment later.

Windows installation notes

Installation on Microsoft Windows is fairly straightforward, but can take quite some time. Simply follow the graphical installer; the only thing to change is to uncheck the option to register Anaconda as the default Python installation. Though this is not as vital as with Unix-based systems, it is still a good idea. After the long installation finishes, you can access an Anaconda command line via Anaconda Prompt in the Start Menu.

OSX installation notes

Installation on OSX should be quite straightforward: simply follow the installation guide of the graphical installer.

Linux installation notes

Once the file has been downloaded, open a terminal and navigate to where the file was saved. The file installer is a bash script, which can be run by entering

bash Anaconda3-FILE-NAME.sh

where Anaconda3-FILE-NAME.sh is the name of your file. The package will ask you to review the licence information and agree. You will then be asked if you would like to install Anaconda in another location, and you can simply install into the default location. The installer will then proceed to install Anaconda on your machine. Once the installation is complete, the installer will ask “You may wish to edit your .bashrc or prepend the Anaconda3 install location:”, followed by a suggested command that looks something like,

export PATH=/YOUR/PATH/TO/anaconda3/bin:$PATH

In order to make Anaconda work, you need to add the file path of Anaconda to a variable the operating system uses called $PATH. To do this, you can add a modified version of this line to a file called .bashrc in your home folder. Simply go to your home folder and open the file .bashrc with a text editor, and at the end of the file add the line,

export PATH=$PATH:/YOUR/PATH/TO/anaconda3/bin

where the /YOUR/PATH/TO/anaconda3/bin is the same one that Anaconda suggested at the end of installation. If you forgot it, it should be something like

/home/YOURNAME/anaconda3/bin

You may notice that we just switched where $PATH goes in that command. This is because you want to add your Anaconda location to the end of $PATH, meaning that the operating system looks in this folder last instead of first. This ensures that you don’t cause any problems with the native Python installation.
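If you are curious about the search order described above, the following sketch lists the folders on your PATH in the order they are searched and reports which python and conda commands would be found. It only uses the Python standard library, and the output will differ from machine to machine.

```python
import os
import shutil

# PATH is one long string of folders separated by os.pathsep
# (":" on Mac and Linux, ";" on Windows).
path_folders = os.environ.get("PATH", "").split(os.pathsep)

# Folders are searched from first to last, and the first match wins.
for position, folder in enumerate(path_folders):
    print(position, folder)

# shutil.which performs the same search and returns the match it finds,
# or None if the command is not on PATH at all.
print(shutil.which("python"))
print(shutil.which("conda"))
```

If the Anaconda folder appears near the end of the list, the operating system will reach it only after checking the folders listed before it.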

Creating your first environment

First, you will verify that your Anaconda installation is working. To do so, open a new command line and simply type,

conda

You should see a nice overview of how to use the conda command. If this is not the case, either the installation didn’t work, or you might have a problem with your PATH (where the computer looks for commands). But if it worked, you can move on to creating your first environment. You will name the environment pyearthsci and initially install only the numpy package. In the command line, input:

conda create --name pyearthsci numpy

You will be asked if you would like to proceed in installing a bunch of new packages, way more than numpy, and you can say yes. The reason so many new packages were listed is the magic of a package manager. The basic Python 3 with the numpy package actually depends on all these underlying dependencies, which Anaconda kindly figures out for you. So now you have your nice new environment, and you can activate it by entering

conda activate pyearthsci

on Mac or Linux and

activate pyearthsci

on Windows.

Your command line should now tell you that you are in the pyearthsci environment. If you now open a Python console by typing python in the command line, the version shown should be Python 3 (for example 3.7.0) rather than the native Python 2. In this same manner, you can do things like duplicate and export your environments, or make new environments with different packages or even different Python versions.
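One way to double-check which Python you ended up in is to ask the interpreter itself. The sketch below uses only the standard library; it assumes that conda sets the CONDA_DEFAULT_ENV variable when an environment is activated, so the last value may be absent in other setups.

```python
import os
import sys

# Full path of the interpreter running this code; inside the environment
# it should point into a folder named after the environment.
print(sys.executable)

# Version of that interpreter as (major, minor, micro).
print(sys.version_info[:3])

# Name of the active conda environment, if conda has set it.
print(os.environ.get("CONDA_DEFAULT_ENV", "no conda environment active"))

if sys.version_info.major < 3:
    print("This is Python 2; the environment is probably not activated.")
```

Typed into the Python console of the activated environment, this should report a Python 3 interpreter and the name pyearthsci.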

Installing a package

Now that you are in your nice new environment, you can add any package you might need. Open a command line and enter the pyearthsci environment. To install the Spyder package, you simply enter,

conda install spyder

Anaconda will list all the package changes it will make, and ask if you would like to proceed. Confirm yes, then let the magic happen. Now you have the Spyder IDE, which you can use to develop code (similar concept to R Commander or the MATLAB IDE). You can also give multiple packages at a time,

conda install scipy matplotlib

or even pass a yml file listing the packages you need with

conda env update --file pyearthsci.yml

Anaconda has some nice documentation about how to use their software, including how to search for packages not in their repositories, which we will not cover here. Now that you have your installation and environment all sorted out, you can start to explore Python itself a bit in the next chapters.
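After installing, you may want to confirm from within Python that the packages can actually be found. The following is a small sketch using the standard library; the helper name is_installed is made up for this example, and the list simply repeats the packages mentioned in this chapter.

```python
import importlib.util

# Package names used as examples in this chapter.
wanted = ["numpy", "scipy", "matplotlib", "spyder"]

def is_installed(package_name):
    """Return True if Python can locate the package in the current environment."""
    return importlib.util.find_spec(package_name) is not None

for name in wanted:
    if is_installed(name):
        print(name, "found")
    else:
        print(name, "missing (try: conda install " + name + ")")
```

This only checks whether the package can be located; it does not import it, so it runs quickly even for large packages.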

Importing packages

Now that you have your Python environment set up and packages installed, you're ready to use them. Open a Python console by typing python in a command line, just like you did at the beginning of this introduction. You have entered the Python console, the real workhorse of Python development. Now, try importing numpy:

import numpy

You have now successfully imported the all-powerful numpy library! Feel the power of linear algebra in your hands! If you need further proof that you are actually wielding the power of numpy, we can do a quick test by verifying which module a function is coming from, specifically two min (minimum) functions:

min.__module__
↳ 'builtins'
numpy.min.__module__
↳ 'numpy.core.fromnumeric'

What you see is that the first function min comes from builtins, which means it is a core Python function, an integral part of the language. In contrast, numpy.min is a function of the numpy module, which offers gains in speed and functionality. (The exact module string depends on your numpy version; newer releases may simply report 'numpy'.) It is important to note that while both functions will tell you a minimum out of a set of numbers, they are not the same function, and can cause you headaches if you mix them up. That is why when you import numpy, you access numpy functions by referencing it explicitly as numpy.min. However, typing numpy. over and over can be tedious, so we can give it a nickname, or alias:

import numpy as np
np.min.__module__
↳ 'numpy.core.fromnumeric'

This way we can track which function comes from where and always know what package we are actually using.
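To see why it matters which min you are calling, here is a short sketch with a small 2-D array of made-up values. It assumes numpy is installed in your environment.

```python
import numpy as np

# A small 2-D array of made-up values: 2 rows, 3 columns.
grid = np.array([[4, 9, 2],
                 [7, 1, 8]])

# numpy's min understands arrays: overall minimum, or minimum along an axis.
print(np.min(grid))          # 1
print(np.min(grid, axis=0))  # [4 1 2]  (minimum of each column)
print(np.min(grid, axis=1))  # [2 1]    (minimum of each row)

# The builtin min only steps through the rows and tries to compare whole
# rows with each other, which numpy refuses to reduce to one True/False.
try:
    min(grid)
except ValueError as error:
    print("builtin min failed:", error)
```

With the np. prefix in place, it is always clear from the code itself which of the two functions is being used.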

Cautionary note:

You may encounter someone importing functions using something like from numpy import *, and this is bad practice! If you don't want to read the explanation here is the short version: don't import things this way. Now if you want to know why, read on.

This command tries to import everything from numpy so you do not have to put the 'np.' prefix. Here is an example of why it should be avoided. Exit out of Python (Ctrl-D or exit()) and open a fresh Python console (just to refresh everything). Now try:

min.__module__
↳ 'builtins'
amin.__module__
↳ Traceback (most recent call last):
↳ File "<stdin>", line 1, in <module>
↳ NameError: name 'amin' is not defined

Here, we looked for the parent module of min and amin. Python was unable to find amin, which makes sense because it is a function in the numpy package (short for array minimum) that calculates the minimum of an array, optionally along an axis. Now if we import numpy the bad way, here is what you get:

from numpy import *
min.__module__
↳ 'builtins'
amin.__module__
↳ 'numpy.core.fromnumeric'

Notice that the min function is still part of builtins but amin is now part of numpy, which means that numpy.min is actually missing! The consequence is we have a really hard time keeping track of which functions are coming from what package. This may all seem a bit silly, but take my word for it, this can really cause headaches, so just avoid it.
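The same kind of confusion can be shown with a star import that actually replaces a builtin. The sketch below swaps in the standard library module os instead of numpy, so that it runs without any extra packages, and performs the star import inside a throwaway dictionary (via exec) so that your own session is not affected.

```python
import builtins
import os

# Run the star import inside a throwaway namespace so that this
# session's own names are left untouched.
namespace = {}
exec("from os import *", namespace)

# Normally the name "open" means the builtin file-opening function.
print(open is builtins.open)               # True

# Inside the namespace, "open" has silently become os.open, a different,
# lower-level function that takes different arguments.
print(namespace["open"] is os.open)        # True
print(namespace["open"] is builtins.open)  # False
```

Nothing warns you when this happens, which is exactly why explicit prefixes such as np. or os. are the safer habit.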

Consoles: python, IPython, IDEs

So far we have been using the simple Python interpreter to run Python code, which is the most basic way to interact with Python. However, there are other ways that offer improved functionality to help you make the most of your Python experience. Let's explore some of the basic methods in brief.

Sticking to the most basic, let's run a "hello world" from the Python interpreter.

print("Hello world!")
↳ Hello world!

Easy peasy.

Now for level 2, open a blank text file, using something like "wordpad" or "notepad", and enter the same command as before:

print("Hello world!")

Now save your file as something like "PythonHello.py" somewhere you can access easily. Once it's saved, open a terminal (not Python, just a plain terminal) and activate your environment. Finally, go to the directory with your file and type,

python PythonHello.py
↳ Hello world!

Voilà, you just ran your first python script! Just like when you type commands in the python interpreter, the lines in the script are run in sequence.
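To illustrate that the lines of a script run one after another, the sketch below writes a slightly longer, hypothetical version of PythonHello.py into a temporary folder and runs it the same way the terminal command does. The extra lines involving x are made up for this example.

```python
import os
import subprocess
import sys
import tempfile

# Contents of a slightly longer, hypothetical version of PythonHello.py.
script_text = 'print("Hello world!")\nx = 1\nx = x + 1\nprint(x)\n'

with tempfile.TemporaryDirectory() as folder:
    script_path = os.path.join(folder, "PythonHello.py")
    with open(script_path, "w") as script_file:
        script_file.write(script_text)

    # Equivalent to typing "python PythonHello.py" in the terminal, using
    # the same interpreter that is running this code.
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        check=True,
    )

# The greeting is printed first, then the final value of x.
print(result.stdout)
```

The output shows the greeting followed by 2, because x is first set to 1 and then increased before the last line prints it.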

IPython

Through the years, many projects have tried to improve on the basic python interpreter to make writing and debugging easier. One of the most popular is called IPython. When you installed Spyder earlier, you actually installed IPython, because Spyder uses IPython to function. So you are all set to enter an IPython console. From a terminal, in your environment, simply type:

ipython
↳ Python 3.7.0 (default, Oct  9 2018, 10:31:47)
↳ Type 'copyright', 'credits' or 'license' for more information
↳ IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

This is now an IPython console, which gives a more interactive experience. Try the following: in your IPython console, import numpy as np, then type np.min( but instead of hitting enter, hit tab. Like magic, you can see all the potential inputs to the np.min function. This is by no means the only advantage of IPython, but most of the others will come later when doing things like plotting.

Spyder

Going a step beyond interpreters and interactive consoles, we have Integrated Development Environments, or IDEs. There are many IDEs for Python, so many that we won't go into them. As a particular example, we will look at Spyder, which is a powerful and popular choice. If you remember, we installed Spyder as an example earlier. You can launch Spyder from the command line by simply typing spyder, which should start a simple-looking but functional GUI with a couple of sub-windows. One sub-window should be the "editor", in which you can write scripts, and another should be the IPython console. This is exactly the same as if you were writing your script in a text editor and had your IPython console open in a terminal, except you get an integrated environment. This can be really handy. For example, if you open your "PythonHello.py" script from earlier, then with your cursor in your script press "F5", you can see that your script runs in the IPython console directly. Neat! There are plenty more nifty features for you to explore, but do it on your own time.

Jupyter

Another powerful way to interact with Python that has become increasingly popular in the last few years is the interactive notebook, the most common of which is the Jupyter notebook. These are Python instances you can interact with using your browser, and they allow you to intermingle text and bits of code all in one document. These documents can be a good way to document code or just a nice environment to code in, and they even render on sites such as GitHub. You can see one such example here.

To get a Jupyter notebook up and running, you first need to install Jupyter (conda install jupyter). Once installed, run the command jupyter notebook in a terminal. This should open a web browser with the Jupyter interface; if it fails, just copy and paste the URL from the terminal (or click on the URL while holding Ctrl). You should see all the files contained in the folder where you launched Jupyter. Now open a new notebook by clicking new -> Python3. Now you are in a Jupyter notebook, in which you can type bits of code in a block and execute them (Run Cells). Go ahead and try a Hello World.

Which way to interact with python?

As we have seen, there are many different ways to interact with Python, from directly running commands in the Python interpreter to Jupyter notebooks. Any of these methods will work for the exercises in the following chapters, so feel free to go back and forth and find what is comfortable.