Installation of Python and Package Management
Chapter cheat sheet
command | function |
---|---|
python | launch python interpreter |
conda create --name pyearthsci | make an Anaconda environment called "pyearthsci" |
conda activate pyearthsci | activate the Anaconda environment called "pyearthsci" |
conda install spyder | install the package called "spyder" |
import numpy as np | import the package numpy with the alias "np" |
python PythonHello.py | run the python script called "PythonHello.py" |
ipython | launch an IPython console |
jupyter notebook | launch a Jupyter notebook |
Introduction
If you are currently using a recent Mac or Linux operating system, open a terminal and type,
python
and you should see something like,
Python 2.7.12
Type "help", "copyright", "credits" or "license" for more information.
You have just entered the native installation of Python on your computer, no extra steps needed. This is because, though it is a great tool for earth science and data analytics, Python is a general purpose language that is used by all sorts of programs and utilities. While it is nice that Python is such an open and widely used tool, you should take care not to modify this native installation to the point that other useful and essential utilities that depend on it are disrupted. For instance, a package or command may no longer be installed where the operating system originally put it. For this reason, this chapter will outline how to install a modern version of Python, as well as many packages useful for data science, in a tidy environment all its own.
Python, a brief history
As the story goes, in 1989 Guido van Rossum decided he needed something to do over the Christmas holidays, and instead of reading a nice book or learning to brew his own beer, he decided to develop a scripting language, which he named Python after Monty Python’s Flying Circus. Since then, Python has come to be known for several core principles, the most notable of which are a focus on readability and requiring fewer lines of code. Because of this, lines of code can almost be read as a sentence in plain English. For example, if I would like to add one to every number in my list, I would write,
[number+1 for number in my_list]
Though this may look daunting if you are new to coding, if you read it out loud you can almost hear what it does. Along the same lines, the Python philosophy tends not to be that there are many clever ways to do one thing, but one very clear way. Because of these ideas, Python can be a very useful and rewarding coding language to learn, which is reflected in its popularity.
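As a small sketch of the readability point above, the comprehension can be placed next to an explicit loop that does the same job. The contents of my_list are made up for illustration.

```python
# Hypothetical example data; any list of numbers works.
my_list = [1, 2, 3]

# The comprehension from the text: add one to every number in my_list.
plus_one = [number + 1 for number in my_list]

# The same operation written as an explicit loop, for comparison.
plus_one_loop = []
for number in my_list:
    plus_one_loop.append(number + 1)

print(plus_one)       # [2, 3, 4]
print(plus_one_loop)  # [2, 3, 4]
```

Both versions produce the same list; the comprehension simply says it in one line.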
Python 2 vs Python 3
As you start in Python, you will quickly find yourself wondering why there are two different versions being used. Python 2 was released in 2000 as the first major update, and many programs have been written using this flavor. However, in 2010 with the release of version 2.7, it was announced that Python 2 would be phased out in favor of the new Python 3, so there is no plan for a version 2.8. This major update from 2 to 3 was made to change some small yet significant things in the language, such as how it handles text data and iterates through lists and dictionaries. The idea is that it is better to update a language to fix things than to keep dealing with small bugs because of a refusal to change. As Python 2 is scheduled to be retired in the next few years, this manual will focus on using Python 3. This does mean that your Python 3 code may not work with your native Python 2 installation, but in the realm of data science, as you will be using so many specialized packages of code, this would be the case anyway. In the end, you will be using a self-contained Python environment that contains your Python installation, as well as all the code you will be using, in one neat little box.
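To make the two changes mentioned above a little more concrete, here is a minimal sketch of how Python 3 treats text and dictionary iteration. The station names and values are invented for the example, and this is only a sample of the differences, not a complete list.

```python
# Text is Unicode by default in Python 3; raw bytes are a separate type.
text = "Zürich"
raw = text.encode("utf-8")
print(type(text).__name__, len(text))  # str 6
print(type(raw).__name__, len(raw))    # bytes 7

# Looping over a dictionary uses lightweight "views" instead of building lists.
temperatures = {"station_a": 12.5, "station_b": 9.8}  # hypothetical data
items = temperatures.items()
print(type(items).__name__)  # dict_items
for name, value in items:
    print(name, value)
```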
Environments and packages
Using other people’s code
As Python is a general purpose language, the basic functionality out of the box is also very general: things such as basic math, file manipulation, and printing output. So if you want to do anything beyond what is defined in the core language, you need to write your own little bit of code to do it. However, as you are taking a Python course, you can assume that the first time you need a bit of code that the core Python doesn’t have built-in, something like calculating the standard deviation of a set of numbers, someone else will have probably run into the same issue before you. Luckily, the Python community is very active in writing these bits of code and sharing them so that you don’t have to write every function from scratch. Not only that, many of these little bits of code have been bundled into large collections of code called packages. For example, the mean, median, standard deviation, percentile, and other statistical functions are already built into a package called NumPy (Numerical Python), which gives you access to a whole bunch of bits of code. On top of this, there are dedicated package managers that will take care of downloading and installing the package, as well as making sure it plays nicely with all the other packages you are using. All you have to do is tell it which package you want!
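The following sketch contrasts writing a standard deviation by hand with calling the NumPy functions named above. It assumes NumPy is already installed (installation is covered later in this chapter), and the sample values are made up.

```python
import numpy as np

# Made-up sample values, for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Writing it yourself: population standard deviation from its definition.
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
std_by_hand = variance ** 0.5

# Using code someone else already wrote and tested.
print(np.mean(values))            # 5.0
print(np.median(values))          # 4.5
print(np.std(values))             # 2.0
print(np.percentile(values, 50))  # 4.5
print(std_by_hand)                # 2.0
```

The hand-written version and np.std agree, but the package version is one line and already handles details you would otherwise have to think about.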
Which package manager to use?
Probably the most common package manager is called pip. pip is a wonderfully useful tool that is widely supported, which you will not use. Instead you will use Anaconda for the following reasons:
- Anaconda is designed for data science.
- Anaconda will handle not only the Python packages, but non-Python libraries such as HDF5 (which allows us to read some data files) and the Math Kernel Library. It will even manage an R installation.
- Anaconda also manages environments which are important because they:
  - Keep your Python installations working together.
  - Keep separate collections of packages in case some don’t work well together.
  - Are duplicatable and exportable, so your work can be replicated.
Versions, packages, environments, why so complicated?
Though this all may seem a bit complicated to just make a plot or do some math, it becomes necessary because of two main issues: the computer needs to know where to look for things, and what to call them.
Just like when you go back to look at the wonderful photos you took on vacation 3 years ago, only to find a giant mess of folders and sub-folders to go through, your computer also has to look through all its storage to find where a bit of code might be located. When properly managed, all the files are put in the appropriate place, where the computer can easily find them. Similarly, if you have a file in the folder Photos/ called MyBestPicture.jpg, and a different file in the folder Photos2/ also called MyBestPicture.jpg, then when you tell your computer you want MyBestPicture.jpg, it has no idea which one you mean. By using these tools, you keep everything nice and tidy.
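You can ask Python itself where it looks for code and which file it would pick for a given name. The sketch below uses only the standard library; the missing module name is a made-up placeholder.

```python
import importlib.util
import sys

# sys.path is the ordered list of folders Python searches when you import something.
for folder in sys.path[:3]:
    print("search location:", folder)

# find_spec reports which file would be used for a given module name.
spec = importlib.util.find_spec("json")  # "json" ships with Python itself
print("json would be loaded from:", spec.origin)

# A name that does not exist anywhere on the search path gives None.
missing = importlib.util.find_spec("no_such_module_pyearthsci")  # hypothetical name
print("missing module:", missing)
```

Inside a dedicated environment, these locations point into that environment's own folder, which is what keeps separate installations from stepping on each other.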
Installing Anaconda
Anaconda is a commercially maintained but free and open-source package manager designed for data science. As such, the developers have made it quite easy to install on Windows, Mac, and Linux. Simply go to https://www.anaconda.com/download/, find your operating system, and download the appropriate Python 3.7 installer. Again, you want to use version 3.7, but if you end up mixing up versions or already have another version installed, don’t panic: you can create a Python 3.7 environment later.
Windows installation notes
Installation on Microsoft Windows is fairly straightforward, but can take quite some time. Simply follow the graphical installer; the only thing to change is to uncheck the option to register Anaconda as the default Python installation. Though this is not as vital as with Unix based systems, it is still a good idea. After the long installation process, you can access an Anaconda command line via Anaconda Prompt in the Start Menu.
OSX installation notes
Installation on OSX should be quite straightforward: simply follow the installation guide of the graphical installer.
Linux installation notes
Once the file has been downloaded, open a terminal and navigate to where the file was saved. The file installer is a bash script, which can be run by entering
bash Anaconda3-FILE-NAME.sh
where Anaconda3-FILE-NAME.sh is the name of your file. The installer will ask you to review the licence information and agree. You will then be asked if you would like to install Anaconda in another location, and you can simply install into the default location. The installer will then proceed to install Anaconda on your machine. Once the installation is complete, the installer will display the message “You may wish to edit your .bashrc or prepend the Anaconda3 install location:”, followed by a suggested command that looks something like,
export PATH=/YOUR/PATH/TO/anaconda3/bin:$PATH
In order to make Anaconda work, you need to add the file path of Anaconda to a variable the operating system uses called $PATH. To do this, you can add a modified version of this line to a file called .bashrc in your home folder. Simply go to your home folder and open the file .bashrc with a text editor, and at the end of the file add the line,
export PATH=$PATH:/YOUR/PATH/TO/anaconda3/bin
where the /YOUR/PATH/TO/anaconda3/bin is the same one that Anaconda suggested at the end of installation. If you forgot it, it should be something like
/home/YOURNAME/anaconda3/bin
You may notice that we just switched where $PATH goes in that command. This is because you want to add your Anaconda location to the end of $PATH, meaning that the operating system looks in this folder last instead of first. This ensures that you don’t cause any problems with the native Python installation.
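The effect of search order can be sketched in Python with shutil.which, which looks through a list of folders the same way the shell does. Everything here is a stand-in: two temporary folders play the roles of the system and Anaconda locations, and demo_tool is a hypothetical command name. The sketch assumes a Unix-like system (Mac or Linux).

```python
import os
import shutil
import stat
import tempfile

def make_fake_tool(folder, name="demo_tool"):
    """Create a small executable file called `name` inside `folder`."""
    path = os.path.join(folder, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path

with tempfile.TemporaryDirectory() as system_dir, tempfile.TemporaryDirectory() as conda_dir:
    make_fake_tool(system_dir)
    make_fake_tool(conda_dir)

    # Anaconda folder added at the end: the system copy is found first.
    appended = os.pathsep.join([system_dir, conda_dir])
    system_first = os.path.dirname(shutil.which("demo_tool", path=appended)) == system_dir

    # Anaconda folder added at the front: the Anaconda copy would win instead.
    prepended = os.pathsep.join([conda_dir, system_dir])
    conda_first = os.path.dirname(shutil.which("demo_tool", path=prepended)) == conda_dir

print(system_first)  # True
print(conda_first)   # True
```

The first matching folder wins, which is why the position of the Anaconda folder in $PATH decides whether the native tools or the Anaconda ones are picked up by default.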
Creating your first environment
First, you will verify that your Anaconda installation is working. To do so, open a new command line and simply type,
conda
You should see a nice overview of how to use the conda command. If this is not the case, either the installation didn’t work, or you might have a problem with your PATH (where the computer looks for commands). But, if it worked, you can move on to creating your first environment. You will name the environment pyearthsci and you will initially only install the numpy package. In the command line, input:
conda create --name pyearthsci numpy
You will be asked if you would like to proceed in installing a bunch of new packages, way more than numpy, and you can say yes. The reason so many new packages were listed is the magic of a package manager. The basic Python 3 with the numpy package actually depends on all these underlying dependencies, which Anaconda kindly figures out for you. So now you have your nice new environment, and you can activate it by entering
conda activate pyearthsci
on Mac or Linux and
activate pyearthsci
on Windows (recent versions of conda also accept conda activate pyearthsci there).
Your command line should now tell you that you are in the pyearthsci environment. If you now open a Python console by typing python in the command line, the reported version should now be a Python 3 release, such as 3.7. In this same manner, you can do things like duplicate and export your environments, or make new environments with different packages or even different Python versions.
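If you want to double check from inside Python which interpreter and environment you ended up in, a few standard library calls are enough. This is only a sketch: the CONDA_DEFAULT_ENV variable is assumed to be set by conda when an environment is activated, and will simply be absent otherwise.

```python
import os
import sys

# Version of the interpreter that is actually running this code.
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))

# Location of the interpreter; inside an activated environment this usually
# points into that environment's folder.
print("Interpreter:", sys.executable)
print("Environment prefix:", sys.prefix)

# Assumption: conda sets CONDA_DEFAULT_ENV on activation; it is absent otherwise.
print("Active conda environment:", os.environ.get("CONDA_DEFAULT_ENV", "(none detected)"))

is_python3 = sys.version_info.major == 3
print("Running Python 3:", is_python3)
```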
Installing a package
Now that you are in your nice new environment, you can add any package you might need. Open a command line and enter the pyearthsci environment. Now to install the Spyder package, you simply enter,
conda install spyder
Anaconda will list all the package changes it will make, and ask if you would like to proceed. Confirm yes, then let the magic happen. Now you have the Spyder IDE, which you can use to develop code (similar concept to R Commander or the MATLAB IDE). You can also install multiple packages at a time,
conda install scipy matplotlib
or even pass a yml file describing the environment and the packages you need with
conda env update --name pyearthsci --file pyearthsci.yml
Anaconda has some nice documentation about how to use their software, including how to search for packages not in their repositories, which we will not cover here. Now that you have your installation and environment all sorted out, you can start to explore Python itself a bit in the next chapters.
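After installing, you may want to confirm from Python which packages the active environment actually contains. The helper below is a hypothetical convenience function, not part of conda, and it assumes Python 3.8 or newer, where importlib.metadata is available in the standard library.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Package names taken from the commands used in this chapter.
for name in ["numpy", "scipy", "matplotlib", "spyder"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed in this environment")
```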
Importing packages
Now that you have your python environment set up and packages installed, you're ready to use them. Open a python console by typing python in a command line, just like you did at the beginning of this introduction. You have entered the python console, the real workhorse of python development. Now, try importing numpy:
import numpy
You have now successfully imported the all-powerful numpy library! Feel the power of linear algebra in your hands! If you need further proof that you are actually wielding the power of numpy, we can do a quick test by verifying which module a function is coming from, specifically two min (minimum) functions:
min.__module__
↳ 'builtins'
numpy.min.__module__
↳ 'numpy.core.fromnumeric'
What you see is that the first function, min, comes from builtins, which means it is a core python function, an integral part of the language. In contrast, numpy.min is a function of the numpy module, which offers gains in speed and functionality. It is important to note that while both functions will tell you the minimum of a set of numbers, they are not the same function, and can cause you headaches if you mix them up. That is why, when you import numpy, you access numpy functions by referencing the package explicitly, as in numpy.min. However, typing numpy. over and over can be tedious, so we can give it a nickname, or alias:
import numpy as np
np.min.__module__
↳ 'numpy.core.fromnumeric'
This way we can track which function comes from where and always know what package we are actually using.
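To see why it matters which min you are calling, here is a small sketch with a made-up two-row array. It assumes numpy is installed in the active environment.

```python
import numpy as np

# A small 2-D array of made-up values: two rows, three columns.
grid = np.array([[3, 1, 4],
                 [1, 5, 9]])

# numpy's min understands array axes.
print(np.min(grid))          # 1  (smallest value overall)
print(np.min(grid, axis=0))  # [1 1 4]  (smallest value in each column)

# The builtin min compares whole rows and cannot decide which row is "smaller".
builtin_failed = False
try:
    min(grid)
except ValueError as error:
    builtin_failed = True
    print("builtin min failed:", error)
```

With the np. prefix in place, it is always obvious from the code which of the two functions is being used.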
Cautionary note:
You may encounter someone importing functions using something like from numpy import *, and this is bad practice! If you don't want to read the explanation, here is the short version: don't import things this way. Now if you want to know why, read on.
This command tries to import everything from numpy so you do not have to type the 'numpy.' prefix. Here is an example of why it should be avoided. Exit out of python (Ctrl-D or exit()) and open a fresh python console (just to refresh everything). Now try:
min.__module__
↳ 'builtins'
amin.__module__
↳ Traceback (most recent call last):
↳ File "<stdin>", line 1, in <module>
↳ NameError: name 'amin' is not defined
Here, we looked for the parent module of min and amin. Python was unable to find amin, which makes sense because it is a function in the numpy package that calculates the minimum along an axis (axis minimum). Now if we import numpy the bad way, here is what you get:
from numpy import *
min.__module__
↳ 'builtins'
amin.__module__
↳ 'numpy.core.fromnumeric'
Notice that the min function is still part of builtins but amin is now part of numpy, which means that numpy's own min is actually missing: calling min still gives you the builtin version. The consequence is that we have a really hard time keeping track of which functions are coming from which package. This may all seem a bit silly, but take my word for it, this can really cause headaches, so just avoid it.
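The same hazard can be shown with two modules from the standard library that both define a function called sqrt. This is a sketch of the general problem rather than of numpy specifically, and the star imports are run inside a scratch dictionary so they do not clutter your own session.

```python
# Run the star imports inside a scratch namespace (a plain dict) so this
# demonstration does not clutter the session it is run from.
scratch = {}

exec("from math import *", scratch)
first_owner = scratch["sqrt"].__module__
print(first_owner)  # math

# A second star import silently replaces any names the two modules share.
exec("from cmath import *", scratch)
second_owner = scratch["sqrt"].__module__
print(second_owner)  # cmath

# Same name, different function, different kind of answer.
print(scratch["sqrt"](4))  # (2+0j)
```

No warning is given when the second import replaces sqrt, which is exactly how these bugs sneak in.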
Consoles: python, IPython, IDEs
So far we have been using the simple python interpreter to run python code, which is the most basic way to interact with python. However, there are other ways which try to give improved functionality to make the most of your python experience. Let's explore some of the basic methods in brief.
Sticking to the most basic, let's run a "hello world" from the python interpreter.
print("Hello world!")
↳ Hello world!
Easy peasy.
Now for level 2, open a blank text file, using something like "wordpad" or "notepad", and enter the same command as before:
print("Hello world!")
Now save your file as something like "PythonHello.py" somewhere you can access easily. Once it's saved, open a terminal (not python, just a plain terminal) and activate your environment. Finally, go to the directory with your file and type,
python PythonHello.py
↳ Hello world!
Voilà, you just ran your first python script! Just like when you type commands in the python interpreter, the lines in the script are run in sequence.
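As a sketch of what "run in sequence" means, here is a slightly longer, hypothetical version of PythonHello.py. The function name make_greeting is made up for this example.

```python
# Contents of a hypothetical, slightly longer PythonHello.py

def make_greeting(name="world"):
    """Build the greeting text; defined first so later lines can use it."""
    return "Hello " + name + "!"

# Statements at the top level run from top to bottom when the file is executed.
greeting = make_greeting()
print(greeting)                     # Hello world!
print(make_greeting("pyearthsci"))  # Hello pyearthsci!
```

Because the lines run in order, the function has to be defined before the lines that call it.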
IPython
Through the years, many projects have tried to improve on the basic python interpreter to make writing and debugging easier. One of the most popular is called IPython. When you installed Spyder earlier, you actually installed IPython, because Spyder uses IPython to function. So you are all set to enter an IPython console. From a terminal, in your environment, simply type:
ipython
↳ Python 3.7.0 (default, Oct 9 2018, 10:31:47)
↳ Type 'copyright', 'credits' or 'license' for more information
↳ IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
This is now an IPython console, which gives a bit more of an interactive experience. Try the following: in your IPython console, import numpy as np, then type
np.min(
but instead of hitting enter, hit tab. Like magic, you can see all the potential inputs to the np.min function. This is by far not the only advantage of IPython, but most of the others will come later when doing things like plotting.
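Outside of IPython you can get similar information about a function's inputs with the standard library inspect module. The sketch below uses statistics.pstdev from the standard library as a stand-in; it is not what IPython does internally.

```python
import inspect
import statistics

# Ask Python which inputs a function accepts, without an interactive console.
signature = inspect.signature(statistics.pstdev)
print(signature)

# List each parameter together with its default value (if it has one).
for name, parameter in signature.parameters.items():
    print(name, "default:", parameter.default)
```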
Spyder
Going a step past our interpreters and interactive consoles, we have Integrated Development Environments, or IDEs. There are many IDEs for python, so many that we won't go into them all. As a particular example, we will look at Spyder, which is a powerful and popular choice. If you remember, we installed Spyder as an example earlier.
You can launch Spyder from the command line by simply typing spyder, which should start a simple looking but functional GUI with a couple of sub-windows. One sub-window should be the "editor", in which you can write scripts, and another should be the IPython console. This is exactly the same as if you were writing your script in a text editor and had your IPython console open in a terminal, except you get an integrated environment. This can be really handy. For example, if you open your "PythonHello.py" script from earlier, then with your cursor in your script press "F5", you can see that your script runs in the IPython console directly. Neat! There are plenty more nifty features for you to explore, but do it on your own time.
Jupyter
Another powerful way to interact with python that has become increasingly popular in the last few years is the interactive notebook, the most common of which is the Jupyter notebook. These are python instances you can interact with using your browser, and they allow you to intermingle text and bits of code all in one document. These documents can be a good way to document code or just a nice environment to code in, and they even render on sites such as GitHub. You can see one such example here.
To get a Jupyter notebook up and running, you first need to install Jupyter (conda install jupyter). Once installed, run the command jupyter notebook in a terminal. This should open a web browser with the Jupyter interface. If that fails, just copy and paste the URL from the terminal (or click on the URL while holding Ctrl). You should see all the files contained in the folder where you launched Jupyter. Now open a new notebook by clicking New -> Python 3. Now you are in a Jupyter notebook, in which you can type bits of code in a block and execute them (Run Cells). Go ahead and try a Hello World.
Which way to interact with python?
As we have seen, there are many different ways to interact with python, from directly running commands in the python interpreter to Jupyter notebooks. Any of these methods will work for the exercises in the following chapters, so feel free to go back and forth and find what is comfortable.