Data Operations in Python

Size and Shape

All data are characterized by two things: how big they are (size) and how they are arranged (shape). This section introduces some useful commands for inspecting and changing the size and shape of data. The functions introduced here are provided by the numpy package. Readers are encouraged to read the full numpy documentation.

We will use the following array as an example:

import numpy as np
a=np.arange(1,11)

Check the size (total number of elements) of the data:

np.size(a)
↳ 10  

Check the shape of the data:

np.shape works for arrays as well as lists.

np.shape(a)
↳ (10,)

For an array, the shape, size and number of dimensions (among others) are also stored as attributes. They can be accessed as follows:

a.shape
↳ (10,)
a.size
↳ 10
a.ndim
↳ 1

Note that attributes are accessed without () at the end. A method, in contrast, must be called with () at the end.

a.transpose() # method: only works if a is a numpy array
np.transpose(a) # function: works whether a is an array or a list
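The difference between the function form and the attribute form can be seen side by side in a minimal sketch. The variable names (as_list, size_from_list and so on) are illustrative only and do not appear in the text above.

```python
import numpy as np

a = np.arange(1, 11)          # array holding 1 .. 10
as_list = list(range(1, 11))  # the same values as a plain Python list

# Function form: accepts arrays and lists alike.
size_from_list = np.size(as_list)
shape_from_list = np.shape(as_list)

# Attribute form: available on arrays only, written without parentheses.
shape_attr = a.shape
ndim_attr = a.ndim

# A plain list carries no shape attribute.
list_has_shape = hasattr(as_list, "shape")

print(size_from_list, shape_from_list, shape_attr, ndim_attr, list_has_shape)
# 10 (10,) (10,) 1 False
```

When the input may be either a list or an array, the function form is the safer choice.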

Reshape the data

Numpy provides a straightforward way to reorder and reorganize the data in an array.

b=a.reshape(2,5)
  • Note that for 2-dimensional arrays in Python, the order is number of rows (downward direction) followed by number of columns (rightward direction).

  • Once again, reshape as a method can be used on arrays only.

b=np.array([[1,2,3,4,5],[6,7,8,9,10]])
b=np.reshape(a,(2,5))

np.reshape can be used for both arrays and lists. A list is converted to an array automatically by this function.

b=a.reshape(-1,5)

With -1 as the first dimension, the number of rows is inferred automatically from the total size of the array. For example, if an array or list has 10 elements and 5 columns are specified during reshape, the number of rows is calculated as 2, and the shape will be (2,5).
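The following sketch illustrates how -1 is resolved, and what happens when the requested shape does not fit the number of elements. The names c and reshape_failed are illustrative only.

```python
import numpy as np

a = np.arange(1, 11)   # 10 elements

b = a.reshape(-1, 5)   # rows inferred: 10 / 5 = 2
c = a.reshape(5, -1)   # columns inferred: 10 / 5 = 2

print(b.shape)  # (2, 5)
print(c.shape)  # (5, 2)

# 10 elements cannot be arranged in rows of 3, so numpy raises ValueError.
try:
    a.reshape(-1, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(reshape_failed)  # True
```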

  • Convert the data type to array:
b=np.array(a)
  • Convert the data type to list:
b=a.tolist()
  • Convert a single value to float or integer:
float(a[0])
int(a[0])

float and int are native Python functions.
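As a small sketch of these conversions: float() acts on one value at a time, while the astype method of an array converts every element at once. The variable names are illustrative only.

```python
import numpy as np

a = np.arange(1, 11)

first_as_float = float(a[0])     # a single plain Python float
all_as_float = a.astype(float)   # new float array; a itself is unchanged
back_to_list = a.tolist()        # list of plain Python integers

print(first_as_float)      # 1.0
print(all_as_float.dtype)  # float64
print(back_to_list[:3])    # [1, 2, 3]
```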

Slicing and Indexing

This section explains how to extract data from an array or a list. The following process can be used to take data for a region from global data, or for a limited period from long time series data. The process is called 'slicing'.

For one-dimensional data, the same method can be used for arrays and lists. Let's consider the following list:

a=[1,2,3,4,5]

There are five items in the list.

Index Basics

Indexing is done in two ways:

  1. Positive Index: The counting order is from left to right. The index for the first element is 0 (not 1).
a[0]
↳ 1  
a[1]
↳ 2
a[4]
↳ 5

The fifth item (index=4) is 5.

  2. Negative Index: The counting order is from right to left. The index for the last item is -1. In some cases, the list is very long and it is much easier to count from the end rather than the beginning.

a[-1]
↳ 5

It is the same as a[4], shown above.

a[-2]
↳ 4
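The relation between the two ways of counting can be checked with a short sketch in plain Python: for a list of length n, a[-k] is the same item as a[n-k]. The names n, pairs and out_of_range are illustrative only.

```python
a = [1, 2, 3, 4, 5]
n = len(a)

# Pair each negative index with its positive counterpart.
pairs = [(a[-k], a[n - k]) for k in range(1, n + 1)]
print(pairs)  # [(5, 5), (4, 4), (3, 3), (2, 2), (1, 1)]

# An index beyond the last item raises IndexError.
try:
    a[5]
    out_of_range = False
except IndexError:
    out_of_range = True

print(out_of_range)  # True
```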

Data Extraction

Data extraction is carried out by using indices. In this section, some examples of using indices are provided. Details of array indexing and slicing can be found in the numpy documentation.

  • Using two indices: somelist[first index:last index:interval], where the interval is optional
a[0:2]
↳ [1,2]

a[0] and a[1] are included, but a[2] is not.

a[3:4]
↳ [4]
  • Using single index:
a[:2]
↳ [1,2]

This is the same as a[0:2].

a[2:]
↳ [3,4,5]

This is the same as a[2:5].

  • Consider a 2-D list and a 2-D array

Lists and arrays are indexed differently in this case, as explained below.

a_list=[[1,2,3],[4,5,6]]
a_array=np.array([[1,2,3],[4,5,6]])
np.shape(a_list)
↳ (2,3)
a_array.shape
↳ (2,3)
a_list[0]
↳ [1,2,3]

which is a list.

a_array[0]
↳ array([1,2,3])

which is an array.

  • To extract data from list,
a_list[0][1]
↳ 2
a_list[1][:2]
↳ [4,5]

The index has to be provided in two different sets of square brackets "[ ]".

  • To extract data from array,
a_array[0,1]
↳ 2
a_array[1,:2]
↳ array([4, 5])

The indices can be provided in one set of square brackets "[ ]".

  • Consider a 3-D list and 3-D array,
a_list=[[[2,3],[4,5],[6,7],[8,9]],[[12,13],[14,15],[16,17],[18,19]]]
a_array=np.array([[[2,3],[4,5],[6,7],[8,9]],[[12,13],[14,15],[16,17],[18,19]]])

Both have the shape (2,4,2).

To extract from list,

a_list[0][2]
↳ [6,7]
a_list[0][2][1]
↳ 7

To extract from array,

a_array[0,2]
↳ array([6,7])
a_array[0,2,1]
↳ 7
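Two points from this section can be illustrated with a brief sketch: the optional interval in a slice, and the practical consequence of the different indexing of lists and arrays when a column is needed. The variable names are illustrative only.

```python
import numpy as np

a = [1, 2, 3, 4, 5]
every_other = a[0:5:2]   # interval of 2: indices 0, 2, 4
reversed_a = a[::-1]     # interval of -1 walks the list backwards

a_list = [[1, 2, 3], [4, 5, 6]]
a_array = np.array(a_list)

# For a nested list, a_list[:] is just a copy of the outer list,
# so a_list[:][1] is the second ROW, not the second column.
second_row = a_list[:][1]

# A column of a nested list needs a loop or a comprehension ...
col_from_list = [row[1] for row in a_list]
# ... while an array accepts the row and column slices in one bracket.
col_from_array = a_array[:, 1]

print(every_other)     # [1, 3, 5]
print(reversed_a)      # [5, 4, 3, 2, 1]
print(second_row)      # [4, 5, 6]
print(col_from_list)   # [2, 5]
print(col_from_array)  # [2 5]
```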

Built-in Mathematical Functions

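Python and numpy provide a number of ready-made mathematical functions. This section lists the most commonly used ones in easy-to-use order. Functions written with the np. prefix come from numpy and operate on whole arrays; the others are built into the Python interpreter. First, consider the following 2-D arrays: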

A=np.array([[-2, 2], [-5, 5]])
B=np.array([[2, 2], [5, 5]])
C=np.array([[2.53, 2.5556], [5.3678, 5.4568]])
  • max(iterable): Returns the largest element of the array or list passed to it. (Python's built-in max also accepts two or more separate arguments and returns the largest of them.)
np.max([0,10,15,30,100,-5])
↳ 100
A.max()
↳ 5
- For data with not-a-number (NaN) values, use nanmax.
  • min(iterable): Returns the smallest element of the array or list passed to it. (Python's built-in min also accepts two or more separate arguments and returns the smallest of them.)
np.min([0,10,15,30,100,-5])
↳ -5
A.min()
↳ -5
- For data with not-a-number (NaN) values, use nanmin.
  • mean(iterable): Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis. For details, see the numpy documentation of mean.
np.mean([0,10,15,30,100,-5])
↳ 25.0
A.mean()
↳ 0.0 
- For data with not-a-number (NaN) values, use nanmean.
  • median(iterable): Returns the median of the array elements.
np.median([0,10,15,30,100,-5])
↳ 12.5 
np.median(A)
↳ 0.0 
- For data with not-a-number (NaN) values, use nanmedian. Note that median is available only as a numpy function, not as an array method.
  • sum(iterable): Returns the sum of the array elements. If an axis is specified, the sum is taken over that axis; otherwise all elements are summed. For details, see the numpy documentation of sum.
np.sum([1,2,3,4])
↳ 10 
A.sum()
↳ 0 
- For data with not-a-number (NaN) values, use nansum.
  • abs(A): Returns the absolute value of a number, which can be an integer or a float, or an entire array.
np.abs(A)
↳ array([[2,2],[5,5]])
abs(B)
↳ array([[2, 2], [5, 5]])
  • divmod(x,y): Returns the quotient and remainder resulting from dividing the first argument (some number x or an array) by the second (some number y or an array).
np.divmod(2, 3)
↳ (0, 2)

- as 2 // 3 = 0 (integer division) and the remainder is 2.

np.divmod(4, 2)
↳ (2, 0)
- as 4 // 2 = 2 and the remainder is 0.

- In the case of two-dimensional array data:
np.divmod(A,B)
↳ (array([[-1, 1], [-1, 1]]), array([[0, 0],[0, 0]]))
  • modulo (x%y): Returns the remainder of a division of x by y.
5%2
↳ 1
  • power(x,y): Returns x to the power y, element by element for arrays. np.power(x, y) is equivalent to x**y. Python's built-in pow(x,y[, z]) additionally accepts a third argument z, in which case it returns x to the power y modulo z (more efficient than pow(x, y)%z).
np.power(A,B)
↳ array([[4, 4], [-3125, 3125]])
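  • round(x,n): Python's built-in function. Returns the floating point value of x rounded to n digits after the decimal point. Because most decimal fractions cannot be represented exactly as floats, the result is sometimes not the one expected, as in the following example.
round(2.675,2)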
↳ 2.67
  • around(A,n): Returns the floating point array A rounded to n digits after the decimal point.
np.around(C,2)
↳ array([[ 2.53, 2.56], [ 5.37, 5.46]])
  • range([x],y[,z]): This built-in Python function generates integers in an arithmetic progression. It is primarily used in for loops. The arguments must be plain integers. In Python 3, range returns a range object; wrap it in list() to see the values.

    • If the step argument (z) is omitted, it defaults to 1.

    • If the start argument (x) is omitted, it defaults to 0.

    • The full form yields the integers x, x + z, x + 2*z, ... and stops before reaching y.

    • If step (z) is positive, the last element is the largest 'start (x) + i * step (z)' that is less than 'y'.

    • If step (z) is negative, the last element is the smallest 'start (x) + i * step (z)' that is greater than 'y'.

    • If step (z) is zero, ValueError is raised.

list(range(10))
↳ [0,1,2,3,4,5,6,7,8,9]
list(range(1,11))
↳ [1,2,3,4,5,6,7,8,9,10]
list(range(0,20,5))
↳ [0,5,10,15]
list(range(0,-5,-1))
↳ [0,-1,-2,-3,-4]
list(range(0))
↳ []
  • arange([x],y[,z]): This numpy function creates arrays of integers in an arithmetic progression. The arguments have the same meaning as in range().
np.arange(10)
↳ array([0,1,2,3,4,5,6,7,8,9])
np.arange(1,11)
↳ array([1,2,3,4,5,6,7,8,9,10])
np.arange(0,20,5)
↳ array([0,5,10,15])
np.arange(0,-5,-1)
↳ array([0,-1,-2,-3,-4])
np.arange(0)
↳ array([],dtype=int64)
  • zip(A,B): This built-in Python function pairs up the i-th elements of each argument sequence into tuples. The result is truncated to the length of the shortest sequence. In Python 3, zip returns an iterator; wrap it in list() to obtain a list of tuples. With a single sequence argument, it yields 1-tuples. With no arguments, it yields nothing.
list(zip(A,B))
↳ [(array([-2, 2]), array([2, 2])), (array([-5,5]), array([5, 5]))]
  • sort(): Sorts the array elements in smallest to largest order.
D=np.array([10,2,3,10,100,54])
D.sort()
D
↳ array([2, 3, 10, 10, 54, 100])
  • ravel(): Returns a flattened array. 2-D array is converted to 1-D array.
A.ravel()
↳ array([-2, 2, -5, 5])
  • transpose(): Returns the transpose of an array (matrix) by permuting the dimensions.
A.transpose()
↳ array([[-2, -5], [ 2, 5]])
  • diagonal(): Returns the specified diagonal of the array (the main diagonal by default).
A.diagonal()
↳ array([-2, 5])
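The axis argument and the NaN-aware variants mentioned in the list above can be tried out with a small sketch. The array A is the one defined at the start of this section; the array named data and the other variable names are illustrative only.

```python
import numpy as np

A = np.array([[-2, 2], [-5, 5]])

# axis=0 works down the rows (one result per column),
# axis=1 works across the columns (one result per row).
col_max = A.max(axis=0)
row_sum = A.sum(axis=1)
print(col_max)  # [-2  5]
print(row_sum)  # [0 0]

# A single NaN makes the ordinary mean NaN; the nan* variants skip it.
data = np.array([1.0, np.nan, 3.0])
plain_mean = np.mean(data)
safe_mean = np.nanmean(data)
safe_max = np.nanmax(data)
print(plain_mean, safe_mean, safe_max)  # nan 2.0 3.0
```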

Matrix operations

Numpy provides a suite of matrix calculations. The functions shown below are available directly from the numpy namespace.

  • Dot product:
a=np.random.rand(3,3)
b=np.random.rand(3,3)
dot_p=np.dot(a,b)
- where a and b are two arrays.
- rand is a function provided by the numpy.random submodule. It returns random numbers, so the result differs from run to run.
  • Cross product:
a=np.random.rand(3,3)
b=np.random.rand(3,3)
cro_p=np.cross(a,b)
- where a and b are two arrays. For 3x3 arrays, the cross product is computed row by row.
  • Matrix multiplication:
a=np.random.rand(2,3)
b=np.random.rand(3,2)
mult_ab=np.matmul(a,b)
np.shape(mult_ab)
↳ (2,2)
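Because rand produces different numbers on every run, the results above cannot be checked by hand. The sketch below uses small fixed arrays instead, so that each product can be verified. The @ operator is not introduced in the text above; it is Python's operator form of matrix multiplication. All variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b            # multiplies matching elements only
matrix_prod = np.matmul(a, b)  # rows of a times columns of b
same_as_dot = np.dot(a, b)     # identical to matmul for 2-D arrays
via_operator = a @ b           # operator form of matmul

print(elementwise)  # [[ 5 12] [21 32]]
print(matrix_prod)  # [[19 22] [43 50]]

# Cross product of the x and y unit vectors gives the z unit vector.
x = np.array([1, 0, 0])
y = np.array([0, 1, 0])
z = np.cross(x, y)
print(z)  # [0 0 1]
```

Note that a * b is not matrix multiplication, which is a common source of mistakes.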

String Operations

Let's define a string s as follows:

s='Unicode String' 
  • split(): Splits a string into a list of strings based on a delimiter. The delimiter is an optional argument; if it is omitted, the string is split at blank spaces.
s.split()
↳ ['Unicode','String']
- Blank space as delimiter. Creates a list with elements separated at the locations of blank spaces.
s.split('i')
↳ ['Un', 'code Str', 'ng']
- 'i' as delimiter. Creates a list with elements separated at the locations of 'i'. The delimiter itself is removed.
  • lower() and upper(): Change the string to lower case and upper case, respectively.
s='Unicode String'
s.lower()
↳ 'unicode string' 
s.upper()
↳ 'UNICODE STRING' 
  • count(): Counts the number of occurrences of a substring.
s.count('i')
↳ 2

s.count('I')
↳ 0

There are 2 i's in string s, but no I. Python is case-sensitive.

  • Replace a substring:
s2=s.replace("String", "Text")
s2
↳ 'Unicode Text'
- replace() returns a new string; s itself is unchanged.
  • List to String:
a_list=['a','b','c']
a_str=" and ".join(str(x) for x in a_list)
a_str  
↳ 'a and b and c'
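The string methods above can be combined, as in the following sketch in plain Python. It shows that join() undoes split(), one way to count letters regardless of case, and that replace() leaves the original string untouched. The variable names are illustrative only.

```python
s = 'Unicode String'

words = s.split()            # split at blank spaces
rejoined = ' '.join(words)   # join puts the pieces back together

# count() is case-sensitive; lowering the string first counts both cases.
exact_u = s.count('u')             # the only U in s is upper case
any_case_u = s.lower().count('u')

# replace() returns a new string and does not modify s.
s2 = s.replace('Unicode', 'ASCII')

print(words)                # ['Unicode', 'String']
print(rejoined == s)        # True
print(exact_u, any_case_u)  # 0 1
print(s2)                   # ASCII String
print(s)                    # Unicode String
```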

Other Useful Functions

  • astype(type code): Returns an array with the same elements coerced to the type indicated by the type code in Table [t4-1]. It is useful when data must be saved as a particular type.
A.astype('f')
↳ array([[-2., 2.],[-5., 5.]])
  • tolist(): Converts the array to an ordinary list with the same items.
A.tolist()
↳ [[-2, 2], [-5, 5]]
  • byteswap(): Swaps the bytes of the array elements and returns the byteswapped array. If the first argument is True, the bytes are swapped in place. Supported byte sizes are 1, 2, 4, or 8. It is useful when reading data from a file written on a machine with a different byte order. For details on machine dependency, refer to the numpy documentation on byte order. To convert data from big endian to little endian or vice versa, add byteswap() in the same line where 'fromfile' is used. If your data is made by big endian.
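The effect of byte order can be seen in a minimal sketch. To avoid depending on a data file, it uses np.frombuffer on four hand-written bytes instead of fromfile; this substitution, the dtype strings and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Four bytes holding the 32-bit integer 1 in big-endian order.
raw = b"\x00\x00\x00\x01"

# Option 1: state the byte order in the dtype ('>' means big endian).
big = np.frombuffer(raw, dtype=">i4")
value_big = int(big[0])

# Option 2: the bytes were read as little endian ('<'), which gives a
# wrong value; byteswap() reverses the bytes of each element.
wrong = np.frombuffer(raw, dtype="<i4")
swapped = wrong.byteswap()

print(value_big)        # 1
print(int(wrong[0]))    # 16777216
print(int(swapped[0]))  # 1
```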