Input/Output of files

This chapter explains how to read and write data in commonly used formats such as text (csv), binary, Excel, netCDF, MATLAB, and TIFF. The example data used on this page can be downloaded from here.

Text Files

Small datasets are often stored in a structured or unstructured text format. Python libraries are able to read these data files in several ways.

Plain Text

First, we will load data from a free format structured text file (e.g., ASCII data).

import numpy as np
a=np.loadtxt('example_io/dat_plain.txt', comments='#',delimiter=None, converters=None, skiprows=0, usecols=None)

loadtxt reads the data in the text file into a NumPy array. It raises an error if any value in the file cannot be converted to a number (float or integer).
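As a minimal, self-contained sketch of the call above, the snippet below writes a small whitespace-separated file to a temporary directory and reads it back. The file name sample_plain.txt and its contents are made up for illustration and are not part of the example data.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample file: two comment lines and three rows of numbers.
text = "# x y z\n# made-up values\n1 2.5 3\n4 5.5 6\n7 8.5 9\n"
path = os.path.join(tempfile.mkdtemp(), "sample_plain.txt")
with open(path, "w") as fh:
    fh.write(text)

# Lines starting with '#' are skipped; any whitespace separates the values.
a = np.loadtxt(path, comments="#")

print(a.shape)   # (3, 3)
print(a.dtype)   # float64
```

Integers in the file are converted to float as well, because loadtxt uses float as its default data type.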

Comma Separated Text

The data values in a text file are often separated by a special character such as a tab or a comma, with line breaks separating the rows. The separator is specified with the 'delimiter' option of loadtxt, so that it is not read as part of the data.

import numpy as np
from matplotlib.dates import datestr2num #converts the date strings in column 0 to numbers
a=np.loadtxt('example_io/dat_csv.csv', delimiter=',',converters={0: datestr2num})

A full list of loadtxt options is available in the NumPy documentation.
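The effect of 'delimiter', together with 'skiprows' and 'usecols', can be sketched on a small in-memory CSV. The column names and values below are hypothetical; loadtxt is given a file-like object here only to keep the example self-contained.

```python
import io

import numpy as np

# Hypothetical CSV content: a header row followed by comma-separated values.
csv_text = "station,temp,precip\n1,20.5,0.0\n2,18.0,3.2\n3,22.1,1.4\n"

# skiprows=1 skips the header row, which is not numeric.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# usecols selects columns by index, here only temp and precip.
temp_precip = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=(1, 2)
)

print(data.shape)         # (3, 3)
print(temp_precip.shape)  # (3, 2)
```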

Unstructured Text

If the text data is unstructured, extra work is needed to read the file and to process its contents before they can be stored as an array.

  • First, the file has to be opened as a file object:
a=open('example_io/dat_unstructured.txt')
type(a)
↳ _io.TextIOWrapper
a_list=a.readlines() #Extracting data from the file as a 'list'
a.seek(0) #Return to the start of the file before reading it again
a_str=a.read() #Extracting data from the file as a 'string'
- readlines() reads the contents of the file object 'a' line by line and puts the lines in the list a_list.

- read() reads the whole contents of the file object 'a' and stores them as a single string.

Each line of an ASCII file ends with special end-of-line characters. These characters need to be removed from each line/item of the data obtained with read or readlines.

- Drop the '\n' or '\r\n' sign at the end of each line:

- strip() is used to remove these characters:

- To drop it from each element of a_list:
b=[s.strip() for s in a_list]
- Furthermore, to convert each element into float:
b=[float(s.strip()) for s in a_list]
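The steps above can be combined into one small sketch. It assumes a made-up file with a title line, blank lines, and one value per line; the helper name to_floats is hypothetical. Lines that cannot be converted are skipped instead of raising an error, which is one possible choice among several.

```python
import os
import tempfile

# Hypothetical unstructured file: a title, blank lines, and one value per line.
path = os.path.join(tempfile.mkdtemp(), "sample_unstructured.txt")
with open(path, "w") as fh:
    fh.write("Station A daily values\n\n1.5\n2.25\n\nnot available\n4.0\n")


def to_floats(lines):
    """Strip each line and keep only those that parse as a float."""
    values = []
    for line in lines:
        item = line.strip()      # removes '\n', '\r\n' and surrounding spaces
        if not item:
            continue             # skip blank lines
        try:
            values.append(float(item))
        except ValueError:
            pass                 # skip text lines such as the title
    return values


with open(path) as fh:
    a_list = fh.readlines()

b = to_floats(a_list)
print(b)  # [1.5, 2.25, 4.0]
```

Opening the file in a with block closes it automatically once the lines have been read.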

Save Text File

  • To save an array 'a':
np.savetxt(filename, a, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ')

A full list of savetxt options is available in the NumPy documentation.
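A short round trip illustrates how the 'fmt', 'delimiter', and 'header' options interact with loadtxt. The file name and values are made up for this sketch.

```python
import os
import tempfile

import numpy as np

a = np.array([[1.0, 2.5], [3.0, 4.5]])
path = os.path.join(tempfile.mkdtemp(), "sample_out.txt")

# Write with three decimals, comma-separated, and a header line.
# savetxt prefixes the header with the comments string ('# ' by default).
np.savetxt(path, a, fmt="%.3f", delimiter=",", header="x,y")

with open(path) as fh:
    first_line = fh.readline().strip()

# The header line starts with '#', so loadtxt skips it as a comment.
b = np.loadtxt(path, delimiter=",")

print(first_line)  # # x,y
```

Note that values are rounded to the precision given in 'fmt', so a coarse format can lose information when the file is read back.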

Binary Data

Read Binary Data

The binary data format is used because it needs fewer bytes to store each value, which makes it efficient in terms of memory and disk space. This section explains how to read and write data in binary format using the NumPy function fromfile and the array method tofile. fromfile takes the following arguments:

  • filename is the name of the file.

  • type code: can be given as a type code (e.g., f) or a Python type (e.g., float), as shown in Table [t4-1]. It determines the size and byte order of the items in the binary file.

import numpy as np
dat=np.fromfile('example_io/dat_binary.float32','f')
dat
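Because a raw binary file carries no information about its own data type, the type code passed to fromfile decides how the bytes are interpreted. The sketch below, using a made-up file of four 32-bit floats, shows what happens when the type code does not match the file.

```python
import os
import tempfile

import numpy as np

# Hypothetical file holding four 32-bit floats (16 bytes in total).
path = os.path.join(tempfile.mkdtemp(), "sample.float32")
np.array([1.0, 2.0, 3.0, 4.0], dtype="f").tofile(path)

right = np.fromfile(path, dtype="f")      # 4 bytes per item -> 4 values
wrong = np.fromfile(path, dtype=float)    # 8 bytes per item -> 2 meaningless values

print(np.dtype("f").itemsize, np.dtype(float).itemsize)  # 4 8
print(right)        # [1. 2. 3. 4.]
print(wrong.size)   # 2
```

No error is raised in the second case, so the type code has to be known in advance from the data provider or the file documentation.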

Write Binary Data

  • To write/save all items (as machine values) of an array “A” to a file:
A.tofile('filename')
  • The data type can also be set explicitly before writing:
A.astype('f').tofile('filename')
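tofile stores only the values, so the shape of the array is lost along with the data type. A minimal round trip, assuming a hypothetical 2 x 3 array stored as little-endian 32-bit floats, could look like this:

```python
import os
import tempfile

import numpy as np

A = np.arange(6, dtype="float64").reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), "sample_2d.bin")

# Store as little-endian 32-bit floats; shape and dtype are not saved in the file.
A.astype("<f4").tofile(path)

# Reading back returns a flat array, so the shape has to be restored by hand.
B = np.fromfile(path, dtype="<f4").reshape(2, 3)

print(os.path.getsize(path))  # 24
```

Giving the byte order explicitly ('<' for little-endian, '>' for big-endian) keeps the file readable on machines with a different native byte order.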

NetCDF Data

Here, the handling of netCDF data using scipy.io functionality is shown. Recently, there have been significant developments in the xarray package, which is built around netCDF conventions and has clear advantages in dealing with netCDF data. We will cover xarray in detail separately.

Read NetCDF Data

NetCDF data files can be read by several packages such as Scientific, SciPy, and netCDF4. Below is an example of reading a netCDF file using the io module of SciPy.

from scipy.io import netcdf_file
ncfile=netcdf_file('example_io/dat_netCDF.nc')
ncfile.variables
dat=ncfile.variables['wbal_clim_CUM'][:]
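To show what can be inspected after opening a file, the sketch below first builds a tiny netCDF file and then reads it. The file name, the variable name 'temp', and its values are hypothetical and unrelated to the example data.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

# Build a small hypothetical file first so that the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample_read.nc")
with netcdf_file(path, "w") as nc:
    nc.createDimension("lat", 3)
    var = nc.createVariable("temp", "f", ("lat",))
    var[:] = [10.0, 11.5, 13.0]
    var.units = "degC"

# mmap=False loads the values into memory, so they stay valid after closing.
with netcdf_file(path, "r", mmap=False) as nc:
    names = list(nc.variables.keys())
    temp = nc.variables["temp"]
    dims = temp.dimensions      # names of the dimensions of this variable
    units = temp.units          # attributes are returned as bytes
    dat = temp[:].copy()

print(names, dims, units)  # ['temp'] ('lat',) b'degC'
```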

Write NetCDF Data

A short example of how to create netCDF data is below. For details, refer to the original Scipy help page.

from scipy.io import netcdf_file
import numpy as np
f = netcdf_file('simple.nc', 'w')
f.history = 'Created for a test'
f.createDimension('time', 10)
time = f.createVariable('time', 'i',('time',))
time[:] = np.arange(10)
time.units = 'days since 2008-01-01'
f.close()
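The same pattern extends to variables with more than one dimension. The following sketch, with a hypothetical variable 'flux' defined over 'time' and 'site', writes a file and reads it back to check what was stored.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

path = os.path.join(tempfile.mkdtemp(), "sample_write.nc")

f = netcdf_file(path, "w")
f.history = "Created for a test"
f.createDimension("time", 4)
f.createDimension("site", 2)
# 'd' is a 64-bit float; the dimension names fix the shape to (4, 2).
flux = f.createVariable("flux", "d", ("time", "site"))
flux[:] = np.arange(8.0).reshape(4, 2)
flux.units = "mm/day"
f.close()

# Read the file back to check what was stored.
with netcdf_file(path, "r", mmap=False) as g:
    history = g.history
    shape = g.variables["flux"].shape
    stored = g.variables["flux"][:].copy()

print(history, shape)  # b'Created for a test' (4, 2)
```

Dimensions have to be created before any variable that uses them, since createVariable looks up their lengths by name.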

Read MATLAB Data

MATLAB data files can be read using h5py, the Python interface to the HDF5 data format. This requires installation of the h5py package.

import h5py
a=h5py.File('example_io/dat_matlab.mat','r') #open in read-only mode
list(a.keys())
↳ ['#refs#', 'Results']
list(a['Results'].keys())
↳ ['SimpGWvD']
#The two lines below are equivalent ways of reading the same dataset
dat=a['Results/SimpGWvD/Default/ModelOutput/ETmod'][:]
dat=a['Results']['SimpGWvD']['Default']['ModelOutput']['ETmod'][:]
a.close()

MATLAB often has backward compatibility issues, and different versions can write .mat files with different internal structures and compression. The example above works with files saved in the v7.3 format (when the -v7.3 option is used in the save command of MATLAB), which is based on HDF5. For files saved in the older formats (v7 and earlier), scipy.io provides the loadmat function.
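For the older formats, a minimal sketch with scipy.io might look as follows. The file is created here with savemat so that the example is self-contained; the variable name ETmod is borrowed from the example above, but the values are made up.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

path = os.path.join(tempfile.mkdtemp(), "sample_v5.mat")

# savemat writes the older MATLAB 5 format by default, not the HDF5-based v7.3.
savemat(path, {"ETmod": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])})

# loadmat returns a dict; keys wrapped in double underscores hold file metadata.
mat = loadmat(path)
names = [k for k in mat if not k.startswith("__")]
dat = mat["ETmod"]

print(names, dat.shape)  # ['ETmod'] (2, 3)
```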

Excel Data

There are several Python libraries to read and write Excel (xlsx) files. The following example shows how to parse an Excel sheet into a pandas dataframe.

Read Excel Data

import pandas as pd
ex_f='example_io/dat_xls.xlsx'
xl = pd.ExcelFile(ex_f)
print(xl.sheet_names)
df = xl.parse('Belleville_96-pr',header=None)

An even more straightforward way is to import a sheet directly and convert it into a proper pandas dataframe with read_excel:

import pandas as pd
ex_f='example_io/dat_xls.xlsx'
sheetName='Belleville_96-pr'
headNames=['date','data']
df = pd.read_excel(ex_f, header=None, sheet_name=sheetName, names=headNames)

Write Excel Data

A pandas dataframe can also easily be written to excel using:

df.to_excel('excel_out.xlsx')

TIFF Data

from PIL import Image
im = Image.open('example_io/dat_tiff.tif')

# im.show()
import numpy as np
im2array = np.array(im) #change image array to numpy array
array2im = Image.fromarray(im2array) # change numpy array to image
array2im.save('tiff_out.tif')
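The conversion in both directions can be checked with a small synthetic image. The sketch below assumes an 8-bit grayscale array made up for illustration and writes it to a temporary file rather than using the example data.

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Hypothetical 8-bit grayscale image: a 4 x 6 gradient.
arr = np.arange(24, dtype=np.uint8).reshape(4, 6)
path = os.path.join(tempfile.mkdtemp(), "sample.tif")

Image.fromarray(arr).save(path)

with Image.open(path) as im:
    size = im.size          # (width, height), i.e. (6, 4)
    mode = im.mode          # 'L' for 8-bit grayscale
    back = np.array(im)     # shape is (height, width), i.e. (4, 6)

print(size, mode, back.shape)  # (6, 4) L (4, 6)
```

Note that the image size is reported as (width, height), while the shape of the NumPy array is (rows, columns).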