Pandas is a Python data analysis tool built on top of NumPy that provides a suite of data structures and data manipulation functions to work on those data structures. It is particularly well suited for working with time series data.
Installing with pip or apt-get::
pip install pandas
# or
sudo apt-get install python-pandas
Dependencies: required, recommended, and optional::
# Required
numpy, python-dateutil, pytz
# Recommended
numexpr, bottleneck
# Optional
cython, scipy, pytables, matplotlib, statsmodels, openpyxl
Pandas works alongside and is built on top of several other Python projects:
IPython
NumPy
Matplotlib
IPython is a fancy Python console. Try running ipython or ipython --pylab on your command line. Some IPython tips:
# Special commands, 'magic functions', begin with %
%quickref, %who, %run, %reset
# Shell Commands
ls, cd, pwd, mkdir
# Need Help?
help(), help(obj), obj?, function?
# Tab completion of variables, attributes and methods
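For example, a few of these in an interactive IPython session (a minimal sketch; the variable x is just illustrative):
x = 5
%who        # list the variables defined in this session
x?          # show type, docs, and details for x
%quickref   # open IPython's quick reference card
pwd         # shell commands work right from the prompt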
There is a web interface to IPython, known as the IPython Notebook; start it like this:
ipython notebook
# or to get all of the pylab components
ipython notebook --pylab
Follow along by connecting to TMPNB.ORG!
import numpy as np
# np.zeros, np.ones
data0 = np.zeros((2, 4))
data0
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
# Make an array with 20 entries 0..19
data1 = np.arange(20)
# print the first 8
data1[0:8]
array([0, 1, 2, 3, 4, 5, 6, 7])
# reshape it into a 4x5 array
data = np.arange(20).reshape(4, 5)
data
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
Arrays have NumPy-specific types (dtypes) and can be operated on.
print("dtype: ", data.dtype)
result = data * 20.5
print(result)
('dtype: ', dtype('int64'))
[[ 0. 20.5 41. 61.5 82. ]
[ 102.5 123. 143.5 164. 184.5]
[ 205. 225.5 246. 266.5 287. ]
[ 307.5 328. 348.5 369. 389.5]]
In my code samples, assume I import the following
import pandas as pd
import numpy as np
A Series is dict-like.
s1 = pd.Series([1, 2, 3, 4, 5])
s1
0 1
1 2
2 3
3 4
4 5
dtype: int64
# integer multiplication
print(s1 * 5)
0 5
1 10
2 15
3 20
4 25
dtype: int64
# float multiplication
print(s1 * 5.0)
0 5
1 10
2 15
3 20
4 25
dtype: float64
s2 = pd.Series([1, 2, 3, 4, 5],
index=['a', 'b', 'c', 'd', 'e'])
s2
a 1
b 2
c 3
d 4
e 5
dtype: int64
A quick aside …
dates = pd.date_range('20130626', periods=5)
print(dates)
print()
print(dates[0])
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-26, ..., 2013-06-30]
Length: 5, Freq: D, Timezone: None
()
2013-06-26 00:00:00
s3 = pd.Series([1, 2, 3, 4, 5], index=dates)
print(s3)
2013-06-26 1
2013-06-27 2
2013-06-28 3
2013-06-29 4
2013-06-30 5
Freq: D, dtype: int64
Note that the integer index is retained along with the new date index.
print(s3[0])
print(type(s3[0]))
print()
print(s3[1:3])
print(type(s3[1:3]))
1
<type 'numpy.int64'>
()
2013-06-27 2
2013-06-28 3
Freq: D, dtype: int64
<class 'pandas.core.series.Series'>
s3[s3 < 3]
2013-06-26 1
2013-06-27 2
Freq: D, dtype: int64
s3['20130626':'20130628']
2013-06-26 1
2013-06-27 2
2013-06-28 3
Freq: D, dtype: int64
Things not covered but you should look into:
dict
NaN
Other Series Attributes:
index
- index.name
name
- Series name
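A tiny sketch of a few of those (the values are made up, just for illustration):
s4 = pd.Series({'b': 2, 'a': 1, 'c': np.nan})  # build a Series from a dict
s4.name = 'example'        # a Series can carry a name
s4.index.name = 'letters'  # and so can its index
s4.isnull()                # True wherever the value is NaN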
A DataFrame is a dict of Series objects.
data1 = pd.DataFrame(np.random.rand(4, 4))
data1
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.002581 | 0.851980 | 0.097265 | 0.648841 |
1 | 0.732965 | 0.820690 | 0.895176 | 0.582483 |
2 | 0.176504 | 0.068942 | 0.466759 | 0.918777 |
3 | 0.938426 | 0.097954 | 0.696534 | 0.684424 |
dates = pd.date_range('20130626', periods=4)
data2 = pd.DataFrame(
np.random.rand(4, 4),
index=dates, columns=list('ABCD'))
data2
| | A | B | C | D |
---|---|---|---|---|
2013-06-26 | 0.831222 | 0.209279 | 0.340186 | 0.928447 |
2013-06-27 | 0.252513 | 0.452392 | 0.862822 | 0.738837 |
2013-06-28 | 0.309083 | 0.822918 | 0.924720 | 0.964607 |
2013-06-29 | 0.827998 | 0.539519 | 0.248369 | 0.377682 |
data2['E'] = data2['B'] + 5 * data2['C']
data2
| | A | B | C | D | E |
---|---|---|---|---|---|
2013-06-26 | 0.831222 | 0.209279 | 0.340186 | 0.928447 | 1.910210 |
2013-06-27 | 0.252513 | 0.452392 | 0.862822 | 0.738837 | 4.766505 |
2013-06-28 | 0.309083 | 0.822918 | 0.924720 | 0.964607 | 5.446516 |
2013-06-29 | 0.827998 | 0.539519 | 0.248369 | 0.377682 | 1.781366 |
See? You never need Excel again!
Deleting a column.
# Deleting a Column
del data2['E']
data2
| | A | B | C | D |
---|---|---|---|---|
2013-06-26 | 0.831222 | 0.209279 | 0.340186 | 0.928447 |
2013-06-27 | 0.252513 | 0.452392 | 0.862822 | 0.738837 |
2013-06-28 | 0.309083 | 0.822918 | 0.924720 | 0.964607 |
2013-06-29 | 0.827998 | 0.539519 | 0.248369 | 0.377682 |
Remember this, data2, for the next examples.
data2
| | A | B | C | D |
---|---|---|---|---|
2013-06-26 | 0.831222 | 0.209279 | 0.340186 | 0.928447 |
2013-06-27 | 0.252513 | 0.452392 | 0.862822 | 0.738837 |
2013-06-28 | 0.309083 | 0.822918 | 0.924720 | 0.964607 |
2013-06-29 | 0.827998 | 0.539519 | 0.248369 | 0.377682 |
As a dict
data2['B']
2013-06-26 0.209279
2013-06-27 0.452392
2013-06-28 0.822918
2013-06-29 0.539519
Freq: D, Name: B, dtype: float64
As an attribute
data2.B
2013-06-26 0.209279
2013-06-27 0.452392
2013-06-28 0.822918
2013-06-29 0.539519
Freq: D, Name: B, dtype: float64
By row label
data2.loc['20130627']
A 0.252513
B 0.452392
C 0.862822
D 0.738837
Name: 2013-06-27 00:00:00, dtype: float64
By integer location
data2.iloc[1]
A 0.252513
B 0.452392
C 0.862822
D 0.738837
Name: 2013-06-27 00:00:00, dtype: float64
Access the column first and then the row, or use iloc with [row, column] indexes.
print(data2.B[0])
print(data2['B'][0])
print(data2.iloc[0,1]) # [row,column]
0.209279059059
0.209279059059
0.209279059059
Look at the beginning of the DataFrame.
data3 = pd.DataFrame(np.random.rand(100, 4))
data3.head()
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.796264 | 0.332496 | 0.860904 | 0.488276 |
1 | 0.405906 | 0.309003 | 0.159129 | 0.597427 |
2 | 0.107366 | 0.791943 | 0.080191 | 0.187125 |
3 | 0.176196 | 0.931741 | 0.742967 | 0.953014 |
4 | 0.567175 | 0.673101 | 0.069275 | 0.208249 |
Look at the end of the DataFrame.
data3.tail()
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
95 | 0.175699 | 0.187918 | 0.407732 | 0.441582 |
96 | 0.638801 | 0.264603 | 0.210135 | 0.721955 |
97 | 0.947213 | 0.674040 | 0.087639 | 0.240926 |
98 | 0.220907 | 0.309761 | 0.659022 | 0.894547 |
99 | 0.452450 | 0.365101 | 0.043229 | 0.911712 |
Just remember, a DataFrame is just a bunch of Series grouped together:
Series
- a 1-dimensional labeled array
DataFrame
- a dict of Series objects
Panel
- like a DataFrame but with 3 or more dimensions
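For example, two Series grouped into a DataFrame (a small sketch with made-up temperatures):
highs = pd.Series([105.1, 107.6, 102.9], index=pd.date_range('20130626', periods=3))
lows = pd.Series([80.2, 82.5, 79.0], index=pd.date_range('20130626', periods=3))
pd.DataFrame({'highs': highs, 'lows': lows})  # each Series becomes a column, aligned on the index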
Robust IO tools to read in data from a variety of sources
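For instance, alongside read_csv() there are readers for other formats (a sketch; the file names here are hypothetical):
pd.read_csv('data.csv')               # comma-separated text
pd.read_table('data.tsv')             # tab- or otherwise-delimited text
pd.read_excel('data.xlsx', 'Sheet1')  # Excel files, needs openpyxl/xlrd
pd.read_json('data.json')             # JSON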
The CSV file (phx-temps.csv) contains Phoenix weather data from GSOD::
1973-01-01 00:00:00,53.1,37.9
1973-01-02 00:00:00,57.9,37.0
...
2012-12-30 00:00:00,64.9,39.0
2012-12-31 00:00:00,55.9,41.0
Simple read_csv()
# simple read_csv
phxtemps1 = pd.read_csv('phx-temps.csv')
phxtemps1.head()
| | 1973-01-01 00:00:00 | 53.1 | 37.9 |
---|---|---|---|
0 | 1973-01-02 00:00:00 | 57.9 | 37.0 |
1 | 1973-01-03 00:00:00 | 59.0 | 37.0 |
2 | 1973-01-04 00:00:00 | 57.9 | 41.0 |
3 | 1973-01-05 00:00:00 | 54.0 | 39.9 |
4 | 1973-01-06 00:00:00 | 55.9 | 37.9 |
Advanced read_csv(): parsing the dates, using them as the index, and naming the columns.
# define index, parse dates, name columns
phxtemps2 = pd.read_csv(
'phx-temps.csv', index_col=0,
names=['highs', 'lows'], parse_dates=True)
phxtemps2.head()
| | highs | lows |
---|---|---|
1973-01-01 | 53.1 | 37.9 |
1973-01-02 | 57.9 | 37.0 |
1973-01-03 | 59.0 | 37.0 |
1973-01-04 | 57.9 | 41.0 |
1973-01-05 | 54.0 | 39.9 |
import matplotlib.pyplot as plt
%matplotlib inline
phxtemps2.plot() # pandas convenience method
<matplotlib.axes._subplots.AxesSubplot at 0x7fca44e8e550>
Boo, Pandas and Friends would cry if they saw such a plot.
Let's see a smaller slice of time:
phxtemps2['20120101':'20121231'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fca3c66f1d0>
Let's operate on the DataFrame… let's take the difference between the highs and lows.
phxtemps2['diff'] = phxtemps2.highs - phxtemps2.lows
phxtemps2['20120101':'20121231'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fca3c6ba650>