Sunday, October 30, 2016
python pandas data analysis library quickstart introduction 1 read data pandas IO 2 database style merge inner join indexing on date time 3 write data
python pandas data analysis library quickstart introduction 1 read data pandas IO 2 database style merge inner join indexing on date time 3 write data
I discovered python pandas recently.
Using to read in ifconfig logs and add up total network traffic (VM network) across multiple hosts.
I needed to combine files from multiple hosts, sort values and combine by date and time.
I was sorting and parsing into python dicts.
I looked at perl PDL and other data manipulation libs.
But looks like python pandas wins because it combines VERY FLEXIBLE file IO with a lot of methods for selecting rows/columns and manipulating the data.
I made a simple test to make sure I knew how it all worked and described it here:
http://stackoverflow.com/questions/19222043/parse-two-files-and-merge-lines-if-time-stamp-matches/33100756#33100756
A flexible and general way of manipulating data is python pandas. Worth mentioning here as it really is the right tool for the job. Allows spreadsheet or database style merges/joins/concats on selected index rows or columns.
Two example files to illustrate how it works
$ cat File1
date0,time0,data01,data02,data03
date1,time1,data11,data12,data13
date2,time2,data21,data22,data23
date3,time3,data31,data32,data33
date4,time4,data41,data42,data43
date5,time5,data51,data52,data53
$ cat File2
date1,time1,data14
date4,time4,data44
date2,time2,data24
Run python . . .
- Use pandas read_csv to slurp in files in pandas table structure. (read_csv is very clever and can read in many formats not just csv)
- Use pandas merge to do inner(intersection of indices) join, using date+time as indices (index list=[0,1]).
- Use pandas to_csv to write output.
THE IMPORTANT BIT:
$ python
>>> from pandas import merge, read_csv
>>> f1=read_csv("File1",header=None)
>>> f2=read_csv("File2",header=None)
>>> merged = merge(f1, f2, how=inner, left_on=[0,1], right_on=[0,1])
>>> merged.to_csv("Out", na_rep=0, index=False, header=False)
>>> [Ctrl-D]
Job done!
$ cat Out
date1,time1,data11,data12,data13,data14
date2,time2,data21,data22,data23,data24
date4,time4,data41,data42,data43,data44
Easy as 1, 2, 3.
1. read data (pandas IO)
2. database style merge (inner join indexing on date+time)
3. write data.
VERY clean, no messing. I really do love bash/grep/sed/awk also perl and python manipulating data in structures BUT right tool for the job makes the job much easier and gives much more potential for use of the data.
Breakdown:
1. read_csv A bog-standard(plain, unadorned) read_csv("File1") treats first line as header names. So we use header=None.
>>> f1=read_csv("File1")
>>> f1
date0 time0 data01 data02 data03
0 date1 time1 data11 data12 data13
1 date2 time2 data21 data22 data23
2 date3 time3 data31 data32 data33
3 date4 time4 data41 data42 data43
4 date5 time5 data51 data52 data53
>>> f1=read_csv("File1",header=None)
>>> f1
0 1 2 3 4
0 date0 time0 data01 data02 data03
1 date1 time1 data11 data12 data13
2 date2 time2 data21 data22 data23
3 date3 time3 data31 data32 data33
4 date4 time4 data41 data42 data43
5 date5 time5 data51 data52 data53
>>> f2=read_csv("File2",header=None)
pandas DataFrame describe() gives a useful summary especially for big tables. For numeric data you also get total, max, min, mean, e.t.c.
>>> f1.describe()
0 1Go to link Download