Lecture 9. Modules - pandas
Matthias Bieg
pandas: Intro
What is pandas?
pandas is a module that provides data containers similar to the data frames in R.
The most important data containers are pandas.Series and pandas.DataFrame.
In addition to the containers, pandas offers a range of analysis tools as well as plotting functionality.
pandas.Series
A Series is a one-dimensional, list-like container.
The elements can be given identifiers (index labels), which allows intuitive access to the entries.
The individual elements can be of different types (integer, float, string, etc.).

idx1    idx2    ...    idxm-1    idxm
d1      d2      ...    dm-1      dm
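The properties listed above can be tried out in a few lines. This is only a minimal sketch: the height values and the labels "a", "b", "c" are made up for illustration, and the alias pnd follows the convention used in the rest of the lecture.

```python
import pandas as pnd

# Hypothetical example data: three body heights in cm, labelled by individual.
heights = pnd.Series([182.3, 174.4, 183.8],
                     index=["Individual1", "Individual2", "Individual3"])

# Access one element by its label and one by its position.
by_label = heights["Individual2"]
by_position = heights.iloc[0]

# Elements may have different types; pandas then stores them with dtype "object".
mixed = pnd.Series([1, 2.5, "three"], index=["a", "b", "c"])

print(by_label, by_position, mixed.dtype)
```

Access by label is usually the more readable choice, while iloc is useful when only the position is known.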
pandas.DataFrame
A DataFrame is a two-dimensional matrix.
The rows and columns can be given identifiers (labels), which allows intuitive access to individual elements.
The individual elements can be of different types (integer, float, string, etc.).

          Col1      Col2      ...    Coln-1      Coln
idx1      d1,1      d1,2      ...    d1,n-1      d1,n
idx2      d2,1      d2,2      ...    d2,n-1      d2,n
...       ...       ...       ...    ...         ...
idxm-1    dm-1,1    dm-1,2    ...    dm-1,n-1    dm-1,n
idxm      dm,1      dm,2      ...    dm,n-1      dm,n
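As a small illustration of the layout sketched above, the following builds a 2 x 3 DataFrame with row and column labels. The values are invented for the example; the point is only that every column carries its own type.

```python
import pandas as pnd

# Hypothetical 2 x 3 example using the row/column labels from the diagram.
df = pnd.DataFrame({"Col1": [1.0, 2.0],
                    "Col2": ["a", "b"],
                    "Col3": [10, 20]},
                   index=["idx1", "idx2"])

print(df.shape)                 # (number of rows, number of columns)
print(df.loc["idx2", "Col2"])   # element in row idx2, column Col2
print(df.dtypes)                # one dtype per column
```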
pandas: Object Creation
Creating a Series

In [12]:
import numpy as np
import pandas as pnd

mu = 181
sigma = 10.

# List of random body heights
male_heights = [ sigma * i + mu for i in np.random.randn(10) ]

# List of index names
index = [ "Individual"+str(i) for i in range(1, 11) ]

# Create pandas.Series object
male_heights_series = pnd.Series(male_heights, index=index)
print(male_heights_series)

Individual1     182.305315
Individual2     174.399807
Individual3     183.832186
Individual4     190.007618
Individual5     175.869070
Individual6     182.522929
Individual7     169.786862
Individual8     185.835162
Individual9     173.510661
Individual10    176.082611
dtype: float64
Create a DataFrame
By passing a numpy array

In [20]:
import numpy as np
import pandas as pnd

# Create a list of lists with random elements
mu = 181
sigma = 10.
l = [ [ j * sigma + mu+offset for j in np.random.randn(5) ] for offset in range(8) ]

# Define column and row identifiers
columns = [ "Individual"+str(i) for i in range(1, 6) ]
index = [ "group"+str(j) for j in range(8) ]

# Create DataFrame
male_heights_dataframe = pnd.DataFrame(np.array(l), columns=columns, index=index)
print(male_heights_dataframe)

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218
By reading a tab-separated file
The pandas module provides a function for reading tab-separated files into pandas.DataFrame objects: pandas.read_csv

In [49]:
import pandas as pnd

male_heights_filename = "data/male_heights.csv"

# Read CSV file into pandas.DataFrame
male_heights_dataframe = pnd.read_csv(male_heights_filename, sep="\t", index_col=0)
print(male_heights_dataframe)

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218
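To see what sep="\t" and index_col=0 do without needing the file from the lecture, the same call can be run on an in-memory text. The two-row table below is a made-up stand-in for data/male_heights.csv, not its real content.

```python
import io
import pandas as pnd

# Stand-in for a tab-separated file; the first column holds the row labels.
tsv_text = (
    "\tIndividual1\tIndividual2\n"
    "group0\t177.4\t187.5\n"
    "group1\t179.6\t178.1\n"
)

# sep="\t" splits fields at tabs, index_col=0 uses the first column as row index.
df = pnd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
print(df)

# Round trip: write the DataFrame back out in the same tab-separated layout.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t")
```

Without index_col=0 the group names would end up as an ordinary data column and the rows would be numbered 0, 1, ... instead.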
pandas: View Data
Top and bottom of DataFrame

In [32]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# View top of DataFrame
print(df.head(3))

# View bottom of DataFrame
print(df.tail(2))

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
        Individual1  Individual2  Individual3  Individual4  Individual5
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218
Index, columns, values

In [34]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Inspect row identifiers
print(df.index)

# Inspect column identifiers
print(df.columns)

# Inspect values
print(df.values)

Index([u'group0', u'group1', u'group2', u'group3', u'group4', u'group5', u'group6', u'group7'], dtype='object')
Index([u'Individual1', u'Individual2', u'Individual3', u'Individual4', u'Individual5'], dtype='object')
[[ 177.38369094  187.51790556  182.94505876  178.02050731  169.23207042]
 [ 179.62313257  178.1382207   188.52909722  185.31299054  181.73872603]
 [ 178.89488799  182.96443159  181.89876004  183.02562885  188.74242018]
 [ 188.774235    179.00772279  183.76528318  168.77572825  171.61521283]
 [ 186.4965667   193.17650011  190.91908795  176.54719711  176.04860419]
 [ 196.1324217   183.594067    196.30271347  184.40160934  195.70781803]
 [ 200.64153809  172.98388189  189.12761056  190.97056472  204.08139213]
 [ 189.99219434  191.15260754  184.54647267  200.40481373  177.09921787]]
Summary statistics of DataFrame columns

In [41]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Get summary statistics of DataFrame columns
print(df.describe())

       Individual1  Individual2  Individual3  Individual4  Individual5
count     8.000000     8.000000     8.000000     8.000000     8.000000
mean    187.242333   183.566917   187.254260   183.432380   183.033183
std       8.400356     6.846464     4.887240     9.569601    12.205277
min     177.383691   172.983882   181.898760   168.775728   169.232070
25%     179.441071   178.790347   183.560227   177.652180   174.940256
50%     187.635401   183.279249   186.537785   183.713619   179.418972
75%     191.527251   188.426581   189.575480   186.727384   190.483770
max     200.641538   193.176500   196.302713   200.404814   204.081392

Summary statistics of DataFrame rows

In [43]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Get summary statistics of DataFrame rows (transpose first, then describe)
print(df.T.describe())

           group0      group1      group2      group3      group4      group5  \
count    5.000000    5.000000    5.000000    5.000000    5.000000    5.000000
mean   179.019847  182.668433  183.105226  178.387636  184.637591  191.227726
std      6.839235    4.242703    3.570246    8.298185    7.985160    6.609678
min    169.232070  178.138221  178.894888  168.775728  176.048604  183.594067
25%    177.383691  179.623133  181.898760  171.615213  176.547197  184.401609
50%    178.020507  181.738726  182.964432  179.007723  186.496567  195.707818
75%    182.945059  185.312991  183.025629  183.765283  190.919088  196.132422
max    187.517906  188.529097  188.742420  188.774235  193.176500  196.302713

           group6      group7
count    5.000000    5.000000
mean   191.560997  188.639061
std     12.151087    8.609574
min    172.983882  177.099218
25%    189.127611  184.546473
50%    190.970565  189.992194
75%    200.641538  191.152608
max    204.081392  200.404814
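The difference between describing columns and describing rows is easiest to check on a table small enough to compute by hand. The sketch below uses invented heights (2 groups x 3 individuals), not the lecture data.

```python
import pandas as pnd

# Hypothetical stand-in for the height table (2 groups x 3 individuals).
df = pnd.DataFrame({"Individual1": [170.0, 180.0],
                    "Individual2": [175.0, 185.0],
                    "Individual3": [180.0, 190.0]},
                   index=["group0", "group1"])

per_column = df.describe()     # one summary column per DataFrame column
per_row = df.T.describe()      # transpose first: one summary column per row

print(per_column.loc["mean"])  # means of Individual1 .. Individual3
print(per_row.loc["mean"])     # means of group0 and group1
```

The result of describe() is itself a DataFrame, so single statistics can be picked out with loc as shown.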
Sort by column values

In [48]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

print(df.sort(columns="Individual1"))

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group2   178.894888   182.964432   181.898760   183.025629   188.742420
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group7   189.992194   191.152608   184.546473   200.404814   177.099218
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392

/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:5: FutureWarning: sort(columns=...) is deprecated, use sort_values(by=...)

As the warning states, DataFrame.sort is deprecated; DataFrame.sort_values(by=...) is the replacement.
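The replacement named in the FutureWarning, sort_values(by=...), can be sketched as follows. The three-row table is an invented excerpt, and the ascending=False variant is an addition for illustration that does not appear in the lecture.

```python
import pandas as pnd

# Hypothetical excerpt of the height table, deliberately unsorted.
df = pnd.DataFrame({"Individual1": [188.8, 177.4, 196.1],
                    "Individual2": [179.0, 187.5, 183.6]},
                   index=["group3", "group0", "group5"])

# Sort rows by one column, ascending (the default).
ascending = df.sort_values(by="Individual1")

# Sort descending; the original DataFrame is left unchanged in both cases.
descending = df.sort_values(by="Individual1", ascending=False)

print(ascending)
print(descending)
```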
Summary

Method                                  Meaning
DataFrame.head(x)                       Returns the first x rows
DataFrame.tail(x)                       Returns the last x rows
DataFrame.index                         Returns the row identifiers
DataFrame.columns                       Returns the column identifiers
DataFrame.values                        Returns the values of the DataFrame
DataFrame.describe()                    Returns summary statistics for the individual columns
DataFrame.sort(columns=[c1, c2,...])    Returns a sorted DataFrame (deprecated; use DataFrame.sort_values(by=[c1, c2,...]))
Data Access
Getting

In [58]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Select a single column
print(df["Individual2"])

group0    187.517906
group1    178.138221
group2    182.964432
group3    179.007723
group4    193.176500
group5    183.594067
group6    172.983882
group7    191.152608
Name: Individual2, dtype: float64

In [56]:
# Selecting via [] with a slice, which slices the rows
print(df[0:3])

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
Selection by Label

In [61]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Getting a row
print(df.loc["group1"])

Individual1    179.623133
Individual2    178.138221
Individual3    188.529097
Individual4    185.312991
Individual5    181.738726
Name: group1, dtype: float64

In [63]:
# Getting selected columns of all rows
print(df.loc[:, ["Individual2", "Individual4"]])

        Individual2  Individual4
group0   187.517906   178.020507
group1   178.138221   185.312991
group2   182.964432   183.025629
group3   179.007723   168.775728
group4   193.176500   176.547197
group5   183.594067   184.401609
group6   172.983882   190.970565
group7   191.152608   200.404814
In [65]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Slice rows by row ids
print(df.loc["group3": "group6", :])

        Individual1  Individual2  Individual3  Individual4  Individual5
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392

In [66]:
# Accessing a single cell
print(df.loc["group3", "Individual2"])

179.007722788
Selection by Position

In [67]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Select a row via its integer position
print(df.iloc[3])

Individual1    188.774235
Individual2    179.007723
Individual3    183.765283
Individual4    168.775728
Individual5    171.615213
Name: group3, dtype: float64

In [70]:
# Slice rows and columns
print(df.iloc[2:4, 1:3])

        Individual2  Individual3
group2   182.964432   181.898760
group3   179.007723   183.765283
In [72]:
# Select by integer positions of rows and columns
print(df.iloc[[0, 3, 4], [1, 2]])

        Individual2  Individual3
group0   187.517906   182.945059
group3   179.007723   183.765283
group4   193.176500   190.919088

In [74]:
# Slice rows explicitly
print(df.iloc[0:3, :])

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420

In [75]:
# Slice columns explicitly
print(df.iloc[:, 0:3])

        Individual1  Individual2  Individual3
group0   177.383691   187.517906   182.945059
group1   179.623133   178.138221   188.529097
group2   178.894888   182.964432   181.898760
group3   188.774235   179.007723   183.765283
group4   186.496567   193.176500   190.919088
group5   196.132422   183.594067   196.302713
group6   200.641538   172.983882   189.127611
group7   189.992194   191.152608   184.546473
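One detail visible in the outputs above is worth isolating: a label slice with loc includes its end label, whereas a position slice with iloc excludes its end position. A minimal sketch on an invented four-row table:

```python
import pandas as pnd

# Hypothetical excerpt of the height table.
df = pnd.DataFrame({"Individual1": [177.4, 179.6, 178.9, 188.8],
                    "Individual2": [187.5, 178.1, 183.0, 179.0]},
                   index=["group0", "group1", "group2", "group3"])

# Label slice: the end label "group2" is included.
by_label = df.loc["group1":"group2", :]

# Position slice: the end position 2 is excluded, as in ordinary Python slicing.
by_position = df.iloc[1:2, :]

print(list(by_label.index))
print(list(by_position.index))
```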
Boolean Indexing

In [81]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

# Using the values of a single column to select rows
print(df[df["Individual1"] > 180])

        Individual1  Individual2  Individual3  Individual4  Individual5
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218

In [82]:
# Get all values fulfilling a criterion; all other cells become NaN
print(df[df > 190])

        Individual1  Individual2  Individual3  Individual4  Individual5
group0          NaN          NaN          NaN          NaN          NaN
group1          NaN          NaN          NaN          NaN          NaN
group2          NaN          NaN          NaN          NaN          NaN
group3          NaN          NaN          NaN          NaN          NaN
group4          NaN   193.176500   190.919088          NaN          NaN
group5   196.132422          NaN   196.302713          NaN   195.707818
group6   200.641538          NaN          NaN   190.970565   204.081392
group7          NaN   191.152608          NaN   200.404814          NaN
In [88]:
# Using the isin() method for filtering
df2 = df.copy().loc[:, "Individual1":"Individual4"]

# Append a new column with subgroup information to the end of the DataFrame
df2["Subgroup"] = ["one", "two", "two", "three", "two", "one", "three", "one"]
print(df2)

        Individual1  Individual2  Individual3  Individual4 Subgroup
group0   177.383691   187.517906   182.945059   178.020507      one
group1   179.623133   178.138221   188.529097   185.312991      two
group2   178.894888   182.964432   181.898760   183.025629      two
group3   188.774235   179.007723   183.765283   168.775728    three
group4   186.496567   193.176500   190.919088   176.547197      two
group5   196.132422   183.594067   196.302713   184.401609      one
group6   200.641538   172.983882   189.127611   190.970565    three
group7   189.992194   191.152608   184.546473   200.404814      one

In [89]:
print(df2[df2["Subgroup"].isin(["one", "three"])])

        Individual1  Individual2  Individual3  Individual4 Subgroup
group0   177.383691   187.517906   182.945059   178.020507      one
group3   188.774235   179.007723   183.765283   168.775728    three
group5   196.132422   183.594067   196.302713   184.401609      one
group6   200.641538   172.983882   189.127611   190.970565    three
group7   189.992194   191.152608   184.546473   200.404814      one
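Boolean masks like the ones above can also be combined. The following sketch goes slightly beyond the slides: the operators & (and) and ~ (not) are standard pandas behaviour, while the small table is invented for the example.

```python
import pandas as pnd

# Hypothetical excerpt with one numeric and one categorical column.
df = pnd.DataFrame({"Individual1": [177.4, 188.8, 196.1, 200.6],
                    "Subgroup": ["one", "three", "one", "three"]},
                   index=["group0", "group3", "group5", "group6"])

# Each condition yields a boolean Series. Parentheses are required because
# & and | bind more tightly than the comparison operators.
tall_and_one = df[(df["Individual1"] > 180) & (df["Subgroup"] == "one")]

# ~ negates a mask, here: every row whose subgroup is NOT "one".
not_one = df[~df["Subgroup"].isin(["one"])]

print(tall_and_one)
print(not_one)
```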
Setting Values

In [102]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0).loc[:, "Individual1":"Individual3"]

# Set a new column using a pandas.Series
heights_individual4 = [183., 182., 170., 172., 165., 185., 188., 190.]
individual4_series = pnd.Series(heights_individual4, index=["group"+str(i) for i in range(8)])
df["Individual4"] = individual4_series
print(df)

        Individual1  Individual2  Individual3  Individual4
group0   177.383691   187.517906   182.945059          183
group1   179.623133   178.138221   188.529097          182
group2   178.894888   182.964432   181.898760          170
group3   188.774235   179.007723   183.765283          172
group4   186.496567   193.176500   190.919088          165
group5   196.132422   183.594067   196.302713          185
group6   200.641538   172.983882   189.127611          188
group7   189.992194   191.152608   184.546473          190
In [103]:
# Setting values by label
df.loc["group0", "Individual1"] = 5

# Setting values by position
df.iloc[1, 1] = 6
print(df)

        Individual1  Individual2  Individual3  Individual4
group0     5.000000   187.517906   182.945059          183
group1   179.623133     6.000000   188.529097          182
group2   178.894888   182.964432   181.898760          170
group3   188.774235   179.007723   183.765283          172
group4   186.496567   193.176500   190.919088          165
group5   196.132422   183.594067   196.302713          185
group6   200.641538   172.983882   189.127611          188
group7   189.992194   191.152608   184.546473          190

In [104]:
# Setting values by boolean indexing
df[df > 180] = 10.
print(df)

        Individual1  Individual2  Individual3  Individual4
group0     5.000000    10.000000           10           10
group1   179.623133     6.000000           10           10
group2   178.894888    10.000000           10          170
group3    10.000000   179.007723           10          172
group4    10.000000    10.000000           10          165
group5    10.000000    10.000000           10           10
group6    10.000000   172.983882           10           10
group7    10.000000    10.000000           10           10
Summary

Operator                        Meaning
DataFrame[string]               Returns the column with the label string
DataFrame[integer]              Returns the column with the label integer (a label lookup, not a position; use iloc for positions)
DataFrame[integer1:integer2]    Returns a slice of the rows from position integer1 up to, but not including, integer2
DataFrame.loc[x, y]             Returns the columns with labels in y from the rows with labels in x. Here x and y are scalar labels, lists of labels, or label slices
DataFrame.iloc[x, y]            Returns the columns at the positions in y from the rows at the positions in x. Here x and y are scalar integers, lists of integers, or integer slices
DataFrame[condition]            With a boolean Series as condition: returns the rows for which the condition holds. With a boolean DataFrame as condition: returns the complete DataFrame, but sets all values that do not fulfil the condition to NaN
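The behaviour of the plain [] operator is the part of this table that is most easily confused, so here is a short check on an invented three-row table. It is meant as a sketch of the rules above, not as additional lecture material.

```python
import pandas as pnd

# Hypothetical excerpt of the height table.
df = pnd.DataFrame({"Individual1": [177.4, 179.6, 178.9],
                    "Individual2": [187.5, 178.1, 183.0]},
                   index=["group0", "group1", "group2"])

column = df["Individual2"]   # a string selects a column by label
rows = df[0:2]               # an integer slice selects rows by position

# A bare integer is treated as a column label, so it fails here because
# no column is named 0.
try:
    df[0]
    integer_lookup_failed = False
except KeyError:
    integer_lookup_failed = True

# Selecting the first column by position is done with iloc instead.
first_column = df.iloc[:, 0]

print(list(rows.index), integer_lookup_failed, first_column.name)
```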
pandas.DataFrame.apply
Method Specs
pandas.DataFrame.apply(func, axis=0, args=())

Argument    Default    Meaning
func        -          Function that is applied to the columns or rows
axis        0          Axis along which the function is applied (0: along the rows, i.e. once per column; 1: along the columns, i.e. once per row)
args        ()         Positional arguments passed to the function
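How axis and args interact can be sketched with a function that takes one extra parameter. The helper deviation_from and the reference height 181.0 are hypothetical and exist only for this example.

```python
import pandas as pnd

# Hypothetical 2 x 2 excerpt of the height table.
df = pnd.DataFrame({"Individual1": [170.0, 180.0],
                    "Individual2": [175.0, 185.0]},
                   index=["group0", "group1"])

def deviation_from(values, reference):
    """Mean of the values minus a reference height (hypothetical helper)."""
    return values.mean() - reference

# axis=0: the function receives each column in turn.
per_column = df.apply(deviation_from, axis=0, args=(181.0,))

# axis=1: the function receives each row in turn.
per_row = df.apply(deviation_from, axis=1, args=(181.0,))

print(per_column)
print(per_row)
```

Note that args must be a tuple, hence the trailing comma in (181.0,).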
Apply along the rows (axis=0): one result per column

In [15]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

def mean(x):
    return (sum(x)/len(x))

# Apply a function to every column (axis=0)
means_per_column = df.apply(mean, axis=0)
print(means_per_column)

Individual1    187.242333
Individual2    183.566917
Individual3    187.254260
Individual4    183.432380
Individual5    183.033183
dtype: float64

Apply along the columns (axis=1): one result per row

In [18]:
# Apply a function to every row (axis=1)
means_per_row = df.apply(mean, axis=1)
print(means_per_row)

group0    179.019847
group1    182.668433
group2    183.105226
group3    178.387636
group4    184.637591
group5    191.227726
group6    191.560997
group7    188.639061
dtype: float64
Combining DataFrames
pandas.concat

In [27]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)
print(df)

# Split the DataFrame into three pieces and glue them back together
df_pieces = [df[:1], df[1:3], df[3:]]
print(pnd.concat(df_pieces))

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218
        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218
pandas.merge

In [42]:
import numpy as np
import pandas as pnd
left = pnd.DataFrame(np.array([["foo", 1], ["foo", 2]]), columns=["key", "lval"])
print(left)

   key lval
0  foo    1
1  foo    2

In [43]:
right = pnd.DataFrame(np.array([["foo", 4], ["foo", 5]]), columns=["key", "rval"])
print(right)

   key rval
0  foo    4
1  foo    5

In [44]:
# Every left row is paired with every right row that has the same key
left_right_merged = pnd.merge(left, right, on="key")
print(left_right_merged)

   key lval rval
0  foo    1    4
1  foo    1    5
2  foo    2    4
3  foo    2    5
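The example above uses the same key in every row, which hides what happens to keys that occur in only one of the two tables. The sketch below uses invented keys "bar" and "baz" to show this, together with the how parameter of pandas.merge, which the slides do not cover.

```python
import pandas as pnd

# Hypothetical tables: "bar" exists only on the left, "baz" only on the right.
left = pnd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pnd.DataFrame({"key": ["foo", "foo", "baz"], "rval": [4, 5, 6]})

# Default inner join: only keys present in both frames survive, and every
# matching left/right pair produces one output row.
inner = pnd.merge(left, right, on="key")

# Left join: keep all rows of `left`; unmatched rows get NaN in rval.
left_join = pnd.merge(left, right, on="key", how="left")

print(inner)
print(left_join)
```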
pandas.DataFrame.append()

In [45]:
import pandas as pnd
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)
print(df)

        Individual1  Individual2  Individual3  Individual4  Individual5
group0   177.383691   187.517906   182.945059   178.020507   169.232070
group1   179.623133   178.138221   188.529097   185.312991   181.738726
group2   178.894888   182.964432   181.898760   183.025629   188.742420
group3   188.774235   179.007723   183.765283   168.775728   171.615213
group4   186.496567   193.176500   190.919088   176.547197   176.048604
group5   196.132422   183.594067   196.302713   184.401609   195.707818
group6   200.641538   172.983882   189.127611   190.970565   204.081392
group7   189.992194   191.152608   184.546473   200.404814   177.099218

In [46]:
# Append the last row once more; ignore_index=True renumbers the rows
s = df.iloc[7]
print(df.append(s, ignore_index=True))

   Individual1  Individual2  Individual3  Individual4  Individual5
0   177.383691   187.517906   182.945059   178.020507   169.232070
1   179.623133   178.138221   188.529097   185.312991   181.738726
2   178.894888   182.964432   181.898760   183.025629   188.742420
3   188.774235   179.007723   183.765283   168.775728   171.615213
4   186.496567   193.176500   190.919088   176.547197   176.048604
5   196.132422   183.594067   196.302713   184.401609   195.707818
6   200.641538   172.983882   189.127611   190.970565   204.081392
7   189.992194   191.152608   184.546473   200.404814   177.099218
8   189.992194   191.152608   184.546473   200.404814   177.099218
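DataFrame.append was removed in pandas 2.0, so on current installations the same result has to be obtained with pandas.concat. One possible way to write it, shown on an invented two-row table:

```python
import pandas as pnd

# Hypothetical excerpt of the height table.
df = pnd.DataFrame({"Individual1": [177.4, 179.6],
                    "Individual2": [187.5, 178.1]},
                   index=["group0", "group1"])

# Take the last row as a Series, turn it back into a one-row DataFrame,
# and concatenate; ignore_index=True renumbers the rows 0..n-1.
last_row = df.iloc[-1]
extended = pnd.concat([df, last_row.to_frame().T], ignore_index=True)

print(extended)
```

The step to_frame().T is needed because a Series converts to a single column, whereas a single row is wanted here.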
pandas.DataFrame.plot()
Method Specs
pandas.DataFrame.plot(kind=plot_type)
The plot method of a pandas DataFrame can be used to create a given type of plot for all columns. The keyword argument kind specifies which type of plot is created.
Example: Boxplots

In [31]:
import pandas as pnd
%matplotlib inline
df = pnd.read_csv("data/male_heights.csv", sep="\t", index_col=0)

In [32]:
# Draw a boxplot for every group (i.e. every row)
df.T.plot(kind="box")

Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0xad9c82cc>
Example: Histograms

In [27]:
import numpy as np
import pandas as pnd
import matplotlib.pyplot as plt
df = pnd.DataFrame(np.array([np.random.randn(1000), [ i + 10 for i in np.random.randn(1000) ]]))

In [30]:
fig = plt.figure(figsize=(10, 3))

# Plot a histogram of the first row
plt.subplot(1,2,1)
df.iloc[0].T.plot(kind="hist")

# Plot a histogram of the second row
plt.subplot(1,2,2)
df.iloc[1].T.plot(kind="hist")

Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0xadcfad2c>