Pandas - Aggregations

  • Overview

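    Once rolling, expanding, and ewm objects have been created, several methods are available for performing aggregations on the data.
  • Applying Aggregations on a DataFrame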
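
    As a rough illustration of the three window objects named above, the sketch below builds each one on a small hand-made Series and calls one aggregation method on it. The Series `s`, the window size of 2, and `alpha=0.5` are assumptions chosen for this example; they do not come from the tutorial.

```python
import pandas as pd

# A small deterministic Series so the results are easy to check by hand.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: mean of each window of 2 consecutive values (first result is NaN).
rolling_mean = s.rolling(window=2).mean()

# Expanding: running total over all values seen so far.
expanding_sum = s.expanding().sum()

# Exponentially weighted: recursive form y[t] = 0.5 * y[t-1] + 0.5 * x[t].
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

print(rolling_mean.tolist())
print(expanding_sum.tolist())
print(ewm_mean.tolist())
```

    All three objects are lazy: nothing is computed until an aggregation method such as `mean()` or `sum()` is called on them.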

    Let us create a DataFrame and apply aggregations on it.
    
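    import pandas as pd
    import numpy as np

    # 10 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(10, 4),
       index = pd.date_range('1/1/2000', periods=10),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    # Rolling window over 3 rows; min_periods=1 yields a result from the first row on
    r = df.rolling(window=3, min_periods=1)
    print(r)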
    
    Its output is as follows:
    
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   0.790670   -0.387854   -0.668132    0.267283
    2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
    2000-01-04   1.669653    1.211759   -0.254695    1.429166
    2000-01-05   0.100568   -0.236184    0.491646   -0.466081
    2000-01-06   0.155172    0.992975   -1.205134    0.320958
    2000-01-07   0.309468   -0.724053   -1.412446    0.627919
    2000-01-08   0.099489   -1.028040    0.163206   -1.274331
    2000-01-09   1.639500   -0.068443    0.714008   -0.565969
    2000-01-10   0.326761    1.479841    0.664282   -1.361169
    Rolling [window=3,min_periods=1,center=False,axis=0]                
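
    The `min_periods=1` argument used above controls how many observations a window needs before it produces a value. The sketch below contrasts it with the default behaviour on a hand-made Series; the data is an assumption chosen so the sums can be checked mentally.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default for an integer window: min_periods equals the window size,
# so the first two positions do not have enough data and come out as NaN.
strict = s.rolling(window=3).sum()

# min_periods=1: a value is produced as soon as one observation is available,
# so the first windows are simply shorter than 3.
lenient = s.rolling(window=3, min_periods=1).sum()

print(strict.tolist())
print(lenient.tolist())
```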
    
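    We can aggregate either by passing a function to the rolling object of the whole DataFrame, or by first selecting a column with the standard get item method (square-bracket indexing).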

    Apply Aggregation on a Whole DataFrame

    
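    import pandas as pd
    import numpy as np

    # 10 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(10, 4),
       index = pd.date_range('1/1/2000', periods=10),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    r = df.rolling(window=3, min_periods=1)
    # Sum every column over each rolling window
    print(r.aggregate(np.sum))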
    
    Its output is as follows:
    
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   0.790670   -0.387854   -0.668132    0.267283
    2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
    2000-01-04   1.669653    1.211759   -0.254695    1.429166
    2000-01-05   0.100568   -0.236184    0.491646   -0.466081
    2000-01-06   0.155172    0.992975   -1.205134    0.320958
    2000-01-07   0.309468   -0.724053   -1.412446    0.627919
    2000-01-08   0.099489   -1.028040    0.163206   -1.274331
    2000-01-09   1.639500   -0.068443    0.714008   -0.565969
    2000-01-10   0.326761    1.479841    0.664282   -1.361169
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   1.879182   -1.038796   -3.215581   -0.299575
    2000-01-03   1.303660   -2.003821   -3.155154   -2.479355
    2000-01-04   1.884801   -0.141119   -0.862400   -0.483331
    2000-01-05   1.194699    0.010551    0.297378   -1.216695
    2000-01-06   1.925393    1.968551   -0.968183    1.284044
    2000-01-07   0.565208    0.032738   -2.125934    0.482797
    2000-01-08   0.564129   -0.759118   -2.454374   -0.325454
    2000-01-09   2.048458   -1.820537   -0.535232   -1.212381
    2000-01-10   2.065750    0.383357    1.541496   -3.201469
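
    As a hedged cross-check of the whole-frame aggregation, the sketch below compares three ways of asking for the same rolling sum. The string alias `'sum'` and the direct `r.sum()` call are alternatives that the tutorial itself does not use, and the seeded generator is an assumption added so that the run is reproducible.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption) so the frame is the same on every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)),
                  index=pd.date_range('2000-01-01', periods=10),
                  columns=['A', 'B', 'C', 'D'])

r = df.rolling(window=3, min_periods=1)

by_name = r.aggregate('sum')   # aggregation given as a string alias
by_method = r.sum()            # dedicated method on the rolling object

# Manual check for the window that ends on the fifth row (rows 2, 3 and 4).
manual_row = df.iloc[2:5].sum()

print(np.allclose(by_name, by_method))
print(np.allclose(by_method.iloc[4], manual_row))
```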
    

    Apply Aggregation on a Single Column of a DataFrame

    
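    import pandas as pd
    import numpy as np

    # 10 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(10, 4),
       index = pd.date_range('1/1/2000', periods=10),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    r = df.rolling(window=3, min_periods=1)
    # Select column A from the rolling object, then sum each window
    print(r['A'].aggregate(np.sum))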
    
    Its output is as follows:
    
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   0.790670   -0.387854   -0.668132    0.267283
    2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
    2000-01-04   1.669653    1.211759   -0.254695    1.429166
    2000-01-05   0.100568   -0.236184    0.491646   -0.466081
    2000-01-06   0.155172    0.992975   -1.205134    0.320958
    2000-01-07   0.309468   -0.724053   -1.412446    0.627919
    2000-01-08   0.099489   -1.028040    0.163206   -1.274331
    2000-01-09   1.639500   -0.068443    0.714008   -0.565969
    2000-01-10   0.326761    1.479841    0.664282   -1.361169
    2000-01-01   1.088512
    2000-01-02   1.879182
    2000-01-03   1.303660
    2000-01-04   1.884801
    2000-01-05   1.194699
    2000-01-06   1.925393
    2000-01-07   0.565208
    2000-01-08   0.564129
    2000-01-09   2.048458
    2000-01-10   2.065750
    Freq: D, Name: A, dtype: float64
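
    To make the column selection concrete, here is a minimal sketch on hand-made data (the values are assumptions, not the tutorial's random numbers). It shows that indexing the rolling object with a single column name yields a Series result that keeps the column name.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'B': [10.0, 20.0, 30.0, 40.0, 50.0]},
                  index=pd.date_range('2000-01-01', periods=5))

r = df.rolling(window=3, min_periods=1)

# Indexing with one column name restricts the rolling object to that column.
a_sum = r['A'].sum()

# The window ending at position 4 covers positions 2, 3 and 4.
manual_last = df['A'].iloc[2:5].sum()

print(type(a_sum).__name__, a_sum.name)
print(a_sum.tolist())
```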
    

    Apply Aggregation on Multiple Columns of a DataFrame

    
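    import pandas as pd
    import numpy as np

    # 10 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(10, 4),
       index = pd.date_range('1/1/2000', periods=10),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    r = df.rolling(window=3, min_periods=1)
    # A list of column names selects several columns at once
    print(r[['A','B']].aggregate(np.sum))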
    
    Its output is as follows:
    
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   0.790670   -0.387854   -0.668132    0.267283
    2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
    2000-01-04   1.669653    1.211759   -0.254695    1.429166
    2000-01-05   0.100568   -0.236184    0.491646   -0.466081
    2000-01-06   0.155172    0.992975   -1.205134    0.320958
    2000-01-07   0.309468   -0.724053   -1.412446    0.627919
    2000-01-08   0.099489   -1.028040    0.163206   -1.274331
    2000-01-09   1.639500   -0.068443    0.714008   -0.565969
    2000-01-10   0.326761    1.479841    0.664282   -1.361169
                        A           B
    2000-01-01   1.088512   -0.650942
    2000-01-02   1.879182   -1.038796
    2000-01-03   1.303660   -2.003821
    2000-01-04   1.884801   -0.141119
    2000-01-05   1.194699    0.010551
    2000-01-06   1.925393    1.968551
    2000-01-07   0.565208    0.032738
    2000-01-08   0.564129   -0.759118
    2000-01-09   2.048458   -1.820537
    2000-01-10   2.065750    0.383357
    

    Apply Multiple Functions on a Single Column of a DataFrame

    
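    import pandas as pd
    import numpy as np

    # 10 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(10, 4),
       index = pd.date_range('1/1/2000', periods=10),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    r = df.rolling(window=3, min_periods=1)
    # A list of functions produces one result column per function
    print(r['A'].aggregate([np.sum,np.mean]))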
    
    Its output is as follows:
    
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   0.790670   -0.387854   -0.668132    0.267283
    2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
    2000-01-04   1.669653    1.211759   -0.254695    1.429166
    2000-01-05   0.100568   -0.236184    0.491646   -0.466081
    2000-01-06   0.155172    0.992975   -1.205134    0.320958
    2000-01-07   0.309468   -0.724053   -1.412446    0.627919
    2000-01-08   0.099489   -1.028040    0.163206   -1.274331
    2000-01-09   1.639500   -0.068443    0.714008   -0.565969
    2000-01-10   0.326761    1.479841    0.664282   -1.361169
                      sum       mean
    2000-01-01   1.088512   1.088512
    2000-01-02   1.879182   0.939591
    2000-01-03   1.303660   0.434553
    2000-01-04   1.884801   0.628267
    2000-01-05   1.194699   0.398233
    2000-01-06   1.925393   0.641798
    2000-01-07   0.565208   0.188403
    2000-01-08   0.564129   0.188043
    2000-01-09   2.048458   0.682819
    2000-01-10   2.065750   0.688583
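
    The relationship between the two result columns can be sketched as follows: within each window, the mean is the sum divided by the number of observations actually present. The hand-made data and the use of string aliases instead of NumPy functions are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0]},
                  index=pd.date_range('2000-01-01', periods=5))

r = df.rolling(window=3, min_periods=1)

# One result column per requested function, named after the function.
stats = r['A'].aggregate(['sum', 'mean'])

# Number of observations in each window: 1, 2, then 3 once the window is full.
counts = r['A'].count()

# Rebuild the mean from the sum and the count to confirm the relationship.
recomputed_mean = stats['sum'] / counts

print(stats)
print(recomputed_mean.tolist())
```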
    

    Apply Multiple Functions on Multiple Columns of a DataFrame

    
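    import pandas as pd
    import numpy as np

    # 10 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(10, 4),
       index = pd.date_range('1/1/2000', periods=10),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    r = df.rolling(window=3, min_periods=1)
    # Every listed function is applied to every selected column
    print(r[['A','B']].aggregate([np.sum,np.mean]))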
    
    Its output is as follows:
    
                        A           B           C           D
    2000-01-01   1.088512   -0.650942   -2.547450   -0.566858
    2000-01-02   0.790670   -0.387854   -0.668132    0.267283
    2000-01-03  -0.575523   -0.965025    0.060427   -2.179780
    2000-01-04   1.669653    1.211759   -0.254695    1.429166
    2000-01-05   0.100568   -0.236184    0.491646   -0.466081
    2000-01-06   0.155172    0.992975   -1.205134    0.320958
    2000-01-07   0.309468   -0.724053   -1.412446    0.627919
    2000-01-08   0.099489   -1.028040    0.163206   -1.274331
    2000-01-09   1.639500   -0.068443    0.714008   -0.565969
    2000-01-10   0.326761    1.479841    0.664282   -1.361169
                        A                      B
                      sum       mean         sum        mean
    2000-01-01   1.088512   1.088512   -0.650942   -0.650942
    2000-01-02   1.879182   0.939591   -1.038796   -0.519398
    2000-01-03   1.303660   0.434553   -2.003821   -0.667940
    2000-01-04   1.884801   0.628267   -0.141119   -0.047040
    2000-01-05   1.194699   0.398233    0.010551    0.003517
    2000-01-06   1.925393   0.641798    1.968551    0.656184
    2000-01-07   0.565208   0.188403    0.032738    0.010913
    2000-01-08   0.564129   0.188043   -0.759118   -0.253039
    2000-01-09   2.048458   0.682819   -1.820537   -0.606846
    2000-01-10   2.065750   0.688583    0.383357    0.127786
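
    The two-level header in the output above is a column MultiIndex of (column, function) pairs. The sketch below, on hand-made data that is an assumption of this example, shows one way to pick a single result out of it and one way to flatten the header; the `A_sum` style of flat names is an illustrative choice, not a pandas convention.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]},
                  index=pd.date_range('2000-01-01', periods=4))

r = df.rolling(window=3, min_periods=1)
result = r[['A', 'B']].aggregate(['sum', 'mean'])

# Columns are (original column, function name) tuples.
print(result.columns.tolist())

# Select one result with a tuple key.
a_mean = result[('A', 'mean')]

# Flatten the two header levels into single strings such as 'A_sum'.
flat = result.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]
print(flat.columns.tolist())
```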
    

    Apply Different Functions to Different Columns of a DataFrame

    
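    import pandas as pd
    import numpy as np

    # 3 days x 4 columns of random values (no seed, so values vary per run)
    df = pd.DataFrame(np.random.randn(3, 4),
       index = pd.date_range('1/1/2000', periods=3),
       columns = ['A', 'B', 'C', 'D'])
    print(df)

    r = df.rolling(window=3, min_periods=1)
    # The dict maps each column name to the function applied to it
    print(r.aggregate({'A' : np.sum,'B' : np.mean}))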
    
    Its output is as follows:
    
                        A          B          C         D
    2000-01-01  -1.575749  -1.018105   0.317797  0.545081
    2000-01-02  -0.164917  -1.361068   0.258240  1.113091
    2000-01-03   1.258111   1.037941  -0.047487  0.867371
                        A          B
    2000-01-01  -1.575749  -1.018105
    2000-01-02  -1.740666  -1.189587
    2000-01-03  -0.482555  -0.447078
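
    The output above can be reproduced deterministically by typing the printed frame in by hand instead of drawing random numbers. This is only a sketch: the values are copied from the printed table (so they are rounded to six decimals), and the string aliases `'sum'` and `'mean'` stand in for the NumPy functions used in the tutorial.

```python
import pandas as pd

# Values copied from the printed DataFrame above (rounded to six decimals).
df = pd.DataFrame({'A': [-1.575749, -0.164917, 1.258111],
                   'B': [-1.018105, -1.361068, 1.037941],
                   'C': [0.317797, 0.258240, -0.047487],
                   'D': [0.545081, 1.113091, 0.867371]},
                  index=pd.date_range('1/1/2000', periods=3))

r = df.rolling(window=3, min_periods=1)

# Rolling sum for column A, rolling mean for column B.
result = r.aggregate({'A': 'sum', 'B': 'mean'})

# Only the columns named in the dict appear in the result; C and D are left out.
print(result.columns.tolist())
print(result.round(6))
```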