Examples of viewing data in pandas

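pandas has many useful functions for inspecting the rows and columns of your data. The examples come from the official documentation; the comments here restate them in plain language so they are easier to follow.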

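First, create a pandas DataFrame to use in the examples. The session assumes numpy and pandas have already been imported as np and pd (import numpy as np, import pandas as pd).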

In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
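
Because np.random.randn draws new random numbers on every run, the values you get will not match the ones printed above. A minimal sketch of a repeatable setup follows; the seed value 0 and the use of np.random.default_rng are assumptions made here for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Assumption: seed 0 is arbitrary; it only makes repeated runs give the same numbers.
rng = np.random.default_rng(0)

# Six consecutive days starting 2013-01-01, used as the row index.
dates = pd.date_range("20130101", periods=6)

# 6 rows x 4 columns of standard-normal values, with columns named A, B, C, D.
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)             # (6, 4)
print(df.columns.tolist())  # ['A', 'B', 'C', 'D']
```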

In [14]: df.head()  # head() shows the first few rows; pass a number, e.g. head(20), to see the first 20 rows. With no argument it defaults to the first 5.
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [15]: df.tail(3)  # tail() shows the last few rows and is used the same way as head().
Out[15]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
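
To make the row counts easy to check, here is a small sketch of head() and tail(). It swaps the random values for np.arange(24), which is an assumption made only so the result is deterministic.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Assumption: fixed values 0..23 instead of random numbers, for a repeatable example.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

first_two = df.head(2)     # first 2 rows
last_three = df.tail(3)    # last 3 rows
default_head = df.head()   # no argument: first 5 rows
too_many = df.head(20)     # more rows requested than exist: all 6 rows come back

print(len(first_two), len(last_three), len(default_head), len(too_many))  # 2 3 5 6
```
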
Display the index, columns, and the underlying NumPy data:

In [16]: df.index  # the index attribute holds the row labels (here, the dates)
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]: df.columns  # the columns attribute holds the column names
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [18]: df.values  # the values attribute returns the data as a NumPy array
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
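
The three attributes can also be pulled out as plain Python and NumPy objects. The sketch below uses fixed values (an assumption, for repeatability) and DataFrame.to_numpy(), which recent pandas documentation recommends over .values; for an all-numeric frame like this one the two return the same array.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Assumption: fixed float values 0.0..23.0 instead of random numbers.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row_labels = df.index.tolist()    # list of 6 Timestamp objects
col_labels = df.columns.tolist()  # list of column names
data = df.to_numpy()              # 2-D NumPy array holding the data

print(col_labels)                       # ['A', 'B', 'C', 'D']
print(data.shape, data.dtype)           # (6, 4) float64
print(np.array_equal(data, df.values))  # True
```
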
describe() shows a quick statistic summary of your data:

In [19]: df.describe()  # quick statistical summary of each column: count (number of non-null values), mean, std (standard deviation), min, 25% (first quartile, Q1: the value 25% of the way through the data sorted from smallest to largest), 50% (median), 75% (third quartile, Q3), max
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
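
A worked example on hand-checkable numbers may help with reading the quartile rows. The single column 1.0 to 6.0 below is an assumption chosen for easy arithmetic. It shows that the 25% value need not be one of the data points: by default pandas interpolates linearly between neighbours, and std is the sample standard deviation (dividing by n - 1).

```python
import pandas as pd

# Assumption: a tiny hand-made column so every statistic can be checked by hand.
small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
summary = small.describe()

# Position of the 25% point is (6 - 1) * 0.25 = 1.25, i.e. a quarter of the way
# from the 2nd value (2.0) to the 3rd value (3.0), which gives 2.25.
print(summary.loc["count", "A"])  # 6.0
print(summary.loc["mean", "A"])   # 3.5
print(summary.loc["25%", "A"])    # 2.25
print(summary.loc["50%", "A"])    # 3.5
print(summary.loc["75%", "A"])    # 4.75

# The same numbers are available from the individual methods.
q1 = small["A"].quantile(0.25)
sample_std = small["A"].std()     # square root of 3.5 for this column
```
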
Transposing your data:

In [20]: df.T  # transpose: columns become rows and rows become columns
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
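
A quick way to see what transposing does is to compare shapes and labels before and after. This is only a sketch, again using fixed values as an assumption.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Assumption: fixed values 0..23 instead of random numbers.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

t = df.T

print(df.shape, t.shape)  # (6, 4) (4, 6)
print(t.index.tolist())   # ['A', 'B', 'C', 'D']

# The old row labels (dates) are now the column labels, so the same cell
# is reached with the two labels swapped.
same_cell = t.loc["B", dates[2]] == df.loc[dates[2], "B"]
print(same_cell)          # True
```
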
Sorting by an axis:

The axis parameter takes 0 or 1. A DataFrame has two sets of labels: the dates down the left side (the row index) and A, B, C, D across the top (the column labels). axis=0 sorts by the row index and axis=1 sorts by the column labels; ascending=False means descending order and ascending=True means ascending order. So sort_index(axis=0, ascending=False) would list the dates in descending order, while sort_index(axis=1, ascending=False), used here, arranges the columns in descending order D, C, B, A:

In [21]: df.sort_index(axis=1, ascending=False)  # sort the column labels in descending order
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
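
The two axis settings can be compared side by side. A minimal sketch, with fixed values as an assumption; note that sort_index returns a new DataFrame and leaves the original as it was.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Assumption: fixed values 0..23 instead of random numbers.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_columns = df.sort_index(axis=1, ascending=False)  # reorder columns: D, C, B, A
by_rows = df.sort_index(axis=0, ascending=False)     # reorder rows: latest date first

print(by_columns.columns.tolist())  # ['D', 'C', 'B', 'A']
print(by_rows.index[0].date())      # 2013-01-06
print(df.columns.tolist())          # ['A', 'B', 'C', 'D'] (original unchanged)
```
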
Sorting by values:

In [22]: df.sort_values(by='B')  # sort the rows by value; 'B' names the column whose values decide the order
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
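
sort_values also accepts ascending=False and a list of columns. The small frame below, with row labels w, x, y, z, is a made-up example (an assumption, not from the original session) so the resulting order can be read off directly.

```python
import pandas as pd

# Assumption: a tiny hand-made frame with row labels w, x, y, z.
df = pd.DataFrame(
    {"A": [2, 1, 2, 1], "B": [0.5, -1.0, -2.0, 3.0]},
    index=list("wxyz"),
)

asc = df.sort_values(by="B")                    # smallest B first
desc = df.sort_values(by="B", ascending=False)  # largest B first
multi = df.sort_values(by=["A", "B"])           # sort by A, break ties with B

print(asc.index.tolist())    # ['y', 'x', 'w', 'z']
print(desc.index.tolist())   # ['z', 'w', 'x', 'y']
print(multi.index.tolist())  # ['x', 'z', 'y', 'w']
```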

