Pandas in Ten Minutes

Published : 2016-05-22 Category : Big Data

1. Creating objects

1. Passing a list creates a Series; pandas assigns a default integer index.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

2. Passing a NumPy array, together with a datetime index and column labels, creates a DataFrame.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
print(dates)

df = pd.DataFrame(np.random.rand(6,4),index=dates,columns=list('ABCD'))
print(df)

3. Passing a dict of objects that can be converted to series-like structures creates a DataFrame.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
print(dates)

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
print(df)

4. Inspect the data type of each column.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
# print(dates)

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
# print(df)
print(df.dtypes)

5. In IPython, Tab completion recognizes every attribute as well as the column names. The commented-out last line of the block below marks where to press Tab.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
# print(dates)

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
# print(df)
print(df.dtypes)
# print(df.<TAB>)

2. Viewing data

1. View the first and last rows of the frame.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
# print(dates)

df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
print(df.head())
print(df.tail(3))

2. Display the index, the columns, and the underlying NumPy data.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

print("Index:", df.index)
print("Columns:", df.columns)
print("Underlying data:", df.values)

3. describe() gives a quick statistical summary of the data.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

print("Quick statistical summary:\n", df.describe())

4. Transpose the data.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

print("Transposed data:\n", df.T)

5. Sort by an axis.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20170206'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})

# axis=1 sorts the column labels; ascending=False gives F, E, ..., A
print("Sorted by axis:\n", df.sort_index(axis=1, ascending=False))

6. Sort by values.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6,4),index=dates,columns=list('ABCD'))

# DataFrame.sort() was removed from pandas; sort_values(by=...) replaces it
print("Sorted by values:\n", df.sort_values(by='B'))

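3. Selection

Standard Python/NumPy expressions for selecting and setting work directly, but for production code the optimized pandas data access methods are recommended: .at, .iat, .loc, and .iloc. Older tutorials also list .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0.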
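As a minimal sketch of how the four accessors relate, the block below reads the same row and the same cell by label and by position. The frame is a hypothetical 3x2 example with made-up values; the names dates and df simply mirror the ones used throughout this post.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: values 0..5 so the results are easy to check by eye.
dates = pd.date_range('20170205', periods=3)
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=list('AB'))

row_by_label = df.loc[dates[1]]          # label-based: a whole row as a Series
row_by_position = df.iloc[1]             # position-based: the same row
scalar_by_label = df.at[dates[1], 'B']   # label-based access to one cell
scalar_by_position = df.iat[1, 1]        # position-based access to one cell

print(row_by_label.equals(row_by_position))
print(scalar_by_label, scalar_by_position)
```

.loc and .iloc accept slices and lists as well as single keys, while .at and .iat only ever address a single cell.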

3.1. Getting

1. Selecting a single column returns a Series; df['A'] is equivalent to df.A.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6,4),index=dates,columns=list('ABCD'))

print("Selecting a single column returns a Series:\n", df['A'])

2. Selecting with [] slices the rows.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6,4),index=dates,columns=list('ABCD'))

print("Selecting with [] slices the rows:\n", df[0:3])

3.2. Selection by label

1. Get a cross section using a label.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6,4),index=dates,columns=list('ABCD'))

print("Cross section by label:\n", df.loc[dates[0]])

2. Select on multiple axes by label.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Selecting on multiple axes by label:\n", df.loc[:, ['A', 'B']])

3. Slice by label; both endpoints are included.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Label slicing:\n", df.loc['20170206':'20170209', ['A', 'B']])

4. Reduce the dimensions of the returned object.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Reducing the dimensions of the returned object:\n", df.loc['20170209', ['A', 'B']])

5. Get a scalar value.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Getting a scalar:\n", df.loc[dates[0], 'A'])

6. Get fast access to a scalar (equivalent to the previous method).
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Fast access to a scalar:\n", df.at[dates[0], 'A'])

3.3. Selection by position

1. Select by position with an integer (this selects a row).
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Selecting by integer position:\n", df.iloc[3])

2. Slice by integer position.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Slicing by integer position, much as in NumPy/Python:\n", df.iloc[3:5, 0:2])

3. Select with lists of integer positions, much as in NumPy/Python.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Selecting with lists of positions, much as in NumPy/Python:\n", df.iloc[[1, 2, 4], [0, 2]])

4. Slice rows explicitly.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Slicing rows:\n", df.iloc[1:3, :])

5. Slice columns explicitly.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Slicing columns:\n", df.iloc[:, 1:3])

6. Get a specific value.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

print("Getting a specific value:\n", df.iloc[1, 1])

3.4. Boolean indexing

1. Select data using the values of a single column.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
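# randn (standard normal) produces negative values too, so the filter actually drops rows;
# rand only yields values in [0, 1)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print("Selecting data using the values of a single column:\n", df[df.A > 0])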

2. Select data with a where operation.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
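# randn (standard normal) produces negative values too; entries failing the condition become NaN
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print("Selecting data with a where operation:\n", df[df > 0])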

3. Filter with the isin() method.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

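df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']  # label column to filter on
print("Filtering with isin():\n", df2[df2['E'].isin(['two', 'four'])])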

3.5. Setting

1. Setting a new column aligns the data by index.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20170205', periods=6))
df['F'] = s1  # values are aligned to df by index label
print("Setting a new column:\n", df)

2. Set a value by label.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))
df.at[dates[0], 'A'] = 0
print("Setting a value by label:\n", df)

3. Set a value by position.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))
df.iat[0, 1] = 0  # row 0, column 1 ('B')
print("Setting a value by position:\n", df)

4. Set a whole column by assigning a NumPy array.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.random.rand(6, 4), index=dates, columns=list('ABCD'))
df.loc[:, 'D'] = np.array([5] * len(df))
print("Setting a column from a NumPy array:\n", df)

5. Set values with a where operation.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dates = pd.date_range('20170205', periods=6)
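# randn (standard normal) produces values of both signs
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2 = df.copy()
df2[df2 > 0] = -df2  # negate every positive entry so no value stays above zero
print("Setting values with a where operation:\n", df2)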

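4. Missing data

pandas uses np.nan to represent missing data; by default these values are excluded from computations.
1. reindex() changes, adds, or deletes the index on a specified axis and returns a copy of the original data.
2. Drop any rows that contain missing data.
3. Fill in missing data.
4. Get a boolean mask that marks the missing values.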
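The four operations listed above can be sketched as follows. This is only a minimal illustration: the frame contents, the added column 'E', and the fill value 5 are assumptions chosen for the example, not values taken from this post.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170205', periods=6)
df = pd.DataFrame(np.arange(24, dtype='float64').reshape(6, 4),
                  index=dates, columns=list('ABCD'))

# 1. reindex returns a copy; the new column 'E' starts out as all NaN.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1   # fill 'E' for the first two rows only

# 2. Drop every row that has at least one missing value.
dropped = df1.dropna(how='any')

# 3. Replace missing values with a constant (5 is an arbitrary choice).
filled = df1.fillna(value=5)

# 4. Boolean mask: True wherever a value is missing.
mask = pd.isna(df1)

print(df1)
print(dropped)
print(filled)
print(mask)
```

None of these calls modifies df itself; each returns a new object.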

5. Statistics
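A minimal sketch of a descriptive statistic on each axis, assuming a small hypothetical frame with one missing value. It also shows the point made in the missing-data section: NaN is skipped by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column 'A' contains one missing value.
df = pd.DataFrame({'A': [1.0, 2.0, np.nan], 'B': [4.0, 5.0, 6.0]})

col_means = df.mean()         # one mean per column; NaN is ignored
row_means = df.mean(axis=1)   # one mean per row

print(col_means)
print(row_means)
```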

6. Merging
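A minimal sketch of the two most common ways to combine frames, pd.concat and pd.merge. The frames left and right and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing a 'key' column.
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

stacked = pd.concat([left, left], ignore_index=True)  # append rows
joined = pd.merge(left, right, on='key')              # SQL-style join on 'key'

print(stacked)
print(joined)
```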

7. Grouping
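A minimal sketch of the split-apply-combine pattern with groupby, assuming a hypothetical frame whose column 'A' holds the group labels and whose column 'C' holds the values to aggregate.

```python
import pandas as pd

# Hypothetical data: two groups, 'foo' and 'bar'.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1, 2, 3, 4]})

# Split the rows by 'A', then sum 'C' within each group.
totals = df.groupby('A')['C'].sum()

print(totals)
```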

8. Reshaping
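A minimal sketch of reshaping with stack() and unstack(), assuming a hypothetical 2x2 frame. stack() moves the column labels into an inner index level, and unstack() moves them back.

```python
import pandas as pd

# Hypothetical frame with row labels 'a', 'b' and columns 'x', 'y'.
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])

stacked = df.stack()          # Series indexed by (row label, column label)
restored = stacked.unstack()  # back to a frame with columns 'x' and 'y'

print(stacked)
print(restored)
```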

9. Time series
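A minimal sketch of frequency conversion with resample(), assuming a hypothetical daily series that is summed into three-day bins.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series holding the values 0..5.
rng = pd.date_range('20170205', periods=6, freq='D')
ts = pd.Series(np.arange(6), index=rng)

# Downsample: add up the values inside each 3-day bin.
three_day_totals = ts.resample('3D').sum()

print(three_day_totals)
```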

10. Categoricals
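A minimal sketch of converting a column to the category dtype and renaming its categories. The column names and the grade labels are hypothetical.

```python
import pandas as pd

# Hypothetical raw grades stored as plain strings.
df = pd.DataFrame({'id': [1, 2, 3], 'raw_grade': ['a', 'b', 'a']})

# Convert to a categorical column; categories are inferred and sorted.
df['grade'] = df['raw_grade'].astype('category')

# Give the categories more descriptive names (returns a new Series).
renamed = df['grade'].cat.rename_categories(['very good', 'good'])

print(df['grade'].cat.categories)
print(renamed)
```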

11. Plotting

12. Loading and saving data
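A minimal sketch of a CSV round trip with to_csv() and read_csv(). It writes to an in-memory buffer, which is an assumption made to keep the example self-contained; a file path can be passed in the same position.

```python
import io

import pandas as pd

# Hypothetical frame to save and reload.
df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
buffer.seek(0)                   # rewind before reading the text back
loaded = pd.read_csv(buffer)

print(loaded)
```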

Author : Matrix
Original link : https://matrixsparse.github.io/2016/05/22/十分钟搞定Pandas/
License : Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.
