pandas Tips for Data Analysis: How to Filter Data

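In this article we walk through the various ways to filter a pandas DataFrame in Python. Filtering is one of the most common data manipulation operations. It is similar to the WHERE clause in SQL, or to using filters in MS Excel to select specific rows based on some conditions. Python handles filtering and aggregation efficiently thanks to a great library: pandas. pandas is built on top of the NumPy package, whose core is written in C, a low-level language. As a result, using pandas for data manipulation is a fast and convenient way to work with large datasets.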

Examples of Data Filtering

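Filtering is one of the first steps of data preparation for predictive modeling or any reporting project. It is also known as "subsetting data". Some examples of data filtering are listed below.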

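Select all active customers who opened an account after January 1, 2019
Extract details of all customers who made more than 3 transactions in the past 6 months
Get information on employees who have worked in the organization for more than 3 years and received the highest rating in the past two years
Analyze complaint data and identify customers who filed more than 5 complaints in the past year
Extract details of metropolitan cities where per-capita income exceeds USD 40K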

Importing the Data

We will use a dataset containing details of flights that departed from New York in 2013. The dataset has 32,735 rows and 16 columns. Download: itbooks.pipipan.com/fs/18113597…

The column headers are:

['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time', 'arr_delay', 'carrier', 'tailnum', 'flight', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute']

Load the CSV file into a data frame:

import pandas as pd
df = pd.read_csv("nycflights.csv")

Filtering by Column Values

Select the details of JetBlue Airways flights (two-letter carrier code B6) that originate from JFK airport.

Method 1: The DataFrame way

>>> newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]
>>> newdf.head()
    year  month  day  dep_time  dep_delay  arr_time  arr_delay carrier tailnum  flight origin dest  air_time  distance  hour  minute
7   2013      8   13      1920       85.0      2032       71.0      B6  N284JB    1407    JFK  IAD      48.0     228.0  19.0    20.0
10  2013      6   17       940        5.0      1050       -4.0      B6  N351JB      20    JFK  ROC      50.0     264.0   9.0    40.0
14  2013     10   21      1217       -4.0      1322       -6.0      B6  N192JB      34    JFK  BTV      46.0     266.0  12.0    17.0
23  2013      7    7      2310      105.0       201      127.0      B6  N506JB      97    JFK  DEN     223.0    1626.0  23.0    10.0
35  2013      4   12       840       20.0      1240       28.0      B6  N655JB     403    JFK  SJU     186.0    1598.0   8.0    40.0

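The expression (df.origin == "JFK") & (df.carrier == "B6") returns a Series of True / False values: True where the condition matches and False where it does not. That Series is then passed into df[...], which returns all rows corresponding to True. Here it returns 4166 rows.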
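To make the mask mechanics concrete, here is a minimal sketch on a small made-up frame (the `flights` data below is hypothetical, not the nycflights file). Each comparison needs its own parentheses because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "LGA", "EWR", "JFK"],
    "carrier": ["B6", "AA", "B6", "UA", "B6"],
})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (flights.origin == "JFK") & (flights.carrier == "B6")
print(mask.tolist())   # [True, False, False, False, True]

# Indexing with the mask keeps only the rows where it is True
selected = flights[mask]

# Summing a boolean Series counts the True values, i.e. the matching rows
print(len(selected), int(mask.sum()))   # 2 2
```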

Method 2: The query function

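pandas offers several ways to filter. The code above can also be written as shown below. This method is more elegant and more readable, and you do not have to repeat the data frame name every time you specify a column (variable).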

>>> newdf = df.query('origin == "JFK" & carrier == "B6"')

Method 3: The loc function

loc is short for location. All three methods return the same output; loc is just another way of filtering rows.

>>> newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
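The equivalence of the three methods can be checked directly. The sketch below uses a small hypothetical `flights` frame (made-up values) and also shows two conveniences: `query` can read local variables with the `@` prefix, and `loc` can select columns in the same call.

```python
import pandas as pd

# Hypothetical toy data, not the real nycflights file
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "LGA", "EWR", "JFK"],
    "carrier": ["B6", "AA", "B6", "UA", "B6"],
    "dep_delay": [85.0, 5.0, -4.0, 105.0, 20.0],
})

by_mask = flights[(flights.origin == "JFK") & (flights.carrier == "B6")]
by_query = flights.query('origin == "JFK" and carrier == "B6"')
by_loc = flights.loc[(flights.origin == "JFK") & (flights.carrier == "B6")]

# All three approaches select the same rows
print(by_mask.equals(by_query) and by_mask.equals(by_loc))   # True

# query can reference a local Python variable with the @ prefix
airport = "JFK"
by_var = flights.query("origin == @airport and carrier == 'B6'")

# loc accepts a column list as its second argument
subset = flights.loc[flights.origin == "JFK", ["carrier", "dep_delay"]]
```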

Filtering a pandas DataFrame by Row and Column Position

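Suppose you want to select specific rows by position (say, the second through the fifth row). We can use df.iloc[ ] for this.
Indexing in Python starts at zero. df.iloc[0:5,] refers to the first through fifth rows (the end position 5, which is the sixth row, is excluded). df.iloc[0:5,] is equivalent to df.iloc[:5,]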

df.iloc[:5,] #First 5 rows
df.iloc[1:5,] #Second to Fifth row
df.iloc[5,0] #Sixth row and 1st column
df.iloc[1:5,0] #Second to Fifth row, first column
df.iloc[1:5,:5] #Second to Fifth row, first 5 columns
df.iloc[2:7,1:3] #Third to Seventh row, 2nd and 3rd column
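The same positional patterns can be tried on a tiny frame where the results are easy to verify by eye. This is a sketch on hypothetical data: `grid` and its values are made up so that each cell encodes its own row and column position.

```python
import pandas as pd

# Hypothetical 8 x 4 frame; the value at row r, column c is r * 10 + c
grid = pd.DataFrame(
    [[r * 10 + c for c in range(4)] for r in range(8)],
    columns=["a", "b", "c", "d"],
)

first_five = grid.iloc[:5]               # rows at positions 0 to 4
cell = grid.iloc[5, 0]                   # sixth row, first column: a scalar
block = grid.iloc[2:7, 1:3]              # rows 2 to 6, columns "b" and "c"
last_two = grid.iloc[-2:]                # negative positions count from the end
picked = grid.iloc[[0, 3, 7], [0, 3]]    # explicit lists of positions

print(cell)          # 50
print(block.shape)   # (5, 2)
```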

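loc selects rows by index label, whereas iloc selects rows by their position in the index, so it takes integer positions. Let's create some sample data to illustrate.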

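>>> import numpy as np
>>> x = pd.DataFrame({"col1": np.arange(1, 20, 2)}, index=[9, 8, 7, 6, 0, 1, 2, 3, 4, 5])
>>> x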
  col1
9     1
8     3
7     5
6     7
0     9
1    11
2    13
3    15
4    17
5    19
>>> x.iloc[0:5]
  col1
9     1
8     3
7     5
6     7
0     9
>>> x.loc[0:5]
  col1
0     9
1    11
2    13
3    15
4    17
5    19


Filtering a pandas DataFrame by Row Position and Column Name

>>> df.loc[df.index[0:5],["origin","dest"]]
  origin dest
0    JFK  LAX
1    JFK  SJU
2    JFK  LAX
3    JFK  TPA
4    LGA  ORF
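Mixing positions and names means translating one into the other: `loc` wants labels and `iloc` wants positions. The sketch below shows both directions on a hypothetical `routes` frame (made-up values, with deliberately non-integer row labels); `Index.get_indexer` is used to turn column names into positions.

```python
import pandas as pd

# Hypothetical data; the letter index makes labels and positions clearly differ
routes = pd.DataFrame({
    "origin": ["JFK", "JFK", "JFK", "JFK", "LGA", "EWR"],
    "dest": ["LAX", "SJU", "LAX", "TPA", "ORF", "ORD"],
    "distance": [2475, 1598, 2475, 1005, 296, 719],
}, index=list("uvwxyz"))

# loc needs labels, so look up the labels at row positions 0 to 4
via_loc = routes.loc[routes.index[0:5], ["origin", "dest"]]

# iloc needs positions, so convert the column names to positions
col_pos = routes.columns.get_indexer(["origin", "dest"])
via_iloc = routes.iloc[0:5, col_pos]

print(via_loc.equals(via_iloc))   # True
```

If the row index contained duplicate labels, the loc form could return more rows than expected, so the iloc form is likely the safer choice in that case.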

Selecting Multiple Values from a Column

>>> newdf = df[df.origin.isin(["JFK", "LGA"])]
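As a rough illustration of what isin does, the sketch below compares it with the equivalent chain of `==` tests joined by `|`, and shows how `~` inverts the selection. The `flights` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
flights = pd.DataFrame({"origin": ["JFK", "LGA", "EWR", "JFK", "EWR"]})

in_set = flights[flights.origin.isin(["JFK", "LGA"])]

# Equivalent, but longer: one == comparison per value, joined with |
in_set_or = flights[(flights.origin == "JFK") | (flights.origin == "LGA")]

# ~ inverts the mask: rows whose origin is NOT in the list
not_in_set = flights[~flights.origin.isin(["JFK", "LGA"])]

print(in_set.equals(in_set_or))       # True
print(not_in_set.origin.tolist())     # ['EWR', 'EWR']
```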

Not Equal To

>>> newdf = df.loc[(df.origin != "JFK") & (df.carrier == "B6")]
>>> pd.unique(newdf.origin)
array(['LGA', 'EWR'], dtype=object)

How to Negate an Entire Condition

>>> newdf = df[~((df.origin == "JFK") & (df.carrier == "B6"))]
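Negating a combined condition follows De Morgan's law: "not (A and B)" is the same as "(not A) or (not B)". A small sketch on hypothetical data confirms that both spellings select the same rows.

```python
import pandas as pd

# Hypothetical toy data
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "LGA", "EWR"],
    "carrier": ["B6", "AA", "B6", "UA"],
})

is_jfk = flights.origin == "JFK"
is_b6 = flights.carrier == "B6"

# Everything except JetBlue flights out of JFK
negated = flights[~(is_jfk & is_b6)]

# De Morgan's law: not (A and B) == (not A) or (not B)
expanded = flights[(~is_jfk) | (~is_b6)]

print(negated.equals(expanded))   # True
print(negated.index.tolist())     # [1, 2, 3]
```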

Selecting Non-Missing Data

>>> newdf = df[df.origin.notnull()]  # keep rows where origin is not missing
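Non-missing rows can be selected with notnull() on a single column or with dropna() on the whole frame. The following is a minimal sketch on a hypothetical `delays` frame that contains a few missing values on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing origin and one missing delay
delays = pd.DataFrame({
    "origin": ["JFK", None, "LGA", "EWR"],
    "dep_delay": [85.0, 5.0, np.nan, -4.0],
})

has_origin = delays[delays.origin.notnull()]      # origin is present
complete = delays.dropna()                        # no missing value in any column
has_delay = delays.dropna(subset=["dep_delay"])   # only dep_delay must be present
missing_delay = delays[delays.dep_delay.isnull()] # the opposite selection

print(has_origin.index.tolist())   # [0, 2, 3]
print(complete.index.tolist())     # [0, 3]
```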

Filtering Strings in a pandas DataFrame

>>> df = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "A_2"]})
>>> df
   var1
0  AA_2
1   B_1
2   C_2
3   A_2
>>> df[df['var1'].str[0] == 'A']
   var1
0  AA_2
3   A_2
>>> df[df['var1'].str.len()>3]
   var1
0  AA_2
>>> df[df['var1'].str.contains('A|B')]
   var1
0  AA_2
1   B_1
3   A_2
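A few further string filters may be useful in practice. The sketch below uses a hypothetical `codes` frame that includes a missing value; passing `na=False` treats missing entries as non-matches so the result is a clean boolean mask, and `regex=False` makes `contains` match the pattern literally.

```python
import pandas as pd

# Hypothetical data with a missing entry and a lowercase value
codes = pd.DataFrame({"var1": ["AA_2", "B_1", None, "a_2"]})

# Prefix test; missing values count as False
starts_a = codes[codes["var1"].str.startswith("A", na=False)]

# Case-insensitive substring match
any_case = codes[codes["var1"].str.contains("a", case=False, na=False)]

# Literal (non-regex) substring match
literal = codes[codes["var1"].str.contains("_1", regex=False, na=False)]

print(starts_a.index.tolist())   # [0]
print(any_case.index.tolist())   # [0, 3]
print(literal.index.tolist())    # [1]
```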

This article is from the Alibaba Cloud Developer Community

Original link: developer.aliyun.com/article/715…