
Python Data Science (5): Python Data Analysis Practice

Exploring the Iris Dataset

import pandas as pd
import numpy as np

Reading the dataset

iris = pd.read_csv('iris.csv', header=None)
iris.head()

Dataset download

0 1 2 3 4
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Renaming the DataFrame columns

iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris.head()
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Are there any missing values in the DataFrame?

iris.isna().sum().sum()
# The result is 0,
# which means the DataFrame contains no missing values
0
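
The chained `.sum().sum()` can obscure where the gaps are: the first `.sum()` produces per-column counts, and the second collapses them into one total. A minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for iris (hypothetical values)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

per_column = df.isna().sum()   # missing count per column
total = per_column.sum()       # grand total across all columns
```

Inspecting `per_column` first makes it easy to see which columns need cleaning before collapsing to a single number.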

Setting rows 10 to 19 of the petal_length column to missing values

# Use the default positional index
iris.iloc[9:19, 2] = np.nan
iris.iloc[9:19, 2]
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
Name: petal_length, dtype: float64
iris.isna().sum().sum()
10
# Select the same values with the label-based index
iris.loc[9:18, 'petal_length']
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
Name: petal_length, dtype: float64
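
Note why the two selections use different bounds (`9:19` with `iloc` vs `9:18` with `loc`): `iloc` excludes its end point, while `loc` includes it. A minimal sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

by_position = s.iloc[1:3]  # positional: end point excluded -> positions 1, 2
by_label = s.loc[1:3]      # label-based: end point included -> labels 1, 2, 3
```

Both selections cover the same rows here only because the bounds were chosen accordingly; mixing up the two conventions is a common off-by-one source.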

Replacing all missing petal_length values with 1.0

iris.fillna(value=1.0, inplace=True)
iris.iloc[9:19, 2]
9     1.0
10    1.0
11    1.0
12    1.0
13    1.0
14    1.0
15    1.0
16    1.0
17    1.0
18    1.0
Name: petal_length, dtype: float64
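
Calling `fillna` on the whole frame works here only because petal_length is the sole column with gaps; if other columns also had NaNs, they would be filled with 1.0 too. A safer per-column sketch (toy data, hypothetical values):

```python
import pandas as pd
import numpy as np

# Toy frame (hypothetical values); both columns contain a gap
df = pd.DataFrame({'petal_length': [1.4, np.nan, 1.3],
                   'petal_width': [0.2, np.nan, 0.2]})

# Fill only the intended column; other columns keep their NaNs
df['petal_length'] = df['petal_length'].fillna(1.0)
```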

Dropping the class column

iris.drop(['class'], axis=1, inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Setting the first three rows of the DataFrame to missing values

iris.loc[:2, :] = np.nan
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Dropping rows that contain missing values

iris.dropna(inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
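
By default `dropna` uses `how='any'`, dropping a row if any cell is missing; `how='all'` keeps partially filled rows and drops only rows that are entirely NaN. A quick sketch on toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 2.0, np.nan],
                   'b': [np.nan, 5.0, 6.0]})

dropped_any = df.dropna(how='any')  # default: drop rows with at least one NaN
dropped_all = df.dropna(how='all')  # drop only rows that are entirely NaN
```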

Resetting the index

# Call reset_index
# drop: whether to discard the original row index
# By default the original row index is kept as a new column,
# inserted as the first column of the DataFrame
iris.reset_index(drop=True, inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 4.6 3.1 1.5 0.2
1 5.0 3.6 1.4 0.2
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2
iris_new = iris.iloc[2:, :].copy()  # copy so the index can be reassigned without a SettingWithCopyWarning
iris_new.head()
iris_new.head()
sepal_length sepal_width petal_length petal_width
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2
5 4.4 2.9 1.4 0.2
6 4.9 3.1 1.0 0.1
# Reassign the DataFrame's index attribute directly
iris_new.index = range(len(iris_new))
iris_new.head()
sepal_length sepal_width petal_length petal_width
0 5.4 3.9 1.7 0.4
1 4.6 3.4 1.4 0.3
2 5.0 3.4 1.5 0.2
3 4.4 2.9 1.4 0.2
4 4.9 3.1 1.0 0.1
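
For producing a plain 0..n-1 index, the two approaches (`reset_index(drop=True)` and assigning to `.index`) are interchangeable; a sketch showing they yield identical frames:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=[5, 6, 7])

via_reset = df.reset_index(drop=True)   # approach 1: reset_index
via_assign = df.copy()
via_assign.index = range(len(via_assign))  # approach 2: assign the index attribute
```

`reset_index` is usually preferred because it can also preserve the old index as a column when `drop=False`.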

Exploring the Chipotle Fast-Food Dataset

Dataset download

Reading the data

chipo = pd.read_csv('chipotle.tsv', delimiter='\t')
chipo.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

Viewing the first 10 rows

chipo.head(10)
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $9.25

How many columns does the dataset have?

chipo.shape
(4622, 5)
# Use the DataFrame's shape attribute
index_num, columns_num = chipo.shape  # tuple unpacking
print(f'The chipo DataFrame has {index_num} rows and {columns_num} columns')
The chipo DataFrame has 4622 rows and 5 columns

Printing all column names

chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
# A second way to count the columns
len(chipo.columns)
5

What does the dataset's index look like?

chipo.index
RangeIndex(start=0, stop=4622, step=1)

Which item was ordered the most?

# Group and aggregate to get each item's total ordered quantity
item_all_quantity = chipo.groupby('item_name').agg({'quantity': 'sum'})['quantity']
item_all_quantity[:5]
item_name
6 Pack Soft Drink        55
Barbacoa Bowl            66
Barbacoa Burrito         91
Barbacoa Crispy Tacos    12
Barbacoa Salad Bowl      10
Name: quantity, dtype: int64
item_all_quantity[item_all_quantity == item_all_quantity.max()]
item_name
Chicken Bowl    761
Name: quantity, dtype: int64
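
Comparing against `.max()` returns every item tied for the maximum; when a single winner is enough, `idxmax` is more direct. A sketch on toy order lines (hypothetical rows):

```python
import pandas as pd

# Toy order lines (hypothetical rows)
demo = pd.DataFrame({'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Izze'],
                     'quantity': [2, 1, 3, 1]})

totals = demo.groupby('item_name')['quantity'].sum()
top_item = totals.idxmax()  # label of the single largest total
```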

How many distinct items appear in the item_name column?

# 1. Using a set
len(set(chipo['item_name']))
50
# 2. Using NumPy's unique function
len(np.unique(chipo['item_name']))
50
# 3. Using the pandas Series.unique method
len(chipo['item_name'].unique())
50
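
pandas also offers a direct counter, `Series.nunique`, which skips the intermediate array or set entirely; a sketch:

```python
import pandas as pd

s = pd.Series(['Izze', 'Chicken Bowl', 'Izze', 'Side of Chips'])
n = s.nunique()  # number of distinct values, computed directly
```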

How many items were ordered in total?

chipo['quantity'].sum()
4972

Converting item_price to floating-point numbers

exp_str = chipo['item_price'][0]
exp_str
'$2.39 '
def trans_float(str_):
    return float(str_.replace('$', '').strip())
# apply: runs the function on every element of the column
chipo['item_price'] = chipo['item_price'].apply(trans_float)
chipo.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39
1 1 1 Izze [Clementine] 3.39
2 1 1 Nantucket Nectar [Apple] 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98
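
The same conversion can be done without a Python-level function via pandas' vectorized string methods; a sketch on toy price strings in the same `'$x.yz '` format:

```python
import pandas as pd

# Toy price strings in the same '$x.yz ' format
prices = pd.Series(['$2.39 ', '$3.39 ', '$16.98 '])

as_float = prices.str.replace('$', '', regex=False).str.strip().astype(float)
```

The `.str` accessor keeps the whole pipeline in pandas, which is typically faster than `apply` on large columns.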

How much revenue was earned over the period covered by the dataset?

chipo['revenue'] = chipo['quantity'] * chipo['item_price']
chipo.head()
order_id quantity item_name choice_description item_price revenue
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39 2.39
1 1 1 Izze [Clementine] 3.39 3.39
2 1 1 Nantucket Nectar [Apple] 3.39 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98 33.96
chipo['revenue'].sum()
39237.02

How many orders were placed over the period covered by the dataset?

chipo['order_id'].max()
1834
len(chipo['order_id'].unique())
1834

What is the average total price per order?

# Note: this gives the mean revenue per line item within each order,
# not the total price of each order
chipo.groupby('order_id').agg({'revenue': 'mean'})
revenue
order_id
1 2.890000
2 33.960000
3 6.335000
4 10.500000
5 6.850000
... ...
1830 11.500000
1831 4.300000
1832 6.600000
1833 11.750000
1834 9.583333

1834 rows × 1 columns
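
The aggregation above averages line revenues within each order; to answer the question as asked, sum each order's revenue first and then take the mean of those totals. A sketch on toy order lines (hypothetical rows):

```python
import pandas as pd

# Toy order lines (hypothetical): order 1 has two items, order 2 has one
demo = pd.DataFrame({'order_id': [1, 1, 2],
                     'revenue': [2.39, 3.39, 33.96]})

per_order_total = demo.groupby('order_id')['revenue'].sum()  # total price of each order
avg_order_total = per_order_total.mean()                     # mean of the order totals
```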
