Exploring the Iris Dataset

    import pandas as pd
    import numpy as np

Read the dataset

    iris = pd.read_csv('iris.csv', header=None)
    iris.head()

Dataset download

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
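
Because the file is read with `header=None`, pandas labels the columns 0 through 4. As a minimal sketch, the labels can also be supplied at read time through the `names` parameter of `pd.read_csv`. The two inline rows are a stand-in for the real file, and `iris_demo` is a hypothetical name used only for this illustration.

```python
import io

import pandas as pd

# Two rows copied from the head() output above, used as a stand-in for iris.csv.
csv_text = "5.1,3.5,1.4,0.2,0\n4.9,3.0,1.4,0.2,0\n"

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None says the file has no header row; names= labels the columns directly.
iris_demo = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
print(iris_demo.columns.tolist())
```

This avoids a separate assignment to `columns` after loading, at the cost of a longer `read_csv` call.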
Rename the DataFrame columns

    iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
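Does the DataFrame contain any missing values?

    # First sum() counts NaN per column; second sum() adds the column counts together
    iris.isna().sum().sum()
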
0
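
A grand total of 0 answers the yes/no question, but a per-column breakdown is often more useful once missing values do appear. The following is a small sketch on a made-up frame; `demo` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

per_column = demo.isna().sum()               # NaN count for each column
total = int(per_column.sum())                # same value as demo.isna().sum().sum()
has_missing = bool(demo.isna().any().any())  # True if at least one NaN exists

print(per_column.to_dict(), total, has_missing)
```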
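Set the 10th through 19th rows of the petal_length column to missing values

    # iloc is positional and end-exclusive: positions 9..18 are the 10th through 19th rows
    iris.iloc[9:19, 2] = np.nan
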
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
Name: petal_length, dtype: float64
10

    iris.loc[9:18, 'petal_length']

9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
Name: petal_length, dtype: float64
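
The two selections above return the same ten rows even though the slice bounds differ, because `iloc` slices by position and excludes the end, while `loc` slices by label and includes it. A minimal sketch on a hypothetical 20-row frame with the default integer index:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': np.arange(20, dtype=float)})

by_position = demo.iloc[9:19, 0]   # positions 9..18, end excluded
by_label = demo.loc[9:18, 'x']     # labels 9..18, end included

print(len(by_position), len(by_label))
print(by_position.equals(by_label))
```

The equivalence holds only while labels and positions coincide, as they do with a default `RangeIndex`.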
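Replace all missing values in petal_length with 1.0

    # fillna without a column selection fills NaN in every column;
    # at this point only petal_length contains missing values.
    iris.fillna(value=1.0, inplace=True)
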
9 1.0
10 1.0
11 1.0
12 1.0
13 1.0
14 1.0
15 1.0
16 1.0
17 1.0
18 1.0
Name: petal_length, dtype: float64
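
When other columns may also contain missing values, the fill can be restricted to one column. One option, sketched here on a hypothetical two-column frame, is to pass `fillna` a dict that maps column names to fill values.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'petal_length': [1.4, np.nan, 1.3],
    'petal_width': [0.2, np.nan, 0.2],
})

# A dict maps column name -> fill value; columns not listed are left untouched.
filled = demo.fillna({'petal_length': 1.0})
print(filled)
```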
Drop the class column

    iris.drop(['class'], axis=1, inplace=True)

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
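Set the first three rows of the DataFrame to missing values

    # loc slices by label and includes the end label, so :2 covers rows 0, 1 and 2
    iris.loc[:2, :] = np.nan
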
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
Drop rows that contain missing values

    iris.dropna(inplace=True)

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
Reset the index

    # drop=True discards the old index instead of keeping it as a new column
    iris.reset_index(drop=True, inplace=True)

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 4.6 | 3.1 | 1.5 | 0.2 |
| 1 | 5.0 | 3.6 | 1.4 | 0.2 |
| 2 | 5.4 | 3.9 | 1.7 | 0.4 |
| 3 | 4.6 | 3.4 | 1.4 | 0.3 |
| 4 | 5.0 | 3.4 | 1.5 | 0.2 |
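
Dropping incomplete rows and renumbering the index are often done together. As a sketch of one alternative to the `inplace=True` style, the two calls can be chained so that a new frame is returned and the original is left untouched. The frame `demo` is hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [np.nan, np.nan, 4.6, 5.0], 'b': [np.nan, 3.0, 3.1, 3.6]})

# dropna() removes any row containing a NaN; reset_index(drop=True) renumbers
# the surviving rows from 0 and discards the old labels.
clean = demo.dropna().reset_index(drop=True)
print(clean)
```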

    # .copy() makes iris_new independent of iris before its index is modified
    iris_new = iris.iloc[2:, :].copy()
    iris_new.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 2 | 5.4 | 3.9 | 1.7 | 0.4 |
| 3 | 4.6 | 3.4 | 1.4 | 0.3 |
| 4 | 5.0 | 3.4 | 1.5 | 0.2 |
| 5 | 4.4 | 2.9 | 1.4 | 0.2 |
| 6 | 4.9 | 3.1 | 1.0 | 0.1 |

    # Assigning a fresh range renumbers the rows from 0
    iris_new.index = range(len(iris_new))

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.4 | 3.9 | 1.7 | 0.4 |
| 1 | 4.6 | 3.4 | 1.4 | 0.3 |
| 2 | 5.0 | 3.4 | 1.5 | 0.2 |
| 3 | 4.4 | 2.9 | 1.4 | 0.2 |
| 4 | 4.9 | 3.1 | 1.0 | 0.1 |

Exploring the Chipotle Fast-Food Data
Dataset download
Read the data

    # The file is tab-separated, hence delimiter='\t'
    chipo = pd.read_csv('chipotle.tsv', delimiter='\t')
    chipo.head()

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
View the first 10 rows
| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
| 6 | 3 | 1 | Side of Chips | NaN | $1.69 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |
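
The cell that produced this table is not shown in the source; presumably it was a call such as `chipo.head(10)`. A minimal sketch of how `head` takes a row count, using a hypothetical stand-in frame:

```python
import pandas as pd

demo = pd.DataFrame({'order_id': range(1, 21)})

# head(n) returns the first n rows; with no argument it returns 5.
first_ten = demo.head(10)
print(len(first_ten))
```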
How many columns are in the dataset?
(4622, 5)

    # shape is a (rows, columns) tuple
    index_num, columns_num = chipo.shape
    print(f'chipo has {index_num} rows and {columns_num} columns')

chipo has 4622 rows and 5 columns
Print all the column names
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
5
What does the dataset's index look like?
RangeIndex(start=0, stop=4622, step=1)
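
The cells behind the column-name and index outputs above are not shown in the source. As a sketch under that caveat, attributes such as `shape`, `columns` and `index` would produce output of this kind. The frame `demo` is a hypothetical stand-in.

```python
import pandas as pd

demo = pd.DataFrame({'order_id': [1, 1, 2], 'quantity': [1, 2, 1]})

n_rows, n_cols = demo.shape      # (rows, columns) tuple
col_names = list(demo.columns)   # column labels as a plain list
idx = demo.index                 # a RangeIndex when no index is specified

print(n_rows, n_cols, col_names, idx)
```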
Which item was ordered the most?

    # Total quantity ordered for each item
    item_all_quantity = chipo.groupby('item_name').agg({'quantity': 'sum'})['quantity']
    item_all_quantity[:5]

item_name
6 Pack Soft Drink 55
Barbacoa Bowl 66
Barbacoa Burrito 91
Barbacoa Crispy Tacos 12
Barbacoa Salad Bowl 10
Name: quantity, dtype: int64

    # Keep only the item(s) whose total equals the maximum
    item_all_quantity[item_all_quantity == item_all_quantity.max()]

item_name
Chicken Bowl 761
Name: quantity, dtype: int64
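
The boolean mask above returns every item tied for the maximum. When a single label is enough, `idxmax` is a shorter alternative, and `sort_values` gives the full ranking. A minimal sketch on made-up order lines:

```python
import pandas as pd

demo = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Side of Chips'],
    'quantity': [2, 1, 1, 1],
})

totals = demo.groupby('item_name')['quantity'].sum()

top_item = totals.idxmax()                     # label of the largest total
ranked = totals.sort_values(ascending=False)   # full ranking, largest first

print(top_item, int(ranked.iloc[0]))
```

Note that `idxmax` reports only the first label when several items share the maximum.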
How many distinct items were ordered, according to the item_name column?

    len(set(chipo['item_name']))

50

    len(np.unique(chipo['item_name']))

50

    len(chipo['item_name'].unique())

50
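
The three approaches above all build the collection of distinct values and then measure its length. pandas also offers `Series.nunique`, which returns the count directly. A small sketch on a hypothetical series:

```python
import pandas as pd

items = pd.Series(['Izze', 'Chicken Bowl', 'Izze', 'Side of Chips'], name='item_name')

n_items = items.nunique()   # number of distinct non-null values
print(n_items)
```

By default `nunique` ignores missing values, whereas `unique()` keeps NaN as one of its entries, so the two counts can differ by one on a column that contains NaN.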
How many items were ordered in total?
4972
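
The cell that produced this figure is missing from the source; presumably it sums the `quantity` column. A sketch of that computation on a hypothetical three-line frame:

```python
import pandas as pd

demo = pd.DataFrame({'order_id': [1, 1, 2], 'quantity': [1, 1, 2]})

# Each row is one order line; quantity says how many units that line covers.
total_items = int(demo['quantity'].sum())
print(total_items)
```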
Convert item_price to float

    # Inspect one raw value first: it is a string with a dollar sign and a trailing space
    exp_str = chipo['item_price'][0]
    exp_str

'$2.39 '

    def trans_float(str_):
        # Remove the dollar sign and surrounding whitespace, then convert to float
        return float(str_.replace('$', '').strip())

    chipo['item_price'] = chipo['item_price'].apply(trans_float)

| | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | 2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 |
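
The same conversion can be written without a Python-level function by using the vectorized string methods on the column. This is a sketch of that alternative on three sample strings in the format shown above; `prices` is a hypothetical stand-in for the column.

```python
import pandas as pd

prices = pd.Series(['$2.39 ', '$3.39 ', '$16.98 '])

# Remove the literal dollar sign, trim whitespace, then cast to float.
# regex=False makes '$' a plain character rather than a regex anchor.
as_float = prices.str.replace('$', '', regex=False).str.strip().astype(float)
print(as_float.tolist())
```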
What was the revenue over the period covered by the dataset?

    # Revenue of each order line, computed as quantity times item price
    chipo['revenue'] = chipo['quantity'] * chipo['item_price']

| | order_id | quantity | item_name | choice_description | item_price | revenue |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | 2.39 | 2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | 3.39 | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | 2.39 | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 | 33.96 |
39237.02
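
The cell that produced this total is not shown in the source; presumably it sums the `revenue` column, possibly with rounding. A sketch of that step on three made-up order lines that follow the same quantity-times-price definition:

```python
import pandas as pd

demo = pd.DataFrame({'quantity': [1, 1, 2], 'item_price': [2.39, 3.39, 16.98]})

# Per-line revenue, then the grand total rounded to cents.
demo['revenue'] = demo['quantity'] * demo['item_price']
total_revenue = round(demo['revenue'].sum(), 2)
print(total_revenue)
```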
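How many orders were placed over the period covered by the dataset?
1834

    # Each order_id appears once per order line, so count the distinct values
    len(chipo['order_id'].unique())
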
1834
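What is the average total price per order?

    # Note: 'mean' averages the revenue of the line items within each order;
    # it does not average the per-order totals.
    chipo.groupby('order_id').agg({'revenue': 'mean'})
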
| order_id | revenue |
|---|---|
| 1 | 2.890000 |
| 2 | 33.960000 |
| 3 | 6.335000 |
| 4 | 10.500000 |
| 5 | 6.850000 |
| ... | ... |
| 1830 | 11.500000 |
| 1831 | 4.300000 |
| 1832 | 6.600000 |
| 1833 | 11.750000 |
| 1834 | 9.583333 |

1834 rows × 1 columns
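
The table above lists, for each order, the mean revenue of its line items. To get a single figure for the average order total, one reading of the question is to sum the revenue within each order first and then average those sums. A minimal sketch on a hypothetical five-line frame:

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id': [1, 1, 2, 3, 3],
    'revenue': [2.39, 3.39, 33.96, 10.98, 1.69],
})

order_totals = demo.groupby('order_id')['revenue'].sum()   # one total per order
avg_order_total = order_totals.mean()                      # mean of those totals

# Equivalent shortcut: total revenue divided by the number of distinct orders.
shortcut = demo['revenue'].sum() / demo['order_id'].nunique()

print(round(avg_order_total, 2), round(shortcut, 2))
```

Applied to the figures reported earlier, the shortcut gives 39237.02 / 1834, or about 21.39 per order.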