
Python Data Science (5): Python Data Analysis Practice

Exploring the Iris Dataset

import pandas as pd
import numpy as np

Reading the dataset

iris = pd.read_csv('iris.csv', header=None)
iris.head()

Dataset download

0 1 2 3 4
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Renaming the DataFrame columns

iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris.head()
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Are there any missing values in the DataFrame?

iris.isna().sum().sum()
# The result is 0,
# which means the DataFrame contains no missing values
0
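
The chained `.sum().sum()` can obscure where the gaps are: the first `.sum()` produces per-column counts, and the second collapses them into one total. A minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for iris (hypothetical values)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

per_column = df.isna().sum()   # missing count per column
total = per_column.sum()       # grand total across all columns
```

Inspecting `per_column` first makes it easy to see which columns need cleaning before collapsing to a single number.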

Setting rows 10 to 19 of the petal_length column to missing values

# Use the default positional index
iris.iloc[9:19, 2] = np.nan
iris.iloc[9:19, 2]
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
Name: petal_length, dtype: float64
iris.isna().sum().sum()
10
# Select the same values with the label-based index
iris.loc[9:18, 'petal_length']
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
Name: petal_length, dtype: float64
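
Note why the two selections use different bounds (`9:19` with `iloc` vs `9:18` with `loc`): `iloc` excludes its end point, while `loc` includes it. A minimal sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

by_position = s.iloc[1:3]  # positional: end point excluded -> positions 1, 2
by_label = s.loc[1:3]      # label-based: end point included -> labels 1, 2, 3
```

Both selections cover the same rows here only because the bounds were chosen accordingly; mixing up the two conventions is a common off-by-one source.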

Replacing all missing petal_length values with 1.0

iris.fillna(value=1.0, inplace=True)
iris.iloc[9:19, 2]
9     1.0
10    1.0
11    1.0
12    1.0
13    1.0
14    1.0
15    1.0
16    1.0
17    1.0
18    1.0
Name: petal_length, dtype: float64
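
Calling `fillna` on the whole frame works here only because petal_length is the sole column with gaps; if other columns also had NaNs, they would be filled with 1.0 too. A safer per-column sketch (toy data, hypothetical values):

```python
import pandas as pd
import numpy as np

# Toy frame (hypothetical values); both columns contain a gap
df = pd.DataFrame({'petal_length': [1.4, np.nan, 1.3],
                   'petal_width': [0.2, np.nan, 0.2]})

# Fill only the intended column; other columns keep their NaNs
df['petal_length'] = df['petal_length'].fillna(1.0)
```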

Dropping the class column

iris.drop(['class'], axis=1, inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Setting the first three rows of the DataFrame to missing values

iris.loc[:2, :] = np.nan
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Dropping rows that contain missing values

iris.dropna(inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
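
By default `dropna` uses `how='any'`, dropping a row if any cell is missing; `how='all'` keeps partially filled rows and drops only rows that are entirely NaN. A quick sketch on toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 2.0, np.nan],
                   'b': [np.nan, 5.0, 6.0]})

dropped_any = df.dropna(how='any')  # default: drop rows with at least one NaN
dropped_all = df.dropna(how='all')  # drop only rows that are entirely NaN
```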

Resetting the index

# Call reset_index
# drop: whether to discard the original row index
# By default the original row index is kept as a new column,
# inserted as the first column of the DataFrame
iris.reset_index(drop=True, inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 4.6 3.1 1.5 0.2
1 5.0 3.6 1.4 0.2
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2
iris_new = iris.iloc[2:, :].copy()  # copy so the index can be reassigned without a SettingWithCopyWarning
iris_new.head()
iris_new.head()
sepal_length sepal_width petal_length petal_width
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2
5 4.4 2.9 1.4 0.2
6 4.9 3.1 1.0 0.1
# Reassign the DataFrame's index attribute directly
iris_new.index = range(len(iris_new))
iris_new.head()
sepal_length sepal_width petal_length petal_width
0 5.4 3.9 1.7 0.4
1 4.6 3.4 1.4 0.3
2 5.0 3.4 1.5 0.2
3 4.4 2.9 1.4 0.2
4 4.9 3.1 1.0 0.1
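
For producing a plain 0..n-1 index, the two approaches (`reset_index(drop=True)` and assigning to `.index`) are interchangeable; a sketch showing they yield identical frames:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=[5, 6, 7])

via_reset = df.reset_index(drop=True)   # approach 1: reset_index
via_assign = df.copy()
via_assign.index = range(len(via_assign))  # approach 2: assign the index attribute
```

`reset_index` is usually preferred because it can also preserve the old index as a column when `drop=False`.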

Exploring the Chipotle Fast-Food Dataset

Dataset download

Reading the data

chipo = pd.read_csv('chipotle.tsv', delimiter='\t')
chipo.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

Viewing the first 10 rows

chipo.head(10)
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $9.25

How many columns does the dataset have?

chipo.shape
(4622, 5)
# Use the DataFrame's shape attribute
index_num, columns_num = chipo.shape  # tuple unpacking
print(f'The chipo DataFrame has {index_num} rows and {columns_num} columns')
The chipo DataFrame has 4622 rows and 5 columns

Printing all column names

chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
# A second way to count the columns
len(chipo.columns)
5

What does the dataset's index look like?

chipo.index
RangeIndex(start=0, stop=4622, step=1)

Which item was ordered the most?

# Group and aggregate to get each item's total ordered quantity
item_all_quantity = chipo.groupby('item_name').agg({'quantity': 'sum'})['quantity']
item_all_quantity[:5]
item_name
6 Pack Soft Drink        55
Barbacoa Bowl            66
Barbacoa Burrito         91
Barbacoa Crispy Tacos    12
Barbacoa Salad Bowl      10
Name: quantity, dtype: int64
item_all_quantity[item_all_quantity == item_all_quantity.max()]
item_name
Chicken Bowl    761
Name: quantity, dtype: int64
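
Comparing against `.max()` returns every item tied for the maximum; when a single winner is enough, `idxmax` is more direct. A sketch on toy order lines (hypothetical rows):

```python
import pandas as pd

# Toy order lines (hypothetical rows)
demo = pd.DataFrame({'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Izze'],
                     'quantity': [2, 1, 3, 1]})

totals = demo.groupby('item_name')['quantity'].sum()
top_item = totals.idxmax()  # label of the single largest total
```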

How many distinct items appear in the item_name column?

# 1. Using a set
len(set(chipo['item_name']))
50
# 2. Using NumPy's unique function
len(np.unique(chipo['item_name']))
50
# 3. Using the pandas Series.unique method
len(chipo['item_name'].unique())
50
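
pandas also offers a direct counter, `Series.nunique`, which skips the intermediate array or set entirely; a sketch:

```python
import pandas as pd

s = pd.Series(['Izze', 'Chicken Bowl', 'Izze', 'Side of Chips'])
n = s.nunique()  # number of distinct values, computed directly
```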

How many items were ordered in total?

chipo['quantity'].sum()
4972

Converting item_price to floating-point numbers

exp_str = chipo['item_price'][0]
exp_str
'$2.39 '
def trans_float(str_):
    return float(str_.replace('$', '').strip())
# apply: runs the function on every element of the column
chipo['item_price'] = chipo['item_price'].apply(trans_float)
chipo.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39
1 1 1 Izze [Clementine] 3.39
2 1 1 Nantucket Nectar [Apple] 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98
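
The same conversion can be done without a Python-level function via pandas' vectorized string methods; a sketch on toy price strings in the same `'$x.yz '` format:

```python
import pandas as pd

# Toy price strings in the same '$x.yz ' format
prices = pd.Series(['$2.39 ', '$3.39 ', '$16.98 '])

as_float = prices.str.replace('$', '', regex=False).str.strip().astype(float)
```

The `.str` accessor keeps the whole pipeline in pandas, which is typically faster than `apply` on large columns.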

How much revenue was earned over the period covered by the dataset?

chipo['revenue'] = chipo['quantity'] * chipo['item_price']
chipo.head()
order_id quantity item_name choice_description item_price revenue
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39 2.39
1 1 1 Izze [Clementine] 3.39 3.39
2 1 1 Nantucket Nectar [Apple] 3.39 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98 33.96
chipo['revenue'].sum()
39237.02

How many orders were placed over the period covered by the dataset?

chipo['order_id'].max()
1834
len(chipo['order_id'].unique())
1834

What is the average total price per order?

# Note: this gives the mean revenue per line item within each order,
# not the total price of each order
chipo.groupby('order_id').agg({'revenue': 'mean'})
revenue
order_id
1 2.890000
2 33.960000
3 6.335000
4 10.500000
5 6.850000
... ...
1830 11.500000
1831 4.300000
1832 6.600000
1833 11.750000
1834 9.583333

1834 rows × 1 columns
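
The aggregation above averages line revenues within each order; to answer the question as asked, sum each order's revenue first and then take the mean of those totals. A sketch on toy order lines (hypothetical rows):

```python
import pandas as pd

# Toy order lines (hypothetical): order 1 has two items, order 2 has one
demo = pd.DataFrame({'order_id': [1, 1, 2],
                     'revenue': [2.39, 3.39, 33.96]})

per_order_total = demo.groupby('order_id')['revenue'].sum()  # total price of each order
avg_order_total = per_order_total.mean()                     # mean of the order totals
```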
