[Pandas] 파이썬으로 데이터 주무르기
Written by
metterian
on May 10, 2021
파이썬으로 데이터 주무르기
pandas를 활용해서 데이터프레임을 다뤄봅시다.
Pandas 시작하기
prerequisite : Table
pandas import하기
Pandas로 1차원 데이터 다루기 - Series
Series 선언하기
Series vs ndarray
Series vs dict
Series에 이름 붙이기
Pandas로 2차원 데이터 다루기 - dataframe
dataframe 선언하기
from csv to dataframe
dataframe 자료 접근하기
수업에 사용된 covid 데이터
! kaggle datasets download - d imdevskp / corona - virus - report
I. pandas 시작하기
II. pandas로 1차원 데이터 다루기 - Series
Seires?
1-D labeld array
인덱스를 지정해줄 수 있음
s = pd . Series ([ 1 , 4 , 9 , 16 , 25 ])
s
0 1
1 4
2 9
3 16
4 25
dtype: int64
t = pd . Series ({ 'one' : 1 , 'two' : 2 , 'three' : 3 })
t
one 1
two 2
three 3
dtype: int64
Series + Numpy
s [ s > s . median ()] # 자기 자신의 중앙갑보다 큰값들을 가지고 와라
s [[ 3 , 1 , 4 ]] # 해당 인덱스의 순서 대로 Slicing이 가능하다.
3 16
1 4
4 25
dtype: int64
import numpy as np
np . exp ( s )
0 2.718282e+00
1 5.459815e+01
2 8.103084e+03
3 8.886111e+06
4 7.200490e+10
dtype: float64
Series + dict
one 1
two 2
three 3
dtype: int64
one 1
two 2
three 3
four 4
dtype: int64
t . get ( 'seven' , 0 ) # get() 함수는 값을 가져올 때, 만약 값이 없으면 예외 처리가 가능
Seires 이름 붙히기
name
속성을 가지고 있다.
처음 Series를 만들때 붙일 수 있다.
s = pd . Series ( np . random . randn ( 5 ), name = 'random_nums' )
s
0 -0.269176
1 -0.251213
2 1.486398
3 0.125024
4 -0.445661
Name: random_nums, dtype: float64
0 -0.269176
1 -0.251213
2 1.486398
3 0.125024
4 -0.445661
Name: 임의의 난수, dtype: float64
III. Pandas로 2차원 데이터 다루기 - dataframe
dataframe?
2-D labeled table
인덱스를 지정할 수 있음
d = { 'height' : [ 1 , 2 , 3 , 4 ], 'weight' : [ 30 , 40 , 50 , 60 ]}
df = pd . DataFrame ( d )
df
height
weight
0
1
30
1
2
40
2
3
50
3
4
60
df . dtypes # 복수형인 이유는 column 마다 dtype이 다를 수 있기 때문에
height int64
weight int64
dtype: object
From CSV to dataframe
comma Sperated value를 Dataframe으로 생성해줄 수 있다.
read.csv()
를 사용
covid = pd . read_csv ( "./data/country_wise_latest.csv" )
covid . head ()
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
0
Afghanistan
36263
1269
25198
9796
106
10
18
3.50
69.49
5.04
35526
737
2.07
Eastern Mediterranean
1
Albania
4880
144
2745
1991
117
6
63
2.95
56.25
5.25
4171
709
17.00
Europe
2
Algeria
27973
1163
18837
7973
616
8
749
4.16
67.34
6.17
23691
4282
18.07
Africa
3
Andorra
907
52
803
52
10
0
0
5.73
88.53
6.48
884
23
2.60
Europe
4
Angola
950
41
242
667
18
1
0
4.32
25.47
16.94
749
201
26.84
Africa
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
182
West Bank and Gaza
10621
78
3752
6791
152
2
0
0.73
35.33
2.08
8916
1705
19.12
Eastern Mediterranean
183
Western Sahara
10
1
8
1
0
0
0
10.00
80.00
12.50
10
0
0.00
Africa
184
Yemen
1691
483
833
375
10
4
36
28.56
49.26
57.98
1619
72
4.45
Eastern Mediterranean
185
Zambia
4552
140
2815
1597
71
1
465
3.08
61.84
4.97
3326
1226
36.86
Africa
186
Zimbabwe
2704
36
542
2126
192
2
24
1.33
20.04
6.64
1713
991
57.85
Africa
Pandas 활용 2.데이터 접근하기
df['column_name']
or df.column_name
Active
0
9796
1
1991
2
7973
3
52
4
667
...
...
182
6791
183
1
184
375
185
1597
186
2126
187 rows × 1 columns
0 9796
1 1991
2 7973
3 52
4 667
...
182 6791
183 1
184 375
185 1597
186 2126
Name: Active, Length: 187, dtype: int64
Tip: Dataframe의 각 column은 “Series” 다!
1 4880
2 27973
3 907
4 950
Name: Confirmed, dtype: int64
Pandas 활용 3. “조건”을 이용해서 데이터 접근하기
# 신규 확진자가 100명이 넘는 나라를 찾아 보자!
covid [ covid [ 'New cases' ] > 100 ]. head ()
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
0
Afghanistan
36263
1269
25198
9796
106
10
18
3.50
69.49
5.04
35526
737
2.07
Eastern Mediterranean
1
Albania
4880
144
2745
1991
117
6
63
2.95
56.25
5.25
4171
709
17.00
Europe
2
Algeria
27973
1163
18837
7973
616
8
749
4.16
67.34
6.17
23691
4282
18.07
Africa
6
Argentina
167416
3059
72575
91782
4890
120
2057
1.83
43.35
4.21
130774
36642
28.02
Americas
8
Australia
15303
167
9311
5825
368
6
137
1.09
60.84
1.79
12428
2875
23.13
Western Pacific
# WHO 지역이 동남아인 나라 찾기
covid [ 'WHO Region' ]. unique ()
array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
'Western Pacific', 'South-East Asia'], dtype=object)
covid [ covid [ 'WHO Region' ] == 'South-East Asia' ]. head ()
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
13
Bangladesh
226225
2965
125683
97577
2772
37
1801
1.31
55.56
2.36
207453
18772
9.05
South-East Asia
19
Bhutan
99
0
86
13
4
0
1
0.00
86.87
0.00
90
9
10.00
South-East Asia
27
Burma
350
6
292
52
0
0
2
1.71
83.43
2.05
341
9
2.64
South-East Asia
79
India
1480073
33408
951166
495499
44457
637
33598
2.26
64.26
3.51
1155338
324735
28.11
South-East Asia
80
Indonesia
100303
4838
58173
37292
1525
57
1518
4.82
58.00
8.32
88214
12089
13.70
South-East Asia
Pandas 활용 4. 행을 기준으로 데이터 접근하기
# 예시 데이터 - 도서관 정보
book_dict = { "Available" : [ True , True , False ],
"Location" : [ 102 , 215 , 323 ],
"Genre" : [ "Programming" , "Physics" , "Math" ]}
book_df = pd . DataFrame ( book_dict ,
index = [ '버그란 무엇인가' , ' 두근두근 물리학' , ' 미분해줘 홈즈' ])
book_df
Available
Location
Genre
버그란 무엇인가
True
102
Programming
두근두근 물리학
True
215
Physics
미분해줘 홈즈
False
323
Math
인덱스를 이용해서 가져오기
Available True
Location 102
Genre Programming
Name: 버그란 무엇인가, dtype: object
# "미분해줘 홈즈 책이 대출 가능한지 ?"
book_df . loc [ "버그란 무엇인가" , 'Available' ]
숫자 인덱스를 이용해서 가져오기
Available False
Location 323
Genre Math
Name: 미분해줘 홈즈, dtype: object
# 2행의 1열 데이터 가져오기
book_df . iloc [ 2 , 1 ]
# 인덱스 1행의 2~3행 데이터 가져오기
book_df . iloc [ 1 , 1 : 3 ]
Location 215
Genre Physics
Name: 두근두근 물리학, dtype: object
Pandas 활용 5.groupby
split : 특정한 “기준”을 바타으로 Dataframe을 분할
apply : 통계함수 -sum(), mean(), median(), -을 적용해서 각 데이터를 압축
combine : Apply된 결과를 바탕으로 새로운 Series생성 (group_key: applied_value)
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
0
Afghanistan
36263
1269
25198
9796
106
10
18
3.50
69.49
5.04
35526
737
2.07
Eastern Mediterranean
1
Albania
4880
144
2745
1991
117
6
63
2.95
56.25
5.25
4171
709
17.00
Europe
2
Algeria
27973
1163
18837
7973
616
8
749
4.16
67.34
6.17
23691
4282
18.07
Africa
3
Andorra
907
52
803
52
10
0
0
5.73
88.53
6.48
884
23
2.60
Europe
4
Angola
950
41
242
667
18
1
0
4.32
25.47
16.94
749
201
26.84
Africa
# WHO Region 별 확진자 수
# 1. Covid에서 확진자 수 column만 추출한다
# 2. 이를 covid의 WHO Region을 기준으로 groupby 한다.
covid_by_region = covid [ 'Confirmed' ]. groupby ( by = covid [ 'WHO Region' ])
covid_by_region
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11fa13190>
WHO Region
Africa 723207
Americas 8839286
Eastern Mediterranean 1490744
Europe 3299523
South-East Asia 1835297
Western Pacific 292428
Name: Confirmed, dtype: int64
# 국가당 감염자 수
covid_by_region . mean ()
WHO Region
Africa 15066.812500
Americas 252551.028571
Eastern Mediterranean 67761.090909
Europe 58920.053571
South-East Asia 183529.700000
Western Pacific 18276.750000
Name: Confirmed, dtype: float64
Mission:
1. covid 데이터에서 100 case 대비 사망률(Deaths / 100 Cases
)이 가장 높은 국가는?
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
0
Afghanistan
36263
1269
25198
9796
106
10
18
3.50
69.49
5.04
35526
737
2.07
Eastern Mediterranean
1
Albania
4880
144
2745
1991
117
6
63
2.95
56.25
5.25
4171
709
17.00
Europe
2
Algeria
27973
1163
18837
7973
616
8
749
4.16
67.34
6.17
23691
4282
18.07
Africa
covid . sort_values ( by = 'Recovered / 100 Cases' , ascending = True ). head ( 3 )
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
32
Canada
116458
8944
0
107514
682
11
0
7.68
0.0
inf
112925
3533
3.13
Americas
147
Serbia
24141
543
0
23598
411
9
0
2.25
0.0
inf
21253
2888
13.59
Europe
161
Sweden
79395
5700
0
73695
398
3
0
7.18
0.0
inf
78048
1347
1.73
Europe
2. covid 데이터에서 신규 확진자가 없는 나라 중 WHO Region이 ‘Europe’를 모두 출력하면?
Hint : 한 줄에 동시에 두가지 조건을 Apply하는 경우 Warning이 발생할 수 있습니다.
europe = covid [ covid [ 'WHO Region' ] == 'Europe' ]
europe [ europe [ 'New cases' ] == 0 ]
Country/Region
Confirmed
Deaths
Recovered
Active
New cases
New deaths
New recovered
Deaths / 100 Cases
Recovered / 100 Cases
Deaths / 100 Recovered
Confirmed last week
1 week change
1 week % increase
WHO Region
56
Estonia
2034
69
1923
42
0
0
1
3.39
94.54
3.59
2021
13
0.64
Europe
75
Holy See
12
0
12
0
0
0
0
0.00
100.00
0.00
12
0
0.00
Europe
95
Latvia
1219
31
1045
143
0
0
0
2.54
85.73
2.97
1192
27
2.27
Europe
100
Liechtenstein
86
1
81
4
0
0
0
1.16
94.19
1.23
86
0
0.00
Europe
113
Monaco
116
4
104
8
0
0
0
3.45
89.66
3.85
109
7
6.42
Europe
143
San Marino
699
42
657
0
0
0
0
6.01
93.99
6.39
699
0
0.00
Europe
157
Spain
272421
28432
150376
93613
0
0
0
10.44
55.20
18.91
264836
7585
2.86
Europe
3. 다음 데이터 를 이용해 각 Region별로 아보카도가 가장 비싼 평균가격(AveragePrice)을 출력하면?
! kaggle datasets download - d neuromusic / avocado - prices
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /Users/seungjun/.kaggle/kaggle.json'
Downloading avocado-prices.zip to /Users/seungjun/Study/promgrammers/week3/day4
0%| | 0.00/629k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 629k/629k [00:00<00:00, 10.7MB/s]
! unzip avocado - prices . zip - d . / avocado /
Archive: avocado-prices.zip
inflating: ./avocado/avocado.csv
avocado = pd . read_csv ( 'avocado/avocado.csv' )
avocado . head ()
Unnamed: 0
Date
AveragePrice
Total Volume
4046
4225
4770
Total Bags
Small Bags
Large Bags
XLarge Bags
type
year
region
0
0
2015-12-27
1.33
64236.62
1036.74
54454.85
48.16
8696.87
8603.62
93.25
0.0
conventional
2015
Albany
1
1
2015-12-20
1.35
54876.98
674.28
44638.81
58.33
9505.56
9408.07
97.49
0.0
conventional
2015
Albany
2
2
2015-12-13
0.93
118220.22
794.70
109149.67
130.50
8145.35
8042.21
103.14
0.0
conventional
2015
Albany
3
3
2015-12-06
1.08
78992.15
1132.00
71976.41
72.58
5811.16
5677.40
133.76
0.0
conventional
2015
Albany
4
4
2015-11-29
1.28
51039.60
941.48
43838.39
75.78
6183.95
5986.26
197.69
0.0
conventional
2015
Albany
avocado . sort_values ( by = 'AveragePrice' , ascending = False ). head ( 3 )
Unnamed: 0
Date
AveragePrice
Total Volume
4046
4225
4770
Total Bags
Small Bags
Large Bags
XLarge Bags
type
year
region
14125
8
2016-10-30
3.25
16700.94
2325.93
11142.85
0.0
3232.16
3232.16
0.00
0.0
organic
2016
SanFrancisco
17428
37
2017-04-16
3.17
3018.56
1255.55
82.31
0.0
1680.70
1542.22
138.48
0.0
organic
2017
Tampa
14124
7
2016-11-06
3.12
19043.80
5898.49
10039.34
0.0
3105.97
3079.30
26.67
0.0
organic
2016
SanFrancisco