[Pandas] 파이썬으로 데이터 주무르기

파이썬으로 데이터 주무르기

pandas를 활용해서 데이터프레임을 다뤄봅시다.

  1. Pandas 시작하기
    • prerequisite : Table
    • pandas import하기
  2. Pandas로 1차원 데이터 다루기 - Series
    • Series 선언하기
    • Series vs ndarray
    • Series vs dict
    • Series에 이름 붙이기
  3. Pandas로 2차원 데이터 다루기 - dataframe
    • dataframe 선언하기
    • from csv to dataframe
    • dataframe 자료 접근하기

수업에 사용된 covid 데이터

!kaggle datasets download -d imdevskp/corona-virus-report

I. pandas 시작하기

import pandas as pd

II. pandas로 1차원 데이터 다루기 - Series

Seires?

s = pd.Series([1,4,9,16,25])
s
0     1
1     4
2     9
3    16
4    25
dtype: int64
t = pd.Series({'one': 1, 'two': 2, 'three':3})
t
one      1
two      2
three    3
dtype: int64

Series + Numpy

s[1]
4
t[:2]
one    1
two    2
dtype: int64
s[s > s.median()] # 자기 자신의 중앙갑보다 큰값들을 가지고 와라
3    16
4    25
dtype: int64
s[[3,1,4]] # 해당 인덱스의 순서 대로 Slicing이 가능하다.
3    16
1     4
4    25
dtype: int64
import numpy as np
np.exp(s)
0    2.718282e+00
1    5.459815e+01
2    8.103084e+03
3    8.886111e+06
4    7.200490e+10
dtype: float64
s.dtype
dtype('int64')

Series + dict

t
one      1
two      2
three    3
dtype: int64
t['one']
1
t['four'] = 4
t
one      1
two      2
three    3
four     4
dtype: int64
'four' in t
True
'six' in t
False
t.get('seven', 0) # get() 함수는 값을 가져올 때, 만약 값이 없으면 예외 처리가 가능
0

Seires 이름 붙히기

s = pd.Series(np.random.randn(5), name='random_nums')
s
0   -0.269176
1   -0.251213
2    1.486398
3    0.125024
4   -0.445661
Name: random_nums, dtype: float64
s.name = '임의의 난수'
s
0   -0.269176
1   -0.251213
2    1.486398
3    0.125024
4   -0.445661
Name: 임의의 난수, dtype: float64

III. Pandas로 2차원 데이터 다루기 - dataframe

dataframe?

d = {'height': [1,2,3,4], 'weight': [30,40,50,60]}
df = pd.DataFrame(d)
df
height weight
0 1 30
1 2 40
2 3 50
3 4 60
df.dtypes # 복수형인 이유는 column 마다 dtype이 다를 수 있기 때문에
height    int64
weight    int64
dtype: object

From CSV to dataframe

covid = pd.read_csv("./data/country_wise_latest.csv")
covid.head()
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
0 Afghanistan 36263 1269 25198 9796 106 10 18 3.50 69.49 5.04 35526 737 2.07 Eastern Mediterranean
1 Albania 4880 144 2745 1991 117 6 63 2.95 56.25 5.25 4171 709 17.00 Europe
2 Algeria 27973 1163 18837 7973 616 8 749 4.16 67.34 6.17 23691 4282 18.07 Africa
3 Andorra 907 52 803 52 10 0 0 5.73 88.53 6.48 884 23 2.60 Europe
4 Angola 950 41 242 667 18 1 0 4.32 25.47 16.94 749 201 26.84 Africa
covid.tail()
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
182 West Bank and Gaza 10621 78 3752 6791 152 2 0 0.73 35.33 2.08 8916 1705 19.12 Eastern Mediterranean
183 Western Sahara 10 1 8 1 0 0 0 10.00 80.00 12.50 10 0 0.00 Africa
184 Yemen 1691 483 833 375 10 4 36 28.56 49.26 57.98 1619 72 4.45 Eastern Mediterranean
185 Zambia 4552 140 2815 1597 71 1 465 3.08 61.84 4.97 3326 1226 36.86 Africa
186 Zimbabwe 2704 36 542 2126 192 2 24 1.33 20.04 6.64 1713 991 57.85 Africa

Pandas 활용 2.데이터 접근하기

covid[['Active']]
Active
0 9796
1 1991
2 7973
3 52
4 667
... ...
182 6791
183 1
184 375
185 1597
186 2126

187 rows × 1 columns

covid.Active
0      9796
1      1991
2      7973
3        52
4       667
       ...
182    6791
183       1
184     375
185    1597
186    2126
Name: Active, Length: 187, dtype: int64

Tip: Dataframe의 각 column은 “Series” 다!

covid['Confirmed'][1:5]
1     4880
2    27973
3      907
4      950
Name: Confirmed, dtype: int64

Pandas 활용 3. “조건”을 이용해서 데이터 접근하기

# 신규 확진자가 100명이 넘는 나라를 찾아 보자!
covid[covid['New cases'] > 100].head()
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
0 Afghanistan 36263 1269 25198 9796 106 10 18 3.50 69.49 5.04 35526 737 2.07 Eastern Mediterranean
1 Albania 4880 144 2745 1991 117 6 63 2.95 56.25 5.25 4171 709 17.00 Europe
2 Algeria 27973 1163 18837 7973 616 8 749 4.16 67.34 6.17 23691 4282 18.07 Africa
6 Argentina 167416 3059 72575 91782 4890 120 2057 1.83 43.35 4.21 130774 36642 28.02 Americas
8 Australia 15303 167 9311 5825 368 6 137 1.09 60.84 1.79 12428 2875 23.13 Western Pacific
# WHO 지역이 동남아인 나라 찾기
covid['WHO Region'].unique()
array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
       'Western Pacific', 'South-East Asia'], dtype=object)
covid[covid['WHO Region']=='South-East Asia'].head()
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
13 Bangladesh 226225 2965 125683 97577 2772 37 1801 1.31 55.56 2.36 207453 18772 9.05 South-East Asia
19 Bhutan 99 0 86 13 4 0 1 0.00 86.87 0.00 90 9 10.00 South-East Asia
27 Burma 350 6 292 52 0 0 2 1.71 83.43 2.05 341 9 2.64 South-East Asia
79 India 1480073 33408 951166 495499 44457 637 33598 2.26 64.26 3.51 1155338 324735 28.11 South-East Asia
80 Indonesia 100303 4838 58173 37292 1525 57 1518 4.82 58.00 8.32 88214 12089 13.70 South-East Asia

Pandas 활용 4. 행을 기준으로 데이터 접근하기

# 예시 데이터 - 도서관 정보
book_dict = {"Available": [True, True, False],
             "Location": [102, 215, 323],
             "Genre": ["Programming", "Physics", "Math"]}
book_df = pd.DataFrame(book_dict,
                      index=['버그란 무엇인가',' 두근두근 물리학',' 미분해줘 홈즈'])

book_df
Available Location Genre
버그란 무엇인가 True 102 Programming
두근두근 물리학 True 215 Physics
미분해줘 홈즈 False 323 Math

인덱스를 이용해서 가져오기

book_df.loc["버그란 무엇인가"]
Available           True
Location             102
Genre        Programming
Name: 버그란 무엇인가, dtype: object
# "미분해줘 홈즈 책이 대출 가능한지 ?"
book_df.loc["버그란 무엇인가", 'Available']
True

숫자 인덱스를 이용해서 가져오기

book_df.iloc[2]
Available    False
Location       323
Genre         Math
Name:  미분해줘 홈즈, dtype: object
# 2행의 1열 데이터 가져오기
book_df.iloc[2,1]
323
# 인덱스 1행의 2~3행 데이터 가져오기
book_df.iloc[1, 1:3]
Location        215
Genre       Physics
Name:  두근두근 물리학, dtype: object

Pandas 활용 5.groupby

covid.head()
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
0 Afghanistan 36263 1269 25198 9796 106 10 18 3.50 69.49 5.04 35526 737 2.07 Eastern Mediterranean
1 Albania 4880 144 2745 1991 117 6 63 2.95 56.25 5.25 4171 709 17.00 Europe
2 Algeria 27973 1163 18837 7973 616 8 749 4.16 67.34 6.17 23691 4282 18.07 Africa
3 Andorra 907 52 803 52 10 0 0 5.73 88.53 6.48 884 23 2.60 Europe
4 Angola 950 41 242 667 18 1 0 4.32 25.47 16.94 749 201 26.84 Africa
# WHO Region 별 확진자 수

# 1. Covid에서 확진자 수 column만 추출한다
# 2. 이를 covid의 WHO Region을 기준으로 groupby 한다.

covid_by_region = covid['Confirmed'].groupby(by=covid['WHO Region'])
covid_by_region
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11fa13190>
covid_by_region.sum()
WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64
# 국가당 감염자 수
covid_by_region.mean()
WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64

Mission:

1. covid 데이터에서 100 case 대비 사망률(Deaths / 100 Cases)이 가장 높은 국가는?

covid.head(3)
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
0 Afghanistan 36263 1269 25198 9796 106 10 18 3.50 69.49 5.04 35526 737 2.07 Eastern Mediterranean
1 Albania 4880 144 2745 1991 117 6 63 2.95 56.25 5.25 4171 709 17.00 Europe
2 Algeria 27973 1163 18837 7973 616 8 749 4.16 67.34 6.17 23691 4282 18.07 Africa
covid.sort_values(by='Recovered / 100 Cases',ascending=True).head(3)
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
32 Canada 116458 8944 0 107514 682 11 0 7.68 0.0 inf 112925 3533 3.13 Americas
147 Serbia 24141 543 0 23598 411 9 0 2.25 0.0 inf 21253 2888 13.59 Europe
161 Sweden 79395 5700 0 73695 398 3 0 7.18 0.0 inf 78048 1347 1.73 Europe

2. covid 데이터에서 신규 확진자가 없는 나라 중 WHO Region이 ‘Europe’를 모두 출력하면?

Hint : 한 줄에 동시에 두가지 조건을 Apply하는 경우 Warning이 발생할 수 있습니다.

europe = covid[covid['WHO Region']=='Europe']
europe[europe['New cases']==0]
Country/Region Confirmed Deaths Recovered Active New cases New deaths New recovered Deaths / 100 Cases Recovered / 100 Cases Deaths / 100 Recovered Confirmed last week 1 week change 1 week % increase WHO Region
56 Estonia 2034 69 1923 42 0 0 1 3.39 94.54 3.59 2021 13 0.64 Europe
75 Holy See 12 0 12 0 0 0 0 0.00 100.00 0.00 12 0 0.00 Europe
95 Latvia 1219 31 1045 143 0 0 0 2.54 85.73 2.97 1192 27 2.27 Europe
100 Liechtenstein 86 1 81 4 0 0 0 1.16 94.19 1.23 86 0 0.00 Europe
113 Monaco 116 4 104 8 0 0 0 3.45 89.66 3.85 109 7 6.42 Europe
143 San Marino 699 42 657 0 0 0 0 6.01 93.99 6.39 699 0 0.00 Europe
157 Spain 272421 28432 150376 93613 0 0 0 10.44 55.20 18.91 264836 7585 2.86 Europe

3. 다음 데이터를 이용해 각 Region별로 아보카도가 가장 비싼 평균가격(AveragePrice)을 출력하면?

!kaggle datasets download -d neuromusic/avocado-prices
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /Users/seungjun/.kaggle/kaggle.json'
Downloading avocado-prices.zip to /Users/seungjun/Study/promgrammers/week3/day4
  0%|                                                | 0.00/629k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 629k/629k [00:00<00:00, 10.7MB/s]
!unzip avocado-prices.zip -d ./avocado/
Archive:  avocado-prices.zip
  inflating: ./avocado/avocado.csv
avocado = pd.read_csv('avocado/avocado.csv')
avocado.head()
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87 8603.62 93.25 0.0 conventional 2015 Albany
1 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56 9408.07 97.49 0.0 conventional 2015 Albany
2 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35 8042.21 103.14 0.0 conventional 2015 Albany
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16 5677.40 133.76 0.0 conventional 2015 Albany
4 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95 5986.26 197.69 0.0 conventional 2015 Albany
avocado.sort_values(by='AveragePrice', ascending=False).head(3)
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
14125 8 2016-10-30 3.25 16700.94 2325.93 11142.85 0.0 3232.16 3232.16 0.00 0.0 organic 2016 SanFrancisco
17428 37 2017-04-16 3.17 3018.56 1255.55 82.31 0.0 1680.70 1542.22 138.48 0.0 organic 2017 Tampa
14124 7 2016-11-06 3.12 19043.80 5898.49 10039.34 0.0 3105.97 3079.30 26.67 0.0 organic 2016 SanFrancisco