로운's Tech Notes

[Syntax] Pandas2 (10 minutes to pandas)



로운's 2024. 5. 14. 00:03

ㅇ Usage

import numpy as np
import pandas as pd

ㅇ Data types provided by pandas

- Series : one-dimensional data (can hold most types: integers, strings, Python objects, etc.)

- DataFrame : two-dimensional tabular data made up of rows and columns

 

* 0-D = scalar = a single value
  1-D = vector = a list
  2-D = matrix = a doubly nested list
  3-D or higher = tensor = a list nested three or more levels deep
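
The dimension terms above can be checked with NumPy's ndim attribute. This is only a small sketch; the variable names are made up for illustration.

```python
import numpy as np

# Illustrative values only; the variable names are hypothetical.
scalar = np.array(7)                          # 0-D: a single value
vector = np.array([1, 2, 3])                  # 1-D: a flat list
matrix = np.array([[1, 2], [3, 4]])           # 2-D: a list of lists
tensor = np.array([[[1], [2]], [[3], [4]]])   # 3-D: lists nested three deep

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    # ndim is the number of dimensions, shape is the size along each one
    print(name, arr.ndim, arr.shape)
```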

# Series
>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>> s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

# DataFrame
>> df2 = pd.DataFrame(
        {
            "A": 1.0,
            "B": pd.Timestamp("20130102"),
            "C": pd.Series(1, index=list(range(4)), dtype="float32"),
            "D": np.array([3] * 4, dtype="int32"),
            "E": pd.Categorical(["test", "train", "test", "train"]),
            "F": "foo",
        }
    )

>> df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

ㅇ Object creation

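1. A Series can be created by passing a list of values.

    Note) Unless you specify an index, pandas fills one in automatically with a RangeIndex.

             * Default : RangeIndex(start=0, stop=n, step=1), i.e. the labels 0, 1, ..., n-1

             * RangeIndex : the Index counterpart of Python's range; it stores only start, stop, and step instead of every label, which saves memory

    - Reference : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html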
     https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

>> s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=[3, 4, 5, 6, 7, 8])
>> s2
3    1.0
4    3.0
5    5.0
6    NaN
7    6.0
8    8.0
dtype: float64
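
To see the difference between the automatic index and an explicit one, here is a minimal sketch. The names default_s and custom_s are hypothetical and not part of the original notes.

```python
import numpy as np
import pandas as pd

values = [1, 3, 5, np.nan, 6, 8]

default_s = pd.Series(values)                            # no index given
custom_s = pd.Series(values, index=[3, 4, 5, 6, 7, 8])   # explicit labels

print(default_s.index)   # a RangeIndex running from 0 up to (not including) 6
print(custom_s.index)    # an ordinary integer Index holding 3..8

# Label-based lookup uses the labels, not the positions.
print(custom_s.loc[3])   # first element, because its label is 3
print(custom_s.iloc[0])  # the same element, addressed by position
```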

 

2. A DataFrame can be built from a NumPy array.

    Note) The date_range function generates a sequence of consecutive dates, which can then be used as the DataFrame's index.

>> dates = pd.date_range("20240513", periods=6)		# periods : number of consecutive dates to generate
>> dates
DatetimeIndex(['2024-05-13', '2024-05-14', '2024-05-15', '2024-05-16',
               '2024-05-17', '2024-05-18'],
              dtype='datetime64[ns]', freq='D')
              
>> np.random.randn(6, 4)		# np.random.randn(rows, cols) : returns an array of standard-normal random values with that shape
array([[ 0.9111312 , -0.02774801, -0.01656051,  1.40677869],
       [ 0.01461194,  1.73088185,  0.66534308,  0.23268069],
       [-0.14197384,  0.47384894,  1.53984158,  0.53825697],
       [ 0.35474407,  1.3404046 ,  0.36081095, -0.51907604],
       [-0.82193688,  0.49241182,  1.25183249, -0.05405669],
       [-1.23050981,  1.42889765,  0.74162111, -2.81065958]])

>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
>> df
                   A         B         C         D
2024-05-13  0.911131 -0.027748 -0.016561  1.406779
2024-05-14  0.014612  1.730882  0.665343  0.232681
2024-05-15 -0.141974  0.473849  1.539842  0.538257
2024-05-16  0.354744  1.340405  0.360811 -0.519076
2024-05-17 -0.821937  0.492412  1.251832 -0.054057
2024-05-18 -1.230510  1.428898  0.741621 -2.810660
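
Because np.random.randn draws new numbers on every call, the values change between runs. If reproducible output is wanted, one option (an assumption on my part, not something the original notes use) is a seeded generator from np.random.default_rng:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary
dates = pd.date_range("20240513", periods=6)

df_seeded = pd.DataFrame(
    rng.standard_normal((6, 4)),       # 6 rows x 4 columns of N(0, 1) draws
    index=dates,
    columns=list("ABCD"),
)
print(df_seeded.shape)
```

Running this twice with the same seed should give the same table, which makes it easier to compare results against written notes.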

 

3. A dictionary can be turned into a DataFrame. Each key becomes a column name and each value becomes that column's data.

    Note) Each column can hold a different data type.

 

>> df2 = pd.DataFrame(
        {
            "A": 1.0,
            "B": pd.Timestamp("20130102"),
            "C": pd.Series(1, index=list(range(4)), dtype="float32"),
            "D": np.array([3] * 4, dtype="int32"),
            "E": pd.Categorical(["test", "train", "test", "train"]),
            "F": "foo"
        }
    )    
    
>> df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

>> df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
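
In the dictionary above, scalar values such as 1.0 and "foo" are repeated to match the length of the other columns. A smaller sketch of the same behavior, with a hypothetical frame named df_small:

```python
import pandas as pd

# Scalars ("A" and "C") are broadcast to the length of the list in "B".
df_small = pd.DataFrame({"A": 1.0, "B": [10, 20, 30], "C": "foo"})
print(df_small)

# select_dtypes filters columns by type; here only the numeric ones remain.
numeric_only = df_small.select_dtypes(include="number")
print(list(numeric_only.columns))
```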

 

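Note) pd.Categorical reports how many categories exist, and it raises an error if you try to assign a value that is not one of the existing categories. A plain list, by contrast, accepts any value.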

>> cate = pd.Categorical(["test", "train", "test", "train"])
>> cate
['test', 'train', 'test', 'train']
Categories (2, object): ['test', 'train']

>> cate[0] = 'Test'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-469a102ee59a> in <cell line: 1>()
----> 1 cate[0] = 'Test'

/usr/local/lib/python3.10/dist-packages/pandas/core/arrays/_mixins.py in __setitem__(self, key, value)
    247     def __setitem__(self, key, value) -> None:
    248         key = check_array_indexer(self, key)
--> 249         value = self._validate_setitem_value(value)
    250         self._ndarray[key] = value
    251 

/usr/local/lib/python3.10/dist-packages/pandas/core/arrays/categorical.py in _validate_setitem_value(self, value)
   1295             return self._validate_listlike(value)
   1296         else:
-> 1297             return self._validate_scalar(value)
   1298 
   1299     def _validate_scalar(self, fill_value):

/usr/local/lib/python3.10/dist-packages/pandas/core/arrays/categorical.py in _validate_scalar(self, fill_value)
   1320             fill_value = self._unbox_scalar(fill_value)
   1321         else:
-> 1322             raise TypeError(
   1323                 "Cannot setitem on a Categorical with a new "
   1324                 f"category ({fill_value}), set the categories first"

TypeError: Cannot setitem on a Categorical with a new category (Test), set the categories first


>> cate_list = ["test", "train", "test", "train"]
>> cate_list
['test', 'train', 'test', 'train']

>> cate_list[1] = 'Test'
>> cate_list
['test', 'Test', 'test', 'train']
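
The error message says to "set the categories first". One way to do that, sketched here under the assumption that you really do want "Test" as a new category, is Categorical.add_categories:

```python
import pandas as pd

cate = pd.Categorical(["test", "train", "test", "train"])

# add_categories returns a new Categorical; the original is left unchanged,
# so the result is assigned back to the same name.
cate = cate.add_categories(["Test"])
cate[0] = "Test"              # allowed now that "Test" is a known category

print(list(cate))
print(list(cate.categories))
```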

ㅇ Viewing data

1. DataFrame.head(), DataFrame.tail() : view the first or last rows.

>> df.head(10)
		    A		    B		    C		    D
2024-05-13	1.845046	0.260883	0.410832	1.040944
2024-05-14	-0.061069	-0.259378	0.069027	1.239575
2024-05-15	-0.600211	-1.154017	0.890548	0.348339
2024-05-16	-1.469159	0.235009	-0.271018	1.404874
2024-05-17	0.222574	1.045386	0.162020	-0.675239
2024-05-18	-0.271808	0.900917	-1.394652	-0.257152

>> df.head()		# default : shows 5 rows
		    A		    B		    C		    D
2024-05-13	1.845046	0.260883	0.410832	1.040944
2024-05-14	-0.061069	-0.259378	0.069027	1.239575
2024-05-15	-0.600211	-1.154017	0.890548	0.348339
2024-05-16	-1.469159	0.235009	-0.271018	1.404874
2024-05-17	0.222574	1.045386	0.162020	-0.675239

>> df.tail()		# default : shows 5 rows
		    A		    B		    C		    D
2024-05-14	-0.061069	-0.259378	0.069027	1.239575
2024-05-15	-0.600211	-1.154017	0.890548	0.348339
2024-05-16	-1.469159	0.235009	-0.271018	1.404874
2024-05-17	0.222574	1.045386	0.162020	-0.675239
2024-05-18	-0.271808	0.900917	-1.394652	-0.257152
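
Both methods take the number of rows as an argument. A minimal sketch with a hypothetical frame df_demo, so the result is easy to verify by eye:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"x": range(6)},
    index=pd.date_range("20240513", periods=6),
)

first_two = df_demo.head(2)   # rows for 2024-05-13 and 2024-05-14
last_two = df_demo.tail(2)    # rows for 2024-05-17 and 2024-05-18
print(first_two)
print(last_two)
```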

 

2. DataFrame.index, DataFrame.columns : view the index and the column labels, respectively

>> df.index
DatetimeIndex(['2024-05-13', '2024-05-14', '2024-05-15', '2024-05-16',
               '2024-05-17', '2024-05-18'],
              dtype='datetime64[ns]', freq='D')

>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

 

3. DataFrame.to_numpy() : converts only the data to a NumPy array, leaving out the index and column labels

>> df.to_numpy()
array([[ 1.84504554,  0.26088255,  0.41083151,  1.04094389],
       [-0.06106884, -0.25937766,  0.06902654,  1.23957504],
       [-0.6002111 , -1.1540172 ,  0.89054796,  0.34833934],
       [-1.46915942,  0.23500853, -0.27101799,  1.40487427],
       [ 0.22257418,  1.04538586,  0.16201965, -0.67523941],
       [-0.27180771,  0.90091665, -1.39465245, -0.25715226]])
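
A NumPy array has a single dtype for the whole array, so the result depends on the column types. The sketch below uses two hypothetical frames to show the difference; when columns mix numbers and strings, the array falls back to the object dtype.

```python
import pandas as pd

num_df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed_df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

num_arr = num_df.to_numpy()      # all floats, so the array stays float64
mixed_arr = mixed_df.to_numpy()  # ints and strings share one object array

print(num_arr.dtype, num_arr.shape)
print(mixed_arr.dtype, mixed_arr.shape)
```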

 

4. describe() : shows summary statistics of the data

>> df.describe()
	    A		    B		    C		    D
count	6.000000	6.000000	6.000000	6.000000
mean	0.223903	-0.316182	0.082094	-0.165044
std	0.288232	1.503504	0.829479	0.712710
min	0.002321	-2.836017	-1.022879	-1.264085
25%	0.025479	-0.948033	-0.448293	-0.380779
50%	0.065442	0.109938	0.029118	-0.223977
75%	0.458920	0.586081	0.688016	0.154944
max	0.607974	1.274945	1.156988	0.873570

 

5. Transposing the data (df.T : swaps rows and columns)

>> df.T
	2024-05-13	2024-05-14	2024-05-15	2024-05-16	2024-05-17	2024-05-18
A	1.845046	-0.061069	-0.600211	-1.469159	0.222574	-0.271808
B	0.260883	-0.259378	-1.154017	0.235009	1.045386	0.900917
C	0.410832	0.069027	0.890548	-0.271018	0.162020	-1.394652
D	1.040944	1.239575	0.348339	1.404874	-0.675239	-0.257152

>> df
		    A		    B		    C		    D
2024-05-13	1.845046	0.260883	0.410832	1.040944
2024-05-14	-0.061069	-0.259378	0.069027	1.239575
2024-05-15	-0.600211	-1.154017	0.890548	0.348339
2024-05-16	-1.469159	0.235009	-0.271018	1.404874
2024-05-17	0.222574	1.045386	0.162020	-0.675239
2024-05-18	-0.271808	0.900917	-1.394652	-0.257152

 

6. DataFrame.sort_index() : sorts by the labels along an axis

  * axis = 0 : sort by the row index (default)
    axis = 1 : sort by the column labels

>> df
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786

>> df.sort_index(axis = 0, ascending = False)
		    A		    B		    C		    D
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-15	0.482039	2.128768	0.235904	-0.056823
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-13	0.335655	1.419948	-0.393526	-0.877976

>> df.sort_index(axis = 1, ascending = False)
		    D		    C		    B		    A
2024-05-13	-0.877976	-0.393526	1.419948	0.335655
2024-05-14	0.081000	0.736718	-0.642272	1.102611
2024-05-15	-0.056823	0.235904	2.128768	0.482039
2024-05-16	0.899178	-0.184130	-1.463138	-0.765209
2024-05-17	-1.827536	-1.025236	0.437651	-0.823268
2024-05-18	-2.232786	-0.256849	-1.092928	-0.414458

 

7. DataFrame.sort_values() : sorts by the values of a given column

>> df.sort_values(by="B")
		    A		    B		    C		    D
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-15	0.482039	2.128768	0.235904	-0.056823

>> df.sort_values(by="A")

		    A		    B		    C		    D
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-15	0.482039	2.128768	0.235904	-0.056823
2024-05-14	1.102611	-0.642272	0.736718	0.081000
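
sort_values also accepts ascending=False and a list of columns. A small sketch with a hypothetical scores frame (not data from these notes):

```python
import pandas as pd

scores = pd.DataFrame(
    {"team": ["b", "a", "b", "a"], "points": [3, 1, 2, 5]}
)

# Largest points first.
by_points_desc = scores.sort_values(by="points", ascending=False)

# Team A-Z first, then points from high to low within each team.
by_team_then_points = scores.sort_values(
    by=["team", "points"], ascending=[True, False]
)

print(by_points_desc)
print(by_team_then_points)
```

The original row labels travel with the rows, so the index shows how the rows were reordered.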

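ㅇ Selection  ★★★★★

- Note) Data can be selected with standard Python / NumPy expressions, but the pandas-optimized accessors below are recommended.

- DataFrame.at[], DataFrame.iat[], DataFrame.loc[], DataFrame.iloc[] (these are indexers used with square brackets, not called like functions)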

 

1. Getitem []
- Passing a column name to a DataFrame selects that column, which is returned as a Series.
- Equivalent to attribute access, df.A

>> df
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786

>> df['A']
2024-05-13    0.335655
2024-05-14    1.102611
2024-05-15    0.482039
2024-05-16   -0.765209
2024-05-17   -0.823268
2024-05-18   -0.414458
Freq: D, Name: A, dtype: float64

>> df.A		# not recommended (in particular, it cannot be used when the column name contains a space)
2024-05-13    0.335655
2024-05-14    1.102611
2024-05-15    0.482039
2024-05-16   -0.765209
2024-05-17   -0.823268
2024-05-18   -0.414458
Freq: D, Name: A, dtype: float64

 

- Rows of a DataFrame can be selected by slicing.

>> df[0:3]
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823

>> df["20240513":"20240515"]
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823
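
To illustrate why bracket access is the safer habit, here is a sketch with a hypothetical frame df_cols. A column name with a space cannot be written as an attribute, and a column named like an existing DataFrame method (count, in this example) is shadowed by that method.

```python
import pandas as pd

df_cols = pd.DataFrame(
    {"A": [1, 2], "unit price": [9.5, 7.0], "count": [3, 4]}
)

same = df_cols["A"].equals(df_cols.A)   # both forms return the same Series

prices = df_cols["unit price"]          # no attribute form exists for this name
counts = df_cols["count"]               # the column
count_attr = df_cols.count              # the DataFrame.count method, not the column

print(same, prices.tolist(), counts.tolist(), callable(count_attr))
```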

ㅇ Selection by label

- Use DataFrame.loc[] and DataFrame.at[] to select by label

- Selecting the row that matches a label

>> df.loc[dates[0]]		# the row whose label is dates[0], the first index entry
A    0.335655
B    1.419948
C   -0.393526
D   -0.877976
Name: 2024-05-13 00:00:00, dtype: float64

 

- Selecting all rows for the chosen columns

>> df.loc[:, ["A", "B"]]
		    A		    B
2024-05-13	0.335655	1.419948
2024-05-14	1.102611	-0.642272
2024-05-15	0.482039	2.128768
2024-05-16	-0.765209	-1.463138
2024-05-17	-0.823268	0.437651
2024-05-18	-0.414458	-1.092928

 

- Selecting with a label slice

>> df.loc["20240513":"20240515", ["A", "B"]]
		    A		    B
2024-05-13	0.335655	1.419948
2024-05-14	1.102611	-0.642272
2024-05-15	0.482039	2.128768

 

- Picking one row and one column returns a single value

>> df.loc[dates[0], 'A']
0.33565539836551156

>> df.loc['2024-05-13', 'A']
0.33565539836551156

 

- Getting a single value faster than with the method above

>> df.at[dates[0], 'A']
0.33565539836551156

>> df.at['2024-05-13', 'A']
0.33565539836551156

ㅇ Selection by position

 - Use DataFrame.iloc[] or DataFrame.iat[] to select data by position

- Selecting by an integer position

>> df.iloc[3]
A   -0.765209
B   -1.463138
C   -0.184130
D    0.899178
Name: 2024-05-16 00:00:00, dtype: float64

 

- Selecting with a position slice

>> df.iloc[3:5, 0:2]
		    A		    B
2024-05-16	-0.765209	-1.463138
2024-05-17	-0.823268	0.437651

>> df.iloc[3:5]
		    A		    B		    C		    D
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536

>> df[3:5]		# plain slicing with []
		    A		    B		    C		    D
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536

 

- Selecting with lists of integer positions

>> df.iloc[[1, 2, 4], [0, 2]]
		    A		    C
2024-05-14	1.102611	0.736718
2024-05-15	0.482039	0.235904
2024-05-17	-0.823268	-1.025236

 

- Slicing rows only (all columns kept)

>> df.iloc[1:3, :]
		    A		    B		    C		    D
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823

 

- Slicing columns only (all rows kept)

>> df.iloc[:, 1:3]
		    B		    C
2024-05-13	1.419948	-0.393526
2024-05-14	-0.642272	0.736718
2024-05-15	2.128768	0.235904
2024-05-16	-1.463138	-0.184130
2024-05-17	0.437651	-1.025236
2024-05-18	-1.092928	-0.256849

 

- Getting a single value

>> df.iloc[1, 1]
-0.6422716201341141

 

- Getting a single value faster than with the method above

>> df.iat[1, 1]
-0.6422716201341141
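
One detail worth checking when switching between the two styles: a label slice with loc includes both endpoints, while a position slice with iloc excludes the stop position, like ordinary Python slicing. A sketch with a hypothetical frame df_pos filled with 0..23 so the values are easy to follow:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20240513", periods=6)
df_pos = pd.DataFrame(
    np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD")
)

by_label = df_pos.loc["20240513":"20240515", ["A", "B"]]   # 3 rows: end label included
by_position = df_pos.iloc[0:2, 0:2]                        # 2 rows: stop position excluded

# at (label) and iat (position) address the same single cell here.
cell_by_label = df_pos.at[dates[1], "B"]
cell_by_position = df_pos.iat[1, 1]

print(by_label.shape, by_position.shape, cell_by_label, cell_by_position)
```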

ㅇ Boolean indexing  ★★★★★

- Selecting the rows where column A is greater than 0.1

>> df[df["A"] > 0.1]
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823

 

- Keeping only the DataFrame values that satisfy the condition (the rest become NaN)

>> df[df > 0]
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	NaN		NaN
2024-05-14	1.102611	NaN		0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	NaN
2024-05-16	NaN		NaN		NaN		0.899178
2024-05-17	NaN		0.437651	NaN		NaN
2024-05-18	NaN		NaN		NaN		NaN

>> df
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786
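
Several conditions can be combined with & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch with a hypothetical frame df_bool:

```python
import pandas as pd

df_bool = pd.DataFrame(
    {"A": [0.3, 1.1, -0.8, -0.4], "B": [1.4, -0.6, 0.4, -1.1]}
)

# Rows where both conditions hold.
both = df_bool[(df_bool["A"] > 0.1) & (df_bool["B"] > 0)]

# Rows where at least one condition holds.
either = df_bool[(df_bool["A"] > 0.1) | (df_bool["B"] > 0)]

print(both)
print(either)
```

Python's own and / or keywords do not work here, since they expect a single True or False rather than a whole Series.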

 

- Filtering is possible with isin()

>> df2 = df.copy()
>> df2
		    A		    B		    C		    D
2024-05-13	0.335655	1.419948	-0.393526	-0.877976
2024-05-14	1.102611	-0.642272	0.736718	0.081000
2024-05-15	0.482039	2.128768	0.235904	-0.056823
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786

>> df2["E"] = ["one", "one", "two", "three", "four", "three"]
>> df2
		    A		    B		    C		    D		 E
2024-05-13	0.335655	1.419948	-0.393526	-0.877976	one
2024-05-14	1.102611	-0.642272	0.736718	0.081000	one
2024-05-15	0.482039	2.128768	0.235904	-0.056823	two
2024-05-16	-0.765209	-1.463138	-0.184130	0.899178	three
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536	four
2024-05-18	-0.414458	-1.092928	-0.256849	-2.232786	three

>> df2['E'].isin(['two','four'])
2024-05-13    False
2024-05-14    False
2024-05-15     True
2024-05-16    False
2024-05-17     True
2024-05-18    False
Freq: D, Name: E, dtype: bool

>> df2[df2['E'].isin(['two','four'])]
		    A		    B		    C		    D		 E
2024-05-15	0.482039	2.128768	0.235904	-0.056823	two
2024-05-17	-0.823268	0.437651	-1.025236	-1.827536	four
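
The boolean mask from isin() can be inverted with ~ to keep everything except the listed values. A minimal sketch, using a hypothetical one-column frame df_isin with the same "E" values as above:

```python
import pandas as pd

df_isin = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

mask = df_isin["E"].isin(["two", "four"])

kept = df_isin[mask]       # rows whose E is "two" or "four"
dropped = df_isin[~mask]   # every other row

print(kept)
print(dropped)
```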

 

 

 

* Reference : official "10 minutes to pandas" guide, pandas 2.2.2 documentation (pydata.org)

 


 
