In R, missForest imputation takes a single line even when continuous and categorical variables are mixed, but in Python it is surprisingly fiddly.
import numpy as np
import pandas as pd
from missingpy import MissForest
RS = 100
First, load the required libraries. The imputation itself uses the missingpy package.
data = pd.read_csv('inhadr.csv', na_values=['-100', '-1000'])
del data["ID"] #patient id not needed
del data['AST'] # should be blinded for finding DILI patients
del data['ALT'] # should be blinded for finding DILI patients
data['SEX'] = data.SEX.astype('category')
data['GENOTYPE_RESULT'] = data.GENOTYPE_RESULT.astype('category')
Sex and genotype are the categorical variables.
%%time
# get indices of categorical features
cat_cols = [data.columns.get_loc(col) for col in data.select_dtypes(['category']).columns.tolist()]
# missForest imputation
imputer = MissForest(random_state=RS)
imputed = imputer.fit_transform(data, cat_vars=cat_cols)
The positional indices of the categorical columns have to be passed as a list. In R nearly everything can be indexed by name, so why does it have to be numbers here…
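To make the name-to-position step a little less annoying, it can be wrapped in a small helper. This is only a sketch: `positions_of` and the toy frame are made up for illustration and are not part of missingpy; only the column names echo the data in this post.

```python
import pandas as pd

# Hypothetical toy frame: the values are invented for illustration.
toy = pd.DataFrame({
    "AGE": [34.0, 51.0, 47.0],
    "SEX": pd.Categorical([0, 1, 0]),
    "WEIGHT": [61.5, 80.2, 72.0],
    "GENOTYPE_RESULT": pd.Categorical([2, 0, 3]),
})

def positions_of(frame, names):
    """Translate column names into positional indices (hypothetical helper)."""
    return [frame.columns.get_loc(name) for name in names]

# Explicit names, R-style
cat_positions = positions_of(toy, ["SEX", "GENOTYPE_RESULT"])

# Or pick up every column whose dtype is already 'category'
auto_positions = positions_of(toy, toy.select_dtypes(["category"]).columns)

print(cat_positions)
```

Either list could then be handed over as `cat_vars`, in the same way as `cat_cols` above.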
imputed = pd.DataFrame(imputed, columns=data.columns.tolist())
for col in cat_cols:
    # assign by column name; assigning through .iloc can leave the dtype as float64
    name = imputed.columns[col]
    imputed[name] = imputed[name].astype('category')
imputed.describe(include='all')
I am not sure whether I did something wrong, but once the imputation finishes every column comes back as float64, so the categorical columns are converted back to category.
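If the original category levels should survive as well (for example integer levels 0 and 1 rather than 0.0 and 1.0), the levels of the input frame can be reattached after imputation. The following is a rough sketch under stated assumptions: `imputed_array` is a hand-written stand-in for the float array an imputer returns, and all values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical input with one continuous and one categorical column
original = pd.DataFrame({
    "AGE": [34.0, np.nan, 47.0],
    "SEX": pd.Categorical([0, 1, None], categories=[0, 1]),
})

# Stand-in for the imputer output: a plain float array (made-up values)
imputed_array = np.array([[34.0, 0.0],
                          [40.5, 1.0],
                          [47.0, 1.0]])

restored = pd.DataFrame(imputed_array, columns=original.columns)
for name in original.select_dtypes(["category"]).columns:
    levels = original[name].cat.categories
    # Cast the floats back to the dtype of the original levels, then
    # rebuild the categorical with exactly those levels.
    restored[name] = pd.Categorical(
        restored[name].astype(levels.dtype), categories=levels
    )

print(restored.dtypes)
```

Assigning by column name replaces the whole column, so the new dtype sticks.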
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Hi!
I tried your code on 2 of my datasets and still get the error related to categorical vars: "could not convert string to float: 'RL'". I wonder what I'm missing:
from missingpy import MissForest
imputer = MissForest()
imputed = imputer.fit_transform(X_train, cat_vars=[X_train.columns.get_loc(col) for col in X_train.select_dtypes(['category']).columns.tolist()])
imputed = pd.DataFrame(imputed, columns=X_train.columns.tolist())
for col in catColIndices:
    imputed.iloc[:,col] = imputed.iloc[:,col].astype('category')
imputed.describe(include='all')
Kindly help me out.
Thank you!
Hi Abhilash,
The error message says that a string could not be converted to float.
I assume your categorical values (such as RL) are stored as text rather than as numeric codes.
In my case, SEX and GENOTYPE_RESULT were entered as {0, 1} and {0, 1, 2, 3}, respectively.
Try recoding your categorical values as numbers instead of passing raw text.
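One possible way to do that recoding with pandas alone is sketched below. Everything in it is an assumption for illustration: the frame `X`, the column names `AREA` and `ZONE`, and the `fillna(1.0)` line, which merely stands in for whatever values the imputer would produce.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ZONE holds text labels such as 'RL'
X = pd.DataFrame({
    "AREA": [8450.0, np.nan, 11250.0, 9550.0],
    "ZONE": ["RL", "RM", None, "RL"],
})
text_cols = ["ZONE"]  # columns that contain raw text labels

labels = {}           # remember the label order of each column for decoding
encoded = X.copy()
for name in text_cols:
    as_cat = X[name].astype("category")
    labels[name] = as_cat.cat.categories
    codes = as_cat.cat.codes.astype("float64")
    # pandas marks missing values with code -1; turn those back into NaN
    encoded[name] = codes.where(codes >= 0)

# ... run the imputer on `encoded` here ...
# Stand-in for the imputed column (made-up fill value, not a real result)
filled = encoded["ZONE"].fillna(1.0)

# Map the numeric codes back to the original text labels
decoded = pd.Categorical.from_codes(
    filled.astype(int).to_numpy(), categories=labels["ZONE"]
)
print(list(decoded))
```

Keeping the `labels` mapping around is what makes it possible to turn the imputed codes back into readable values afterwards.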