Python missForest imputation example for mixed categorical and continuous features

In R, missForest imputation is a one-liner even when continuous and categorical variables are mixed, but in Python it is surprisingly fiddly.

import numpy as np
import pandas as pd
from missingpy import MissForest
RS = 100

First, load the required libraries. I use the missingpy package.

data = pd.read_csv('inhadr.csv', na_values=['-100', '-1000'])
del data["ID"] #patient id not needed
del data['AST'] # should be blinded for finding DILI patients
del data['ALT'] # should be blinded for finding DILI patients
data['SEX'] = data.SEX.astype('category')
data['GENOTYPE_RESULT'] = data.GENOTYPE_RESULT.astype('category')

SEX and GENOTYPE_RESULT are the categorical columns.

%%time
# get indices of categorical features
cat_cols = [data.columns.get_loc(col) for col in data.select_dtypes(['category']).columns.tolist()]
# missForest imputation
imputer = MissForest(random_state=RS)
imputed = imputer.fit_transform(data, cat_vars=cat_cols)

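The positions of the categorical columns have to be passed as a list of integers. In R almost everything can also be indexed by name, so why does it have to be positions here…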
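If you would rather think in column names, a small helper can translate names into the positions that `cat_vars` expects. This is only a sketch: `positions_of` and the toy frame are hypothetical names made up for illustration and are not part of missingpy.

```python
import numpy as np
import pandas as pd

def positions_of(df, names):
    """Return the integer positions of the named columns in df."""
    # Index.get_indexer returns -1 for names that are not present
    idx = df.columns.get_indexer(names)
    missing = [n for n, i in zip(names, idx) if i == -1]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return idx.tolist()

# toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "AGE": [34.0, np.nan, 51.0],
    "SEX": pd.Categorical([0, 1, None]),
    "GENOTYPE_RESULT": pd.Categorical([2, None, 0]),
})

cat_cols = positions_of(toy, ["SEX", "GENOTYPE_RESULT"])
print(cat_cols)
```

The resulting list can then be passed as `cat_vars=cat_cols`, exactly as in the cell above.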

imputed = pd.DataFrame(imputed, columns=data.columns.tolist())
for col in data.columns[cat_cols]:
    # assign by column name; assigning through .iloc can leave the column as float64
    imputed[col] = imputed[col].astype('category')
imputed.describe(include='all')

I am not sure whether I did something wrong, but once imputation finishes every column comes back as float64, so I convert the categorical ones back to category.
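The float64 result is most likely not a mistake: `fit_transform` hands back a plain NumPy array, and a NumPy array holds a single dtype. Below is a minimal sketch of restoring the dtypes afterwards. `restore_dtypes` is a hypothetical helper, the array is made up to stand in for the imputer output, and the cast to `int` assumes the categories are integer-coded, as they are in this dataset.

```python
import numpy as np
import pandas as pd

def restore_dtypes(imputed_array, original):
    """Rebuild a DataFrame from a 2-D array and reapply category dtypes."""
    out = pd.DataFrame(imputed_array, columns=original.columns, index=original.index)
    for name in original.select_dtypes(["category"]).columns:
        # integer-coded categories come back as floats (1.0), so cast to int first
        out[name] = out[name].astype(int).astype("category")
    return out

original = pd.DataFrame({
    "AGE": [34.0, np.nan, 51.0],
    "SEX": pd.Categorical([0, 1, None]),
})
# made-up array standing in for the imputer output (no missing values left)
fake_imputed = np.array([[34.0, 0.0], [42.5, 1.0], [51.0, 1.0]])

restored = restore_dtypes(fake_imputed, original)
print(restored.dtypes)
```

Passing `index=original.index` also keeps the original row labels, which a bare `pd.DataFrame(array)` would otherwise replace with 0..n-1.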

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

2 Comments

  • Abhilash says:

    Hi!

    I tried your code on 2 of my datasets and still get the error related to categorical vars: "could not convert string to float: 'RL'". I wonder what I'm missing:

    from missingpy import MissForest
    imputer = MissForest()
    imputed = imputer.fit_transform(X_train, cat_vars=[X_train.columns.get_loc(col) for col in X_train.select_dtypes(['category']).columns.tolist()])

    imputed = pd.DataFrame(imputed, columns=X_train.columns.tolist())
    for col in catColIndices:
        imputed.iloc[:,col] = imputed.iloc[:,col].astype('category')
    imputed.describe(include='all')

    Kindly help me out.

    Thank you!

    • mahler83 says:

      Hi Abhilash,
      According to your error message, there seems to be some error while converting a string to float.
      I assume your categorical values (such as RL) are not using numeric coding.
      In my case, SEX and GENOTYPE_RESULT were entered as {0, 1} and {0, 1, 2, 3}
      Try using numeric coding, not raw text in your categorical values.
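      One way to get numeric coding from text labels is to go through the pandas category codes. This is only a sketch: the column names `ZONE` and `AREA` and the values are hypothetical, and `fillna(0)` merely stands in for whatever the imputer would fill in.

```python
import numpy as np
import pandas as pd

# toy data with a string-labelled categorical column (hypothetical values)
df = pd.DataFrame({
    "ZONE": ["RL", "RM", None, "RL"],
    "AREA": [8450.0, np.nan, 11250.0, 9550.0],
})

cat_names = ["ZONE"]
categories = {}
encoded = df.copy()
for name in cat_names:
    cat = encoded[name].astype("category")
    categories[name] = cat.cat.categories        # remember labels for decoding
    codes = cat.cat.codes.astype(float)          # missing values get code -1
    encoded[name] = codes.where(codes >= 0, np.nan)  # turn -1 back into NaN

print(encoded)

# After imputation the codes can be mapped back to the original labels.
# fillna(0) is a placeholder for the imputed values, not a real imputation.
filled = encoded["ZONE"].fillna(0).astype(int)
decoded = pd.Categorical.from_codes(filled, categories=categories["ZONE"])
print(list(decoded))
```

      The `encoded` frame is all numeric, so it can be passed to `fit_transform` with `cat_vars` pointing at the encoded columns, as in the post.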
