[SSL] Introduction to semi-supervised learning

Supervised learning and unsupervised learning

Machine learning has traditionally been divided into two main problem settings: supervised learning and unsupervised learning.

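In supervised learning, a model is trained on a dataset of pairs, each consisting of an input $x$ and its corresponding output value $y$.
The main goal is to build a classifier or regressor that can estimate the output value for inputs it has never seen before.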
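
To make the supervised setting concrete, here is a minimal sketch of a classifier built from $(x, y)$ pairs. It uses a 1-nearest-neighbour rule, chosen only because it is short; the toy dataset and the name `predict_1nn` are assumptions made for illustration.

```python
import math

def predict_1nn(train, x):
    """Return the output value of the training input closest to x."""
    _, nearest_label = min(train, key=lambda pair: math.dist(pair[0], x))
    return nearest_label

# Hypothetical toy dataset of (input x, output y) pairs.
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((4.8, 5.3), "b"),
]

# Estimate the output for inputs that do not appear in the training set.
print(predict_1nn(train, (0.9, 1.1)))
print(predict_1nn(train, (5.1, 4.9)))
```

Each prediction reuses the label of the closest known example, which is about the simplest way to generalize from labelled pairs to unseen inputs.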

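In unsupervised learning, unlike supervised learning, no output values are given.
Instead, the goal is to infer some underlying structure from the inputs.
For example, unsupervised clustering maps the given inputs to groups so that similar inputs end up in the same group.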
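
The clustering idea can be sketched with a small k-means loop on one-dimensional inputs. This is only an illustrative sketch: k-means is one possible clustering algorithm, and the data, the starting centers, and the name `kmeans_1d` are assumptions.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the given centers; no output values are used."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to the group of its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabelled inputs with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
print(clusters)
```

Only the inputs themselves drive the grouping, which is what separates this setting from the supervised one.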

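Semi-supervised learning is a branch of machine learning that aims to combine supervised and unsupervised learning.
It tries to improve performance on one of these two tasks by using information normally associated with the other.
For example, a classification problem can make use of additional data points that have no output (no label), and a clustering problem can make use of the knowledge that certain data points belong to the same class.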

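Most semi-supervised learning research has focused on classification.
Semi-supervised classification is used when labelled data is scarce, because a reliable supervised classifier cannot be built from too few labelled examples.
This situation typically arises in application domains where labelled data is hard or expensive to obtain, e.g. computer-aided diagnosis, drug discovery, and part-of-speech tagging.
If enough unlabelled data is available and certain assumptions about the data distribution hold, the unlabelled data can help build a better classifier.
In practice, even when labelled data is not scarce, unlabelled data can help build a better classifier as long as it provides additional information relevant to prediction.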
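
One way to see how unlabelled data can help is self-training: a classifier is fit on the labelled points, its most confident prediction on the unlabelled pool is added as a pseudo-label, and the process repeats. The sketch below is an assumption-laden illustration, not a method taken from the text: it uses a nearest-centroid classifier on 1-D inputs, and the data and the names `centroids`, `predict`, `confidence`, and `self_train` are hypothetical. It also assumes that points in the same cluster share a class, which is exactly the kind of distributional assumption mentioned above.

```python
def centroids(labeled):
    """Mean input per class, computed from (x, label) pairs."""
    groups = {}
    for x, label in labeled:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}

def predict(cents, x):
    """Label of the nearest class centroid."""
    return min(cents, key=lambda label: abs(x - cents[label]))

def confidence(cents, x):
    """Gap between the two smallest centroid distances (larger = more confident)."""
    d = sorted(abs(x - c) for c in cents.values())
    return d[1] - d[0]

def self_train(labeled, unlabeled):
    """Pseudo-label one unlabelled point per round, most confident first."""
    labeled, pool = list(labeled), list(unlabeled)
    while pool:
        cents = centroids(labeled)
        best = max(pool, key=lambda x: confidence(cents, x))
        pool.remove(best)
        labeled.append((best, predict(cents, best)))
    return centroids(labeled)

# Hypothetical data: one labelled point per class plus an unlabelled pool.
labeled = [(1.0, "a"), (6.0, "b")]
unlabeled = [2.0, 3.0, 4.0, 6.5, 7.0]

supervised_only = centroids(labeled)
semi_supervised = self_train(labeled, unlabeled)

print(predict(supervised_only, 4.2))
print(predict(semi_supervised, 4.2))
```

With only the two labelled points, the decision boundary sits midway between them. After self-training, the unlabelled points pull each centroid toward the middle of its cluster, so the boundary moves into the gap between the clusters. If the cluster assumption does not hold for the data, the pseudo-labels can just as easily make the classifier worse.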

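The aim is to give the reader an overview of the semi-supervised learning research field, covering recent work and advances as well as the main algorithms and approaches.
A new taxonomy of semi-supervised classification methods is presented to show which assumptions the methods share and how they relate to existing supervised methods.
This offers a new perspective that helps in understanding the different methods and the connections between them.
The assumptions underlying semi-supervised learning are also explained.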

Basic concepts and assumptions of semi-supervised learning, and its relation to clustering

Taxonomy of semi-supervised learning

Inductive methods
Sect. 4 : wrapper methods
Sect. 5 : unsupervised preprocessing
Sect. 6 : intrinsically semi-supervised methods

Transductive methods

Semi-supervised regression and semi-supervised clustering

Outlook for semi-supervised learning