Parallel Technique for Balancing Medical Data

Abstract

Several factors can affect the performance of a machine learning
classifier, and an unbalanced dataset is among the most prominent. An
unbalanced dataset is one in which the classes are disproportionate, i.e. the
instances belonging to one class heavily outnumber the instances belonging to
all other classes. This problem is especially common in medical data, which is
collected from the real world, where the number of people affected by a disease
is typically far smaller than the number who are not. Because of this
disproportion, classifiers have difficulty learning the concepts associated
with the minority class. In this work we discuss some of the leading data
balancing techniques. Most of them were designed with general data in mind and
are not well suited to medical data. This paper proposes a method that balances
medical data more effectively while also improving performance and reducing the
learning time of the classifier.

Introduction

Data imbalance is a typical problem in the field of supervised
classification. The aim of supervised classification is to categorize unknown
data points given a set of known data points. Applying supervised classifiers
to imbalanced data has become a matter of interest in recent years. Unbalanced
datasets are common in real-world situations such as diagnosing gear faults [1],
medical diagnosis [2], detecting network intrusion [3, 4], text classification
[5, 6], detecting financial statement fraud [7], and classifying data streams
[8]. These real-life classification problems mostly consist of a majority class
that has a significantly larger number of instances than the minority classes.
The minority classes generally represent the occurrence of an event and are
more relevant than the majority class. With unbalanced data, machine learning
classifiers have trouble learning the minority classes, because misclassifying
a minority-class data point contributes far less to the overall error than
misclassifying a majority-class data point. As a result, most machine learning
classifiers become biased towards assigning instances to the majority class and
tend to treat minority-class instances as outliers. This behavior prevents the
system from being used in real-world scenarios [9, 10].

 

Most research in previous years has focused on the classification of
unbalanced binary data [11]. The problem of unbalanced datasets becomes more
critical in multi-class classification, where researchers have tried to convert
multi-class data into binary data by selecting the smallest class as the
minority class and merging all other classes into one majority class [12, 13].
These approaches have difficulty selecting the artificial minority class,
because two or more classes may have similar numbers of data points. Many
approaches create synthetic data to enlarge the minority class using methods
such as SMOTE [14], but oversampling the minority class by adding synthetic
data points is not considered a suitable approach for medical data, as it
raises questions about the authenticity of the data.
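The multi-class to binary conversion described above can be illustrated with a minimal sketch. The function name `binarize_by_smallest_class`, the class labels, and the class sizes are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from collections import Counter


def binarize_by_smallest_class(labels):
    """Relabel a multi-class label list as minority vs. majority.

    The smallest class becomes the minority class and all other
    classes are merged into a single majority class. Ties are broken
    alphabetically, which is an arbitrary choice made for this sketch.
    """
    counts = Counter(labels)
    smallest = min(counts.values())
    candidates = sorted(c for c, n in counts.items() if n == smallest)
    minority = candidates[0]
    binary = ["minority" if y == minority else "majority" for y in labels]
    return binary, minority


# Hypothetical dataset: classes B and C are both rare and similar in size.
labels = ["A"] * 50 + ["B"] * 6 + ["C"] * 5
binary, minority = binarize_by_smallest_class(labels)

print("original counts:", dict(Counter(labels)))
print("chosen minority class:", minority)
print("binary counts:", dict(Counter(binary)))
```

In this made-up example class B is almost as rare as class C, yet it is absorbed into the majority class, which is the selection difficulty noted above.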

 

In this paper, a method based on parallel data division is proposed
for efficient balancing of medical data. The approach balances instances
between the majority and minority classes while using every valuable data
instance in the dataset. It neither creates synthetic data to oversample the
minority classes nor loses valuable data by undersampling the majority class.
As a result, it proves to be very efficient at un-biasing the classifier,
improving the performance of the system while reducing its training time.

 

 

Background

 

Many approaches have been proposed to solve the problem of
unbalanced learning. For a clearer understanding, these approaches can be
grouped as follows.

Random Sampling

Random Oversampling: Random oversampling is a method that tries to
balance the disproportion among classes by using a non-heuristic function to
randomly replicate instances of the minority class. Its main drawback is that
it replicates minority-class instances without any alteration, which increases
the chance of overfitting.

Random Undersampling: Random undersampling is a method that tries to
balance the disproportion among classes by using a non-heuristic function to
randomly eliminate instances of the majority class. Its main shortcoming is
that it affects the induction process by potentially discarding valuable data.
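The two random sampling strategies above can be sketched in a few lines of standard-library Python. This is a minimal illustration under assumptions, not the implementation used in this paper: the function names, the toy `samples` and `labels`, and the choice to balance every class to the largest (or smallest) class size are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def random_oversample(samples, labels, seed=0):
    """Replicate randomly chosen instances until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    out_x, out_y = [], []
    for label, group in groups.items():
        # Exact copies drawn with replacement: no new information is added.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_x.extend(group + extra)
        out_y.extend([label] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Randomly keep only as many instances per class as the
    smallest class has; the rest are discarded."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    out_x, out_y = [], []
    for label, group in groups.items():
        out_x.extend(rng.sample(group, target))
        out_y.extend([label] * target)
    return out_x, out_y


# Toy data: 8 "healthy" records and 2 "diseased" records (made-up values).
samples = list(range(10))
labels = ["healthy"] * 8 + ["diseased"] * 2

over_x, over_y = random_oversample(samples, labels)
under_x, under_y = random_undersample(samples, labels)
print("oversampled:", dict(Counter(over_y)))
print("undersampled:", dict(Counter(under_y)))
```

In the oversampled output the minority class still contains only its two original records, repeated, which is where the overfitting risk comes from. In the undersampled output six of the eight majority records are gone, which is the data loss described above.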

 

Tomek links

This is an undersampling approach in which majority-class instances
that form Tomek links [15] are eliminated. Let Ia and Ib be instances that
belong to two different classes. (Ia, Ib) is a Tomek link if there does not
exist any instance Ic for which D(Ia, Ic)