Department of CSE
Kattankulathur. [email protected]
In this era of computerization, education
has also revamped itself and is no longer limited to traditional lecture methods. The
quest is on to find new ways to make education more effective and students more
efficient. These days, a large amount of data is collected in educational databases, but
it remains unutilized. Powerful tools are required to make proper use of such a
large amount of data. It is very important to study and analyze educational data
in order to help students improve. Educational Data Mining (EDM) is an emerging field
that explores data in educational contexts by applying different Data Mining (DM)
techniques and tools. It provides intrinsic knowledge of the teaching and learning
process for effective educational planning. This paper presents a comprehensive
survey of educational data mining and its future scope.
In the last decade, the number of universities/institutions has
proliferated manifold. A large number of graduates/postgraduates are produced
by them every year. Universities/institutes may follow the best of pedagogies,
but they still face the problem of dropout students, low achievers and
Educational Data Mining (EDM) is an emerging field exploring data in educational context by
applying different Data Mining (DM) techniques/tools. EDM inherits properties
from areas like Learning Analytics, Psychometrics, Artificial Intelligence,
Information Technology, Machine Learning, Statistics, Database Management Systems,
Computing and Data Mining. It can be considered an interdisciplinary research
field which provides intrinsic knowledge of the teaching and learning process for
effective educational planning.
Educational Data Mining is a new trend in the data mining and Knowledge Discovery in
Databases (KDD) field which focuses on mining useful patterns and discovering
useful knowledge from educational information systems, such as admissions
systems, registration systems, course management systems (Moodle, Blackboard, etc.),
and any other systems dealing with students at different levels of education,
from schools to colleges and universities. Researchers in this field focus on
discovering useful knowledge either to help educational institutes manage
their students better, or to help students manage their education and
deliverables better and enhance their performance.
Analyzing the factors for poor performance is a complex and incessant
process hidden in past and present information congregated from academic
performance and students’ behavior. Powerful tools are required to analyze and
predict the performance of students scientifically.
Universities/institutions collect an enormous amount of student data, but
this data remains unutilized and does not inform any decisions or policy
making to improve the performance of students.
If universities could identify the factors for low performance earlier and
predict students’ behavior, this knowledge could help them take
proactive action to improve the performance of such students. It will
be a win-win situation for all the stakeholders of universities/institutions
i.e. management, teachers, students and parents. Students will be able to
identify their weaknesses beforehand and can improve themselves. Teachers will
be able to plan their lectures as per the need of students and can provide
better guidance to such students. Parents will be reassured of their wards’
performance in such institutes. Management can bring in better policies and
strategies to enhance the performance of these students with additional
facilities. Eventually, this will help in producing a skilled workforce and hence
sustainable growth for the country.
and Pal conducted a study on a group of 50 students enrolled in a specific
course program across a period of 4 years (2007-2010), with multiple
performance indicators, including “Previous Semester Marks”, “Class Test
Grades”, “Seminar Performance”, “Assignments”, “General Proficiency”,
“Attendance”, “Lab Work”, and “End Semester Marks”. They used the ID3 decision tree
algorithm to construct a decision tree and a set of if-then rules, which
help instructors as well as students to better understand
and predict students’ performance at the end of the semester. Furthermore, they
defined the objective of their study as: “This study will also work to
identify those students which needed special attention to reduce fail ration
and taking appropriate action for the next semester examination”.
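To illustrate how ID3 chooses the attribute to split on, the following minimal sketch computes entropy and information gain, the criterion ID3 uses at each node. The attribute names echo the indicators listed above, but the four records, their values and the function names are hypothetical and are not data or code from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in entropy of `target` obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder


def best_attribute(records, attributes, target):
    """ID3 splits on the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))


# Hypothetical records, invented for illustration only.
students = [
    {"attendance": "good", "assignments": "yes", "result": "pass"},
    {"attendance": "good", "assignments": "no", "result": "pass"},
    {"attendance": "poor", "assignments": "yes", "result": "fail"},
    {"attendance": "poor", "assignments": "no", "result": "fail"},
]

print(best_attribute(students, ["attendance", "assignments"], "result"))
```

In this toy data, attendance separates the two outcomes perfectly, so it would become the root of the tree; a full ID3 implementation would then recurse on each branch with the remaining attributes.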
Abeer and Elaraby
conducted a similar study that mainly focuses on generating classification
rules and predicting students’ performance in a selected course program based
on previously recorded students’ behavior and activities. They
processed and analyzed previously enrolled students’ data in a specific course
program across 6 years (2005–10), with multiple attributes collected from the
university database. As a result, this study was able to predict, to a certain
extent, the students’ final grades in the selected course program, as well as
“help the students to improve the student’s performance, to identify those
students which needed special attention to reduce failing ration and taking
appropriate action at right time”.
Bhardwaj and Pal
conducted a significant data mining study using the Naïve Bayes
classification method on a group of BCA students. A questionnaire was administered
to each student before the final examination; it contained
multiple personal, social, and psychological questions that were used in the
study to identify relations between these factors and the student’s performance
and grades. Bhardwaj and Pal identified the main objectives of their study as:
• Generation of a data source of predictive variables
• Identification of different factors which affect a student’s learning behavior and performance
during academic career
• Construction of a prediction model using classification data mining techniques on the basis of
identified predictive variables
• Validation of the developed model for higher education students studying in Indian universities
They found that the most
influential factor for a student’s performance is the grade obtained in senior secondary
school, which suggests that students who performed well in
secondary school are likely to perform well in their Bachelor’s studies.
Furthermore, it was found that living location, medium of teaching,
mother’s qualification, the student’s other habits, family annual income, and
family status all contribute strongly to students’ educational
performance. Thus, a student’s grade, or performance in general, can be
predicted if basic personal and social information about the student is collected.
Baker and Yacef
describe the following four goals of EDM:
• Predicting student’s future learning behavior
• Discovering or improving domain models
• Studying the effects of educational support
• Advancing scientific knowledge about learning and learners
Predicting student’s future learning behavior – With the use of student modeling, this goal can be
achieved by creating student models that incorporate the learner’s
characteristics, including detailed information such as their knowledge,
behaviors and motivation to learn.
Discovering or improving domain models
– Through the various methods and applications of EDM, discovery of new models and
improvements to existing models are possible.
Studying the effects of educational support – It can be achieved through
learning systems.
Advancing scientific
knowledge about learning and learners – By building and incorporating student
models, the field of EDM research and the technology and software used
DATA MINING DEFINITION AND TECHNIQUES
Data mining refers to extracting or
“mining” knowledge from large amounts of data. Data mining techniques are
used to operate on large volumes of data to discover hidden patterns and
relationships helpful in decision making.
The various techniques used in Data Mining are described below.
Association analysis is the discovery of
association rules showing attribute-value conditions that occur frequently
together in a given set of data. Association analysis is widely used for market
basket or transaction data analysis.
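As a rough illustration of what "occurring frequently together" means, association rules are usually scored by support and confidence. The sketch below computes both for one rule; the course-enrollment "baskets" and the function names are hypothetical, chosen only to echo the educational setting.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(transactions, set(antecedent) | set(consequent)) / support(
        transactions, antecedent
    )


# Hypothetical course-enrollment "baskets", invented for illustration.
enrollments = [
    {"databases", "data mining"},
    {"databases", "data mining", "statistics"},
    {"databases", "networks"},
    {"statistics", "data mining"},
]

rule_support = support(enrollments, {"databases", "data mining"})
rule_confidence = confidence(enrollments, {"databases"}, {"data mining"})
print(rule_support, rule_confidence)
```

Here the rule "databases -> data mining" holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets containing databases (confidence about 0.67). Note that `confidence` assumes the antecedent occurs at least once; otherwise it divides by zero.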
In prediction, the goal is to develop a
model which can infer a single aspect of data from some combination of other
aspects of data. Prediction comes in three types:
classification, regression and density estimation. In any category
of prediction, the input variables can be either categorical or continuous.
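Of the three types of prediction, regression is the easiest to show in a few lines. The sketch below fits an ordinary least-squares line to one continuous input; the attendance and marks figures are hypothetical, and `fit_line` is an illustrative name rather than a function from any library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: attendance percentage vs. end-semester marks.
attendance = [60, 70, 80, 90]
marks = [50, 60, 70, 80]

slope, intercept = fit_line(attendance, marks)
predicted = slope * 75 + intercept  # predicted marks at 75% attendance
print(slope, intercept, predicted)
```

The single aspect being inferred is the marks, and the "other aspect" is attendance; with a categorical target instead of a continuous one, the same task becomes classification.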
Classification is the process of
finding a set of models (or functions) which describe and distinguish data
classes or concepts, so that the model can be used to predict
the class of objects whose class label is unknown.
• Clustering Analysis
Unlike classification and prediction,
which analyze class-labeled data objects, clustering analyzes data objects
without consulting a known class label. In general, the class labels are not
present in the training data simply because they are not known to begin with.
Clustering can be used to generate such labels. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
That is, clusters of objects are formed so
that objects within a cluster have high similarity in comparison to one
another, but are very dissimilar to objects in other clusters. Each cluster
that is formed can be viewed as a class of objects, from which rules can be derived.
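The text does not name a particular clustering algorithm. As one common choice, the sketch below runs k-means on one-dimensional data, alternating between assigning each value to its nearest centroid and moving each centroid to the mean of its cluster. The marks and the starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, clusters


# Hypothetical end-semester marks with no class labels attached.
marks = [35, 38, 42, 78, 82, 85]
centers, groups = kmeans_1d(marks, centroids=[30, 90])
print(centers, groups)
```

The two groups that emerge could then be labeled, for example as low and high achievers, which is the sense in which clustering can generate class labels that were not known beforehand.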
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a
single algorithm but a family of algorithms that share a common
principle: each feature is assumed to be conditionally independent of every
other feature, given the class.
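A minimal sketch of a categorical Naive Bayes classifier follows, assuming Laplace (add-one) smoothing, which the text does not mention. The training records are hypothetical, and `train` and `predict` are illustrative names rather than a library API.

```python
from collections import Counter, defaultdict


def train(records, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for features, label in zip(records, labels):
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, features):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # Laplace-smoothed P(value | class); multiplying these terms is
            # the "naive" conditional-independence assumption.
            score *= (value_counts[(i, label)][value] + 1) / (count + len(vocab[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical records: (attendance, assignments submitted) -> result.
X = [("good", "yes"), ("good", "no"), ("poor", "yes"), ("poor", "no"), ("good", "yes")]
y = ["pass", "pass", "fail", "fail", "pass"]

model = train(X, y)
print(predict(model, ("good", "no")), predict(model, ("poor", "yes")))
```

For longer feature vectors, summing log-probabilities instead of multiplying probabilities avoids numerical underflow; the product form is kept here because it mirrors Bayes’ Theorem directly.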
The naïve model is the default model that
predicts the class of every example in a dataset as the modal
(most frequent) class. For example, consider a dataset of 100 records and 2
classes (Yes & No) in which “Yes” occurs 75 times and “No” occurs 25 times. The
default model for this dataset will classify all objects as “Yes”; hence, its accuracy
will be 75%. Although this model is useless for prediction, it is important as a
baseline against which the accuracies produced by other classification models can be evaluated. This concept
can be generalized to all classes/labels in the data to produce an expectation
of the class recall as well.
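The 75% figure in the example can be reproduced with a few lines. The sketch below is one way to compute the default model's accuracy and per-class recall; the function names are illustrative.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


def class_recall(labels, predicted_label):
    """Recall per class when every example is predicted as `predicted_label`."""
    return {label: 1.0 if label == predicted_label else 0.0 for label in set(labels)}


# The example from the text: 75 "Yes" records and 25 "No" records.
labels = ["Yes"] * 75 + ["No"] * 25
label, accuracy = majority_baseline(labels)
print(label, accuracy)  # Yes 0.75
print(class_recall(labels, label))
```

Any classifier trained on this data should exceed 0.75 accuracy to be considered useful, and the recall of 0.0 for the “No” class shows why accuracy alone can be misleading on imbalanced data.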
CONCLUSION & FUTURE WORK
Data mining is a tremendously vast area
that includes employing different techniques and algorithms for pattern
finding. The algorithms discussed in this paper are the ones used in educational
data mining. These algorithms have shown a remarkable improvement in strategies like
course outline formation, teacher-student understanding and high output and
turnout ratio. The ICDM conference encourages the employment and development of
algorithms helpful in data mining. Considerable research is still being done
on various algorithms.
Prediction with data mining has yielded
benefits such as finding the set of weak students, determining students’
satisfaction with a particular course, faculty evaluation, comprehensive student
evaluation, classroom teaching language selection, predicting student
dropout, course registration planning, predicting the enrollment headcount,
and evaluation of collaborative activities.
One of the most recent and biggest challenges that higher education faces today is
making students skillfully employable. Many universities/institutes are not in
a position to guide their students because of a lack of information and assistance
from their teaching-learning systems. To better administer and serve the student
population, universities/institutions need better assessment, analysis, and