Received Date: December 13, 2016; Accepted Date: March 27, 2017; Published Date: March 29, 2017
Citation: Assari R, Azimi P, Taghva MR (2017) Heart Disease Diagnosis Using Data Mining Techniques. Int J Econ Manag Sci 6: 415. doi: 10.4172/2162-6359.1000415
Copyright: © 2017 Assari R, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
In recent decades, heart disease has been identified as the leading cause of death across the world. At the same time, it is considered the most preventable and controllable disease. According to the World Health Organization (WHO), early and timely diagnosis of heart disease plays a remarkable role in preventing its progression and reducing related treatment costs. Given the ever-increasing number of heart disease-induced fatalities, researchers have adopted different data mining techniques to diagnose it. Published results show that the same data mining techniques perform differently on different datasets. This study aims to assist healthcare specialists in diagnosing heart disease early and in assessing related risk factors. To this end, the main heart disease diagnosis indices were identified using experts’ opinions. Then, data mining techniques were applied to a heart-related dataset. Finally, the most influential heart disease diagnosis indices were determined and a model was developed based on the extracted rules. Visual Studio was used to write the algorithm code.
Bayesian network; Data mining; Decision tree; Heart disease; K-nearest neighbor; Support vector machines
In the past decade, heart disease has been the leading cause of death across continents and countries, regardless of national income level. According to a WHO report, heart disease accounts for 7.2 million deaths, i.e., 12.8% of all fatalities in the world. Figure 1 illustrates deaths from heart disease across the world (scale: 1:100000). According to recent research predictions, cardiovascular diseases will remain the leading cause of death through 2030.
Although cardiovascular diseases have been identified as the leading cause of death in the world in the past decade, they are also regarded as the most preventable and controllable diseases. The complete and correct treatment of a disease depends on its timely diagnosis. An accurate and systematic tool for identifying high-risk patients and extracting data for the timely diagnosis of heart disease is therefore a critical need.
Every day, modern computer-based systems collect large amounts of data using automatic data record systems in different fields. Data mining technology is the product of the evolution of database technology, IT and storage devices. The current challenge is to make data mining and knowledge discovery systems applicable to a wider range of domains. Researchers are adopting data mining techniques to diagnose different diseases, including diabetes, stroke, cancer and heart disease. Given the high rate of cardiovascular-induced fatalities, researchers have tried to adopt data mining systems to diagnose heart disease.
In the healthcare field in particular, automatic data record systems collect large amounts of data every day, from which data mining can extract valuable knowledge. The next section briefly explains heart disease and the application of data mining techniques in treating such diseases.
According to WHO, heart disease, as the leading cause of death in the world, accounts for 3.8 million deaths in males and 3.4 million deaths in females.
The symptoms and incidence of heart disease differ from one person to another. However, they commonly include chest pain, jaw pain, neck pain, back pain, stomach disorders, arm and shoulder pain, and shortness of breath. Different heart problems induce different heart diseases, including coronary artery disease, heart failure and stroke.
Although heart disease has been identified as the most chronic disease across the world, it is the most preventable one at the same time. A healthy lifestyle (primary prevention) and timely diagnosis (secondary prevention) are the two main elements of heart disease control. Regular check-ups play a remarkable role in the diagnosis and early prevention of heart disease complications. Several tests, including chest X-rays, angiography, echocardiography and the exercise tolerance test, contribute to this goal. However, these tests are costly and require accurate medical equipment.
Applications of data mining to healthcare data
Data mining scholars have long studied how tools and equipment can improve the analysis of large and complex datasets. In medicine, data mining techniques are of high importance for diagnosis, prediction and a deeper understanding of healthcare data. Their applications include the analysis of treatment centers aimed at improving treatment policies and preventing mistakes in hospitals, the early diagnosis of diseases, the prevention of diseases and the reduction of hospital deaths.
Heart specialists record and store large amounts of patient data. This provides a great opportunity for extracting valuable knowledge from such datasets. Researchers are adopting statistical approaches as well as data mining techniques to help treatment and healthcare specialists diagnose heart disease and determine its risk factors in patients. Statistical analyses have identified a number of risk factors for heart disease, including age, blood pressure, smoking, total cholesterol, diabetes, hypertension, family history of heart disease, obesity and lack of physical activity. Awareness of these risk factors helps treatment and healthcare specialists identify high-risk patients.
Researchers have employed different data mining techniques to help specialists and physicians diagnose heart disease. Some techniques, such as Naïve Bayes, decision tree and K-nearest neighbor, are more common. However, there are other classification-based data mining techniques, such as kernel density, neural network, bagging algorithm, sequential minimal optimization, direct kernel self-organizing map and support vector machine. The next section briefly explains the techniques used in this study.
There are different types of decision trees. They differ only in the mathematical model they use to select the splitting attribute during rule extraction. The gain ratio decision tree is the most common and successful type. Gain ratio is the ratio of information gain (derived from entropy) to split information.
In the entropy technique, the attribute that minimizes entropy, and therefore maximizes information gain, is selected as the tree root. To select the tree root, the information gain of each attribute is calculated first, and the attribute with the highest information gain is then selected. Entropy, from which information gain is computed, is derived from relation 1.3:
$$E = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where k is the number of response variable classes and p_i is the ratio of the number of events in the ith class to the total number of samples (the occurrence probability of class i).
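The root-selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy rows and the class labels are hypothetical and do not come from the study, and the gain ratio follows the usual definition (information gain divided by the entropy of the attribute's own value distribution).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr_index):
    """Information gain normalized by the split information of the attribute."""
    split_info = entropy([row[attr_index] for row in rows])
    if split_info == 0:  # attribute has a single value: no useful split
        return 0.0
    return information_gain(rows, labels, attr_index) / split_info


# Hypothetical toy data: two discretized attributes per patient.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["sick", "sick", "healthy", "healthy"]

# The attribute with the highest gain ratio becomes the tree root.
root = max(range(2), key=lambda j: gain_ratio(rows, labels, j))
```

In this toy data the first attribute separates the two classes perfectly, so it is chosen as the root, while the second attribute carries no information about the class.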
The Naïve Bayes classifier, a simple form of Bayesian network, is a statistical technique that predicts the class membership of a sample using probability theory. It performs classification in accordance with Bayes’ theorem. It assumes that the influence of the value of an attribute on a class is independent of the influence of the other attributes. This assumption is called “class conditional independence”. It is made to simplify the calculations involved, which is why the technique is named “Naïve”, i.e., simple.
This technique calculates the prior probability of the response variable and the conditional probabilities of the other variables from the training data. Then, for every sample in the test dataset, the probability of each value of the response variable is calculated, and the value with the highest probability is selected. The probability of a response variable value for a test sample is derived from relation 4.3:
$$P(c_i \mid V) \propto P(c_i) \prod_{j} P(a_j = v_j \mid c_i)$$
where V, c_i, a_j and v_j are the test sample, the response variable value, a data attribute and the value of that attribute in the test sample, respectively.
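The procedure described above (priors and conditional probabilities from the training data, then the most probable class for each test sample) can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names and the toy data are hypothetical, attributes are assumed to be discrete, and no smoothing is applied, so an attribute value never seen with a class gives that class a probability of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts


def predict(model, sample):
    """Return the most probable class and the unnormalized score of each class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior probability of the class
        for j, v in enumerate(sample):
            # conditional probability of attribute j taking value v in class c
            score *= value_counts[(c, j)][v] / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data: (blood pressure level, exercise-induced pain).
rows = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "no")]
labels = ["sick", "sick", "healthy", "healthy"]

model = train_naive_bayes(rows, labels)
label, scores = predict(model, ("high", "yes"))
```

The scores are proportional to the class probabilities; dividing each by their sum would give normalized probabilities.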
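In K-nearest neighbor (KNN) classification, a sample is assigned to the class that is most common among its K closest training samples. If a is the first sample, denoted by (a1, a2, …, an), and b is the second sample, denoted by (b1, b2, …, bn), the distance between them is calculated by relation 2-3.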
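A distance-based vote of this kind can be sketched as below. The text does not preserve the distance relation itself, so the Euclidean distance used here is an assumption, as are the function names and the toy data; other distance measures would fit the same structure.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Distance between two samples a = (a1..an) and b = (b1..bn).

    Euclidean distance is an assumption made for this sketch.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, sample, k=7):
    """Majority class among the k training samples closest to `sample`."""
    neighbors = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean_distance(pair[0], sample),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data with two numeric attributes per sample.
train_rows = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_labels = ["healthy", "healthy", "healthy", "sick", "sick"]

near_origin = knn_predict(train_rows, train_labels, (0.2, 0.2), k=3)
near_cluster = knn_predict(train_rows, train_labels, (5.5, 5), k=3)
```

The default of k=7 mirrors the value reported as best in this study; the toy example uses k=3 only because it has five training samples.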
Support vector machine
A Support Vector Machine (SVM) finds the boundary that best classifies and separates the data, and this boundary is defined by the support vectors. Only the data points that act as support vectors are used to build the model, which means that the algorithm is not sensitive to the other data. SVM aims to find the boundary with the largest possible distance from all classes (i.e., from their support vectors). It maps the data to a new space, with respect to their predetermined classes, in which the data can be classified and separated linearly (using hyperplanes). It then searches for support lines (or support planes in multi-dimensional space) and determines the equation of the straight line that maximizes the distance between the two classes. Each support line is characterized by an equation describing the boundary of its class.
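The geometry of the separating boundary can be illustrated with a short sketch. It assumes the standard linear formulation, in which the boundary is w·x + b = 0 and the two support planes are w·x + b = +1 and w·x + b = -1, so that the distance between them is 2/||w||. The weights, bias and points below are hypothetical, and no training is performed here.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the support planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical separating line x1 + x2 - 3 = 0 in two dimensions.
w, b = (1.0, 1.0), -3.0

on_negative_plane = decision_value(w, b, (1.0, 1.0))  # lies on w.x + b = -1
on_positive_plane = decision_value(w, b, (2.0, 2.0))  # lies on w.x + b = +1
width = margin_width(w)
```

Training an SVM amounts to choosing w and b so that this width is as large as possible while the classes stay on their own sides of the support planes.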
This study used the Cleveland Clinic Foundation dataset, known as the “Cleveland Clinic Foundation Heart Disease Dataset”. The dataset includes 13 attributes (Table 1) and 303 samples, 3 of which were incomplete and hence excluded from this study. The continuous values of this dataset were discretized using the equal frequency method, which groups continuous values into 5 classes.
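Equal frequency discretization can be sketched as follows. The cut points are taken from the sorted values so that each of the 5 classes receives roughly the same number of samples. The function names and the sample ages are hypothetical, and the exact handling of ties and boundaries in the tools used by the study may differ.

```python
import bisect


def equal_frequency_cut_points(values, n_bins=5):
    """Cut points that split the sorted values into n_bins groups of similar size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each value to a class index from 0 to len(cut_points)."""
    # Equal values always fall in the same class, so class sizes can differ
    # slightly when the data contain many ties.
    return [bisect.bisect_right(cut_points, v) for v in values]


# Hypothetical ages in years.
ages = [29, 34, 41, 45, 50, 54, 58, 62, 67, 77]

cuts = equal_frequency_cut_points(ages)
classes = discretize(ages, cuts)
```

With ten distinct values and 5 classes, each class receives exactly two samples.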
|Age||Continuous||Age in years|
|Cp||Discrete||Chest pain type:|
|Trestbps||Continuous||Resting blood pressure in (mm Hg)|
|Chol||Continuous||Serum cholesterol in (mg/dl)|
|Restecg||Discrete||Resting electrocardiographic results: 1=having ST-T wave abnormality; 2=showing probable or definite left ventricular hypertrophy|
|Thalach||Continuous||Maximum heart rate achieved|
|Oldpeak||Continuous||ST depression induced by exercise relative to rest|
|Slope||Discrete||The slope of the peak exercise segment:|
|Ca||Discrete||Number of major vessels colored by fluoroscopy that range between 0 and 3|
|2=patient who is subject to possible heart disease|
Table 1: Cleveland clinic foundation heart disease dataset indices.
After discretization, the selected data mining techniques were applied to the dataset. The 10-fold cross-validation method was used to validate the results. This method divides the dataset into 10 portions. In each run, 9 portions are used for training the algorithm and 1 portion is used for evaluation. The process is repeated 10 times, so that each portion is used once for evaluation. This procedure is especially helpful for datasets with a small number of samples, because every sample is evaluated by a model that was not trained on it, which reduces the risk of reporting over-fitted results. Finally, the sensitivity, specificity and accuracy of each method were calculated.
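The splitting step of 10-fold cross-validation can be sketched as follows. The function name, the shuffling with a fixed seed and the sample count of 300 (303 samples minus the 3 excluded ones) are assumptions made for illustration; the tools used in the study may partition the data differently, for example with stratification.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 300 samples in 10 folds: 270 for training and 30 for evaluation per run.
splits = list(k_fold_indices(300, k=10))
```

A classifier would be trained on each `train` list and evaluated on the matching `test` list, and the 10 evaluation results would then be combined.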
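Sensitivity is the proportion of actual positive cases that are correctly identified as positive. Specificity is the proportion of actual negative cases that are correctly identified as negative. Accuracy is the proportion of all cases, positive and negative, that are classified correctly (relations 2-4 to 2-6).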
Sensitivity=True Positive/Positive (2-4)
Specificity=True Negative/Negative (2-5)
Accuracy=(True Positive + True Negative)/(Positive + Negative) (2-6)
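Relations 2-4 to 2-6 can be computed directly from the four cells of a confusion matrix. In the sketch below, the function name and the counts are hypothetical and are not results from this study; "Positive" is taken as true positives plus false negatives and "Negative" as true negatives plus false positives.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    positives = tp + fn  # all actual positive cases
    negatives = tn + fp  # all actual negative cases
    return {
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        "accuracy": (tp + tn) / (positives + negatives),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these made-up counts, 40 of the 50 actual positives and 45 of the 50 actual negatives are classified correctly.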
This section compares the accuracy, sensitivity and specificity of the employed techniques in terms of their confusion matrices.
Decision tree, Naïve Bayes, K-Nearest Neighbor and Support Vector Machine were applied to the studied dataset. Table 2 shows the sensitivity, specificity and accuracy of these data mining techniques. Their accuracy ranges from 79% to 84.33%. According to Table 2, SVM achieved the highest accuracy (84.33%).
Table 2: Results of employed data mining techniques.
SVM and Naïve Bayes achieved the highest accuracy, followed by KNN (k=7 resulted in the best accuracy as compared to other values) and decision tree, respectively. Weka and IBM SPSS Modeler were used to implement data mining techniques.
Identification of influential factors
After implementing the above-mentioned techniques and performing the related analyses, the following influential indices were identified. Table 3 compares all 13 input attributes in terms of their significance, based on the results obtained from each technique. Across the 4 employed techniques, the variables thal (0.187), ca (0.185) and cp (0.152) had the highest significance, in that order, while chol (0.032) and trestbps (0.03) had the lowest. Figure 2 compares the attributes in terms of significance.
|Naïve Bayes||KNN K=7||SVM||Decision Tree|
Table 3: Significance comparison of input indices in each method.
Heart disease is the leading cause of death across the world. It accounts for 7.2 million deaths, i.e., 12.8% of fatalities in the world. Although cardiovascular diseases have been identified as the leading cause of death in the past decade, they are the most preventable and controllable diseases at the same time. Deaths from cardiovascular diseases show an ever-increasing trend, while early diagnosis plays an important role in improving patients’ health status and decreasing fatalities. Therefore, this study aimed to help physicians diagnose such diseases early and assess heart disease risk factors in the studied individuals. After reviewing different papers, selecting different data mining techniques and applying them to the selected dataset, the SVM technique achieved the highest accuracy (84.33%). In SVM, as well as in the other employed techniques, thal, ca and cp were the most influential indices on average. In future work, this technique is expected to be applied to a localized dataset consisting mainly of non-invasive indices, which in turn would impose lower costs and fewer complications on patients.