Stroke Prediction by Decision Tree Algorithm
- Amir Aliz
- Oct 27, 2021
- 5 min read
Updated: Dec 28, 2022
Decision Tree (DT): The decision tree algorithm is a supervised machine learning method commonly used for classification problems. Supervised learning is a subcategory of machine learning in which an algorithm is trained to classify data using labelled datasets: the training data includes both the inputs and their correct outputs, which is what allows the model to learn.
Project:
According to the WHO, stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This project predicts whether a patient is likely to get a stroke based on the input parameters like gender, age, various diseases, and smoking status.
The first step is to get the dataset in CSV or Excel format and prepare the data for the decision tree algorithm.
Load all the basic libraries.

In this project, the dataset is stored on a MongoDB online server, so we first need to connect to MongoDB. We create a class with functions that connect to the MongoDB server, read the data, and convert the dataset to a DataFrame so that we can explore it.
Here I use private methods for all of my classes in the program.
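A minimal sketch of such a class, assuming pymongo is available; the class name `MongoReader` and its parameters are illustrative, since the article does not show the original code or credentials:

```python
class MongoReader:
    """Connects to a MongoDB server and loads a collection as a DataFrame."""

    def __init__(self, uri, db_name, collection_name):
        # Double leading underscores make these attributes private,
        # mirroring the article's use of private members.
        self.__uri = uri
        self.__db_name = db_name
        self.__collection_name = collection_name

    def read_dataframe(self):
        # Imported lazily so the sketch can be read (and the class defined)
        # without the dependencies installed.
        import pandas as pd
        from pymongo import MongoClient

        client = MongoClient(self.__uri)
        docs = list(client[self.__db_name][self.__collection_name].find())
        return pd.DataFrame(docs)
```

Calling `read_dataframe()` on a real connection string returns the collection as a pandas DataFrame ready for exploration.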

Securing the data and the application is an essential concern. Hash algorithms are one-way functions, so the original password cannot be recovered from its hash.
The hash class consists of generate_hash_password and check_password, which take the password from the user, convert it to a hash, and then verify it.
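A sketch of such a class using Python's standard library; the salting with PBKDF2 is my assumption, since the article does not show which hash function it uses:

```python
import hashlib
import os


class Hash:
    """One-way password hashing: the password cannot be recovered from the hash."""

    @staticmethod
    def generate_hash_password(password: str) -> str:
        # A random salt prevents identical passwords from producing
        # identical hashes (an assumption; the article may not salt).
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return salt.hex() + ":" + digest.hex()

    @staticmethod
    def check_password(password: str, stored: str) -> bool:
        # Re-hash the candidate password with the stored salt and compare.
        salt_hex, digest_hex = stored.split(":")
        digest = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), bytes.fromhex(salt_hex), 100_000
        )
        return digest.hex() == digest_hex
```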

The data consists of 13 feature columns, which can be seen below. Some of these columns are irrelevant and have no effect on stroke, such as _id (the id stored by MongoDB), id (the id the dataset itself contains), ever_married, and work_type, and they need to be removed.

The next step is to clean and prepare the data (read this article). To produce a proper analysis, we need to get rid of extra and missing values in the dataset; in fact, data scientists spend most of their time (around 80%) on this. There are some steps in preparing data that every

analyst should consider:
Get Rid of Extra Spaces
Select and Treat All Blank Cells
Convert Numbers Stored as Text into Numbers
Remove Duplicates
Highlight Errors
Change Text to Lower/Upper/Proper Case
Spell Check
Delete all Formatting
In the Exploring_data class, the cleaning_data function drops the irrelevant columns (ever_married, id, work_type and _id) from the DataFrame.
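The drop step can be sketched with pandas on a toy frame (the real dataset has 13 columns; only a few are shown here):

```python
import pandas as pd

# Toy frame containing the irrelevant columns named in the article.
df = pd.DataFrame({
    "_id": ["a1", "a2"],
    "id": [101, 102],
    "ever_married": ["Yes", "No"],
    "work_type": ["Private", "Govt_job"],
    "age": [67, 45],
    "stroke": [1, 0],
})

# Drop the columns that have no effect on stroke.
df = df.drop(columns=["_id", "id", "ever_married", "work_type"])
print(list(df.columns))  # ['age', 'stroke']
```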

Next, we look for missing values in the dataset. It is very common to fill missing values with the median, and here the bmi column's missing values are filled with the median. Then we find the indices of the missing values in the smoking_status and gender columns and remove those rows from our DataFrame.
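A minimal pandas sketch of both steps, on made-up rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.5],
    "smoking_status": ["never smoked", "smokes", np.nan],
    "gender": ["Male", np.nan, "Female"],
})

# Fill missing bmi values with the column median.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Drop the rows where smoking_status or gender is missing.
df = df.dropna(subset=["smoking_status", "gender"]).reset_index(drop=True)
print(len(df))  # 1 row survives in this toy example
```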

The last step in cleaning the data is to convert text values stored in the data into numbers. For the analysis we need only numbers, so we use the map function to convert the text into numbers and return the final result as df.
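For example, with pandas' `Series.map` (the exact category-to-number codes here are illustrative, not necessarily the ones the project uses):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "smoking_status": ["smokes", "never smoked", "smokes"],
})

# map() replaces each text category with a numeric code.
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
df["smoking_status"] = df["smoking_status"].map({"never smoked": 0, "smokes": 1})
print(df["gender"].tolist())  # [0, 1, 0]
```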

In the following step, the data should be split into training data and test data. The decision tree algorithm needs part of the data to be trained on and uses the remaining data to test its predictions. The split function separates the features as self.__X and the target labels as self.__y before splitting them into training and test sets.
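A sketch of that split with scikit-learn, assuming a DataFrame `df` with a `stroke` target column (the `test_size` value here is an assumption, since the article does not state the split ratio):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [30, 40, 50, 60, 70, 80],
    "hypertension": [0, 0, 1, 0, 1, 1],
    "stroke": [0, 0, 0, 1, 1, 1],
})

X = df.drop(columns=["stroke"])  # features
y = df["stroke"]                 # labels

# Hold out a third of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print(len(X_train), len(X_test))  # 4 2
```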

One of the more complicated steps in building a decision tree is deciding which attribute to choose as the root. Random selection can be used in some cases, but it gives no assurance of high accuracy. To solve this problem, attribute selection measures use criteria such as entropy, information gain, Gini index, gain ratio, reduction in variance, and chi-square. The attribute selection strategy calculates these criteria for every attribute and sorts the values. Finally, attributes are placed into the tree in that order, with the attribute of highest information gain placed at the root.
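For intuition, the Gini index of a node measures how mixed its class labels are: a pure node scores 0, and a 50/50 two-class node scores 0.5. A tiny sketch of the computation:

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini([0, 0, 0, 0]))  # 0.0  (pure node)
print(gini([0, 0, 1, 1]))  # 0.5  (maximally mixed for two classes)
```

The tree-building algorithm evaluates candidate splits by how much they reduce this impurity in the child nodes.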

In the Decision_tree class, we use Gini as the criterion in our decision tree classifier and then fit the model with the classifier.
Then, with the accuracy function, we check the accuracy.
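With scikit-learn, those two steps look roughly like this (the tiny hand-made data is a stand-in for the real training and test splits):

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the project's train/test splits: [age, hypertension].
X_train = [[30, 0], [40, 0], [55, 1], [65, 1]]
y_train = [0, 0, 1, 1]
X_test = [[35, 0], [60, 1]]
y_test = [0, 1]

# Gini is the splitting criterion, as in the article.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)  # 1.0 on this trivially separable toy data
```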


A common problem with decision trees is overfitting. The theoretical maximum depth of a decision tree is bounded by the number of samples, and a tree grown to that depth tends to overfit. Reaching maximum depth on a complex dataset is also laborious and time-consuming. Since restricting the depth changes the result only slightly, optimising the decision tree is a way to avoid both overfitting and wasted computation.
In this project, optimisation of the decision tree classifier is performed by pre-pruning only. The maximum depth of the tree can be used as a control variable for pre-pruning; I use a maximum depth of four here. Then, with the accuracy function, we check the accuracy again.
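In scikit-learn, pre-pruning by depth is just the `max_depth` parameter:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: cap the depth before training instead of growing a full tree.
pruned = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
pruned.fit([[30, 0], [40, 0], [55, 1], [65, 1]], [0, 0, 1, 1])
print(pruned.get_depth())  # 1 (this toy set needs only one split)
```

On the real dataset, the pruned tree stops at depth four even when more splits would be possible.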

One reason the decision tree is so popular is that the algorithm can be used for both regression and classification. Not only do decision trees not require feature scaling, they are also easy to interpret, since a decision tree can be visualised.
The Visualization class is a subclass of its parent class (Decision_tree), so we can reuse objects from the decision tree. In the plot_bar_chart function, we create a bar chart of the features, which shows how much each feature affects the chance of getting a stroke.
In the tree_Graph function, we display the decision tree graph of our algorithm, both original and optimised.
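A sketch of both plots with matplotlib and scikit-learn's `plot_tree` (toy data and the two feature names are placeholders for the project's real DataFrame):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

X = [[30, 0], [40, 0], [55, 1], [65, 1]]
y = [0, 0, 1, 1]
clf = DecisionTreeClassifier(criterion="gini", max_depth=4).fit(X, y)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: how much each feature contributes to the prediction.
ax1.bar(["age", "hypertension"], clf.feature_importances_)
ax1.set_title("Feature importances")

# Graph of the fitted tree itself.
plot_tree(clf, feature_names=["age", "hypertension"], ax=ax2)
ax2.set_title("Decision tree")

fig.savefig("tree_plots.png")
```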

The bar chart of features

This graph is the original graph of our decision tree. In this case, the algorithm computes many nodes at maximum depth. In the optimisation, we restrict the maximum depth to four, which does not reduce accuracy significantly and decreases the processing time.

The optimisation graph

This is the interface of my project: by entering a username and password and clicking the green button, the analysis is started.

For the user interface (UI), the Gui class is a great choice, as it handles the user's input and output.
Anything that happens in the user interface, like clicking a button or entering a username and password, is known as an event, and the Gui uses OOP in what is called an event-driven style.
In this project we use the Gui class to build our interface. First, we define a function for the main window, which provides much default functionality. Each application can have only one main window.

Secondly, I arrange the different elements such as button_plus, exist_button, label1, etc., which are self-explanatory. All our labels and buttons also need to be placed inside the main window.
button_plus starts the main_start function when clicked.
exist_button exits the application when clicked.
label1 and label2 show prompts that guide the user.
enter_username and enter_password are the two elements that get the username and password from the user.
result_accuracy, label_accuracy_result, result_accuracy2, and label_accuracy_result2 show the accuracy of our decision tree algorithm for both the original algorithm and the optimisation.

Next, I use the pack method to position each element inside the main window. If you forget to pack an element, it will not be shown in the application.

Finally, we define a function to execute our application. It starts by checking the username and password, and if they are correct, it connects to the server to get the data and continues the process.
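The overall flow can be sketched with Tkinter; the widget names follow the article, but the `build_gui` helper, the hard-coded credentials, and the stubbed main_start body are assumptions for illustration (the real project checks the stored password hash and then runs the analysis). The window is only created when `build_gui()` is actually called:

```python
def main_start(enter_username, enter_password, result_accuracy):
    """Check credentials, then (in the real project) connect and analyse."""
    # Stub check: the real project compares against a stored password hash.
    if enter_username.get() == "admin" and enter_password.get() == "secret":
        result_accuracy.config(text="Connecting to server...")
    else:
        result_accuracy.config(text="Wrong username or password")


def build_gui():
    # Imported here so the sketch can be read without opening a window.
    import tkinter as tk

    root = tk.Tk()  # the single main window
    label1 = tk.Label(root, text="Username:")
    enter_username = tk.Entry(root)
    label2 = tk.Label(root, text="Password:")
    enter_password = tk.Entry(root, show="*")
    result_accuracy = tk.Label(root, text="")
    button_plus = tk.Button(
        root, text="Start",
        command=lambda: main_start(enter_username, enter_password,
                                   result_accuracy))
    exist_button = tk.Button(root, text="Exit", command=root.destroy)

    # pack() positions each element; unpacked widgets never appear.
    for widget in (label1, enter_username, label2, enter_password,
                   button_plus, result_accuracy, exist_button):
        widget.pack()
    root.mainloop()
```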







