Rule Mining of Early Diabetes Symptoms and Supervised Machine Learning with Cross-Validation Approaches Based on the Most Important Features to Predict Early-Stage Diabetes

Diabetes is one of several illnesses referred to as "chronic". It is a highly prevalent disease that significantly impacts the population. Although diabetes has numerous possible causes, the most common are age, excess body fat, weakness, and rapid weight loss, among other conditions. Diabetes patients are more susceptible to developing a number of complications, including heart disease, kidney problems, nerve damage, blood vessel damage, and blindness. The disease is challenging to diagnose, and its progression is both costly and difficult to anticipate. Machine learning (ML) offers tremendous potential for earlier detection, diagnosis, and treatment of many disorders, which is why medical experts are particularly interested in it. This study aims to develop a model that can reliably and precisely identify diabetes. Association rule mining was employed to find frequent combinations of diabetic symptoms. The study also presents a practical model for diabetes prediction that applies various machine-learning approaches to improve diabetes classification and increase prediction precision. The machine learning methods used for early-stage diabetes prediction include Gaussian Naive Bayes, ExtraTreesClassifier, Decision Trees, K-Nearest Neighbors, Random Forest Classifier, Support Vector Machine, and Logistic Regression. The dataset's most important attributes were chosen by combining the results of six different feature-selection methods. A total of 10 models were then applied to the reduced dataset for early-stage diabetes prediction. Accuracy, precision, recall, and F-measure were used to evaluate the performance of these models.
The performance metrics show that the ExtraTreesClassifier performed best, earning a perfect score in each area, with accuracy, recall, precision, and F1 score of 100%. We can therefore assert that our ExtraTreesClassifier model outperforms previously published work. Clinicians who read this article will gain new knowledge and be better able to identify early diabetes.

Diabetes mellitus, a chronic illness that is on the rise, is caused by the body's inability to metabolize glucose. Based on demographic information and laboratory results from medical visits, one study [14] developed a prediction model with high sensitivity and selectivity to better identify Canadian patients at risk of developing diabetes mellitus. Cardiovascular disease and diabetes are leading causes of death among Americans, and recognizing and preparing for these diseases' patient presentations is the first step in halting their progression. Another study [15] analyzed the ability of machine learning algorithms to identify high-risk individuals from survey data and test outcomes, uncovering the data factors that contribute to the prevalence of these diseases.
In recent years, machine learning has become an increasingly popular method for analyzing medical datasets. The objective of this work is to identify the best machine-learning model for the early detection of diabetes. Among the most important contributions of this study is the conversion of the data for further simulation.

Literature Review
This section reviews research that uses the diabetes dataset and surveys the literature relevant to our study. In reviewing these works, we examined both their research procedures and their findings.
Tripathi et al. [16] note that diabetes disrupts glucose metabolism, that insulin resistance causes complications, and that, left undiagnosed, the disease damages the kidneys, nerves, and eyes. Machine learning, a fast-growing subfield of predictive analysis, is increasingly used in healthcare to detect illnesses and support individualized medicine. Their study applies machine-learning classification to diabetes-related factors to predict diabetes early, improving patient diagnosis and producing clinically useful results. Four ML algorithms were used: LDA, KNN, SVM, and Random Forest (RF). Experiments used the Pima Indian Diabetes Database (PIDD) from UC Irvine's machine learning repository, and the classifiers were evaluated using sensitivity (recall), precision, specificity, F-score, and accuracy. RF achieved the best classification accuracy at 87.66%.
Alaa Khaleel et al. [17] observe that diabetics have high blood sugar levels, which can be deadly, and that early detection lowers the severity and risk of diabetes. Machine learning is becoming increasingly popular in medicine, especially for disease prediction. Their study builds a diabetes diagnosis model and assesses the prediction accuracy of several machine learning (ML) algorithms using precision, recall, and F1-measure. Diabetic symptoms were predicted on the PIDD dataset: LR, NB, and KNN achieved 94%, 79%, and 69% accuracy, respectively, so LR predicts diabetes best.
Zou et al. [18] note that diabetes produces dangerous hyperglycemia and is projected to affect 642 million people by 2040, with roughly one in ten people developing the disease. Machine learning is widely employed in medicine and public health; their decision trees, random forests, and neural networks predicted diabetes from physical examination data of hospital patients in Luzhou, China, with 14 attributes. The experiment cross-validated five models to evaluate their viability. Training used 68,994 healthy and diabetic participants; because the data were unbalanced, data were randomly extracted five times and the five test results averaged. Minimum redundancy maximum relevance (mRMR) and PCA were used to reduce the study's dimensionality. Random forest gave the best overall prediction (ACC = 0.8084).
Tigga et al. [19] report that India has approximately 30 million diabetics, with many more at risk, so early diagnosis and treatment are essential to avoid diabetes and its complications. Their study estimates diabetes risk from lifestyle and family history. Machine learning algorithms predicted type 2 diabetes risk with the precision that medical professionals require; once the model is trained, individuals can estimate their own diabetes risk. For the trial, 952 participants completed an online and offline questionnaire of 18 questions covering family history, lifestyle, and health. The same methods were also applied to the Pima Indian Diabetes database. For both datasets, the Random Forest Classifier performed best.
Sisodia et al. [20] note that diabetes elevates blood glucose and, if undiagnosed, can create several complications. Because identification takes so long, the patient must repeatedly visit a diagnostic facility; machine learning addresses this problem. Their study seeks a model that reliably predicts diabetes risk, detecting early diabetes using decision trees, SVMs, and naive Bayes on the Pima Indians Diabetes Database (PIDD) from the UC Irvine machine learning repository. Each method is assessed using recall, F-measure, precision, and accuracy, with accuracy measured over correctly classified instances. Naive Bayes' accuracy of 76.30% beats the others, as verified by ROC curves.
Ramesh et al. [21] note that millions of people have diabetes, which worsens organ failure and quality of life, so diabetics need early detection and monitoring; remote patient monitoring facilitates treatment. Their study proposes an end-to-end remote monitoring system for automated diabetes risk prediction and management, powered by smartphones, smart wearables, and health devices. A Support Vector Machine predicted diabetes risk on the Pima Indian Diabetes Database after scaling, imputation, feature selection, and augmentation. Tenfold stratified cross-validation produced consistent results of 83.20% accuracy, 87.20% sensitivity, and 79% specificity. Smartphones and smartwatches measure vitals, help slow diabetes progression, and connect patients with doctors. The unobtrusive, economical, and vendor-interoperable platform aids doctors' decisions using the most recent diabetes risk projections and lifestyle data.
Perveen et al. [22] note that interventional programs can save time and money by targeting high-risk diabetics. The Framingham Diabetes Risk Scoring Model (FDRSM), a trusted prognostic model, was tested using a Hidden Markov Model (HMM), a machine learning method, to project 8-year diabetes risk; no prior HMM study had verified FDRSM performance. The HMM assessed 8-year diabetes risk from the electronic medical records of 172,168 primary care patients. Their 911-person sample with all risk factors and follow-up data exhibited an AROC of 86.9%, higher than the 78.6% and 85% reported in a previous FDRSM validation analysis conducted on the same Canadian population and in the Framingham study, respectively. The suggested HMM therefore discriminates better than the Canadian and Framingham FDRSM validation research, and 8-year diabetes risk can be determined by HMM.
Maniruzzaman et al. [23] note that diabetes causes high blood sugar and can lead to heart attack, kidney failure, stroke, and other serious illnesses; 422 million people had diabetes in 2014, and 642 million are projected by 2040. Their project creates an ML-based diabetes diagnosis system.
Kavakiotis et al. [24] observe that biotechnology and health research have expanded high-throughput clinical and genetic data from massive electronic health records (EHRs), and that the biosciences must use data mining and machine learning to evaluate these data. Diabetes mellitus impacts world health, and studies cover its management, etiopathophysiology, and other topics. Their survey examines machine learning and data mining methods for diabetes research, covering prediction, diagnosis, complications, genetic background and environment, healthcare, and treatment. Supervised methods accounted for 85% of the approaches surveyed, with association rules among the remainder; SVMs were the most popular algorithm, and clinical data dominated. The selected publications show that extracting vital knowledge generates new ideas that improve the understanding of, and research into, diabetes mellitus.
Hasan et al. [25] note that diabetes elevates glucose and that early detection reduces risk, but that outliers and unlabeled data complicate diabetes prediction. Their robust diabetes prediction framework combined outlier rejection, data standardization, feature selection, K-fold cross-validation, several machine learning (ML) classifiers (k-nearest neighbor, decision trees, random forest, AdaBoost, naive Bayes, and XGBoost), and a Multilayer Perceptron (MLP), with ensemble weights derived from the area under the ROC curve (AUC). Grid search maximized the AUC during hyperparameter adjustment. The study used the Pima Indian Diabetes Dataset under identical experimental parameters. Their recommended ensembling classifier was the most successful in exhaustive testing, with an AUC of 0.950, a sensitivity of 0.789, a specificity of 0.934, a false omission rate of 0.092, and a diagnostic odds ratio of 66.234, improving on earlier systems by 2.0% in AUC. The same dataset may be used to further improve diabetes prediction systems and diabetic prognosis.
Yahyaoui et al. [26] note that decision support systems (DSS) help doctors and nurses make clinical decisions, a need driven by escalating deadly diseases. Diabetes kills globally; elevated blood sugar may damage other organs, and the International Diabetes Federation (IDF) projects 592 million cases worldwide by 2035. Their research proposes a machine-learning-based diabetes prediction DSS, comparing conventional machine learning with deep learning: the popular SVM and Random Forest (RF) classifiers versus a fully convolutional neural network (CNN). To assess the proposed strategy, 768 samples with 8 characteristics were used from the public Pima Indians Diabetes database, of which 500 were non-diabetic and 268 diabetic. RF achieved 83.67% accuracy, SVM 76.81%, and deep learning 65.38%; RF thus outperformed deep learning and SVM in diabetes prediction.
Sonar et al. [27] note that diabetes can be fatal, causing blindness, urinary system problems, coronary heart disease, and more. After a consultation, the patient must travel to a diagnostic center for their reports, which takes time and money; machine learning can now solve this. Polygenic disease is diagnosed using cutting-edge information processing, and anticipating illness allows for critical care. Their study improves diabetic risk prediction by characterizing models built with SVM algorithms, naive Bayes networks, decision trees, and artificial neural networks (ANN). The decision tree, naive Bayes, and support vector machine models achieved estimated accuracies of 85%, 77%, and 77.3%, respectively.
Sivaranjani et al. [28] note that diabetes is one of the most widespread and deadly diseases in the world, including India; lifestyle, genetics, stress, and age can cause diabetes at any age, and untreated diabetes, regardless of cause, can have catastrophic consequences. In the suggested work, the researchers employed SVM and Random Forest (RF) machine learning techniques to estimate diabetes risk. After data preparation, step-forward and step-backward feature selection identified the predictive attributes, and PCA dimensionality reduction was also studied. Random Forest achieved 83% prediction accuracy, compared to SVM's 81.4%.
Saha et al. [29] note that diabetes, a prevalent condition triggered by rising blood sugar, can strike at any age, making prediction crucial. Many techniques have been applied to the Pima Indian Dataset, which originates from a 1965 study of Pima Indian women. While most academics apply complex methods to the dataset, much in-depth research lacks simple strategies. Their study included RF, SVM, and Neural Networks (NN), applied in several ways with additional methods on the main dataset. After identifying diabetics using preprocessing methods, they compared the approaches; the Neural Network was the most accurate (80.4%).
Posonia et al. [30] note that diabetes mellitus, which can cause severe birth abnormalities, affects many pregnant women in India. Several cutting-edge blood test technologies can detect diabetes, which results from elevated blood glucose; untreated, it can cause renal damage and heart attacks. Discovering and studying gestational diabetes therefore requires learning models and rigorous research. Their study proposed diabetes prediction using machine learning with the J48 decision tree, one of the best classification models. The 8 major characteristics of 768 patients, together with a target column of favorable or unfavorable outcomes, were evaluated. Their Weka experiment showed that the J48 decision tree is effective and fast.
Pavani et al. [31] note that today's healthcare uses AI and ML; the WHO reports that diabetes, induced by high glucose levels, affects a vast number of individuals worldwide, and its diagnosis may involve other factors. Their research develops a diabetes-prediction system, employing ML methods for early prediction, including support vector machines, logistic regression, decision trees, random forests, gradient boosting, K-nearest neighbors, and naive Bayes. The algorithms are evaluated using precision, accuracy, recall, and F-measure, and the approaches are compared to improve precision. The naive Bayes and Random Forest algorithms achieved 80% accuracy.
Having established this background, we now turn to the methodology applied in this research; the section that follows presents it in detail.

Methodology
This study was completed in ten distinct primary sections. The "Data Collection" section presents and discusses the dataset's description, and the dataset's provenance has also been carefully reviewed. In the "Data Conversion" section, string data is converted into numerical data. The "Data Preprocessing" section applies the necessary data preprocessing techniques. In the section titled "Important Feature Selection", six different models are used to determine which features are most advantageous. In the "Train Test Split" section, the dataset is divided into a train set and a test set so that the experiment can be run on both. The "Applied Model" section lists all 10 models used to evaluate the dataset and forecast the chance of developing early diabetes. The "Result Analysis" section discusses the model whose performance was determined to be the best overall after all models were evaluated, and the "Rule Mining" section briefly describes how items in the dataset frequently associate with one another. Finally, the results of the best-performing model are examined and contrasted with previously published work on forecasting early diabetes with machine learning. The systematic method followed in this inquiry is shown in Figure 1.

Figure 1
Ten main sections were required for this study. In the "Data Collection" section, a description of the dataset is displayed and discussed, and the history of the dataset is examined. In "Data Conversion," string data is transformed into numerical data. Data preprocessing techniques are covered in "Data Preprocessing". Six models were employed in the "Important Feature Selection" section to choose the top features. The dataset is split into a train set and a test set in the "Train Test Split" section so that the experiment can be run on both. Ten models were used in the "Applied Model" section to forecast early diabetes. The evaluation of all models is followed by a discussion of the best-performing model in "Result Analysis", and frequent associations in the dataset are covered in "Rule Mining". Finally, the best-performing model is compared with past studies on predicting early diabetes with machine learning.
The working process of this study begins with the Dataset section, which discusses the dataset's attributes as follows.

A. Dataset
This dataset details the signs and symptoms of individuals who have just been diagnosed with diabetes or who are at risk of developing the disease. The data were gathered via direct questionnaires completed by patients at the Sylhet Diabetes Hospital in Sylhet, Bangladesh, and a medical expert approved the project before it was carried out. The dataset covers 520 patients and 16 characteristics plus a target variable defined as class; one attribute is continuous and the remaining fifteen are categorical. The Early Stage Diabetes Risk Prediction Dataset (ESDRPD) was collected from Kaggle [32]. The dataset's executive summary is displayed in Table 1. A statistical analysis of the dataset follows in the next section to aid understanding of the data.

B. Statistical Data Analysis
As mentioned earlier, the diabetes dataset consists of 520 instances and 16 features plus the class label. One attribute is continuous and the remaining fifteen are categorical. The medically relevant characteristics are explained below.
(1) Age
This attribute records the age of the individual.

(2) Gender
This attribute records the gender of the participating individual [33]. There are 328 men, 63.08% of the total population, and 192 women, 36.92%. Among diabetes-positive cases, females account for 54.06% and males for 45.94%; among negative cases, females account for 9.50% and males for 90.50%.

(3) Polyuria
Polyuria is a disorder in which a person urinates more frequently than normal and passes excessive or unusually large amounts of urine each time [34]. Polyuria is frequently defined as passing more than 3 liters of urine per day, in contrast to the typical adult output of 1 to 2 liters. This feature records whether or not the individual had an issue with excessive urination. Distribution of Polyuria: Positive (Yes = 75.94%, No = 24.06%) and Negative (Yes = 7.50%, No = 92.50%).

(4) Polydipsia
The medical term for increased thirst is polydipsia: a persistent, abnormal drive to drink fluids [34], in response to the body losing fluid. This feature records whether or not the participant drank excessively or experienced excessive thirst. Polydipsia, or excessive thirst, is one of the main early indicators of diabetes [35]. Distribution of Polydipsia: Positive (Yes = 70.31%, No = 29.69%) and Negative (Yes = 4.00%, No = 96.00%).

(5) Sudden Weight Loss
When a person loses a considerable amount of weight without making any changes to their eating habits or exercise routines, this is considered to be unexplained weight loss. Those with type 2 diabetes are not immune to it, although type 1 diabetics are more likely to experience it. Both a positive (Yes = 58.75%, No = 41.25%) and negative (Yes = 14.50%, No = 85.50%) distribution can be seen with abrupt weight loss.

(6) Weakness
This characteristic records whether or not the individual ever experienced a period in which they felt helpless or incapable of doing something. On the subject of weakness, the responses are split between Positive (Yes = 68.12%, No = 31.87%) and Negative (Yes = 43.50%, No = 56.50%).

(7) Polyphagia
Polyphagia, also referred to as hyperphagia, is characterized by an overwhelming and unquenchable need to consume food. Throughout the study, this trait recorded whether or not a subject ever experienced excessive or intense hunger. Polyphagia is marked by an abnormally high level of hunger that leads to a considerable and continuing increase in appetite; it is a primary indicator of diabetes [36], as well as one of its main symptoms. Distribution of Polyphagia: Positive (Yes = 59.06%, No = 40.94%) and Negative (Yes = 23.50%, No = 76.50%).

(8) Genital Thrush
This quality reveals whether or not the individual was suffering from a yeast infection during the course of the investigation. Thrush is the medical term for an infection caused by the yeast Candida albicans [36], which is able to thrive in environments where a significant amount of sugar is present.

(9) Blurred Vision
Blurred vision makes it impossible to read small print. Cloudiness of vision is usually brought on by fluctuations in blood sugar [37]. A number of eye conditions, such as nearsightedness or farsightedness, which weaken the eye's capacity to focus, can also bring about blurred vision. This feature records whether or not the participant experienced a period of obstructed vision. The two categories of responses for times when participants experienced foggy vision were Positive (Yes = 54.69%, No = 45.31%) and Negative (Yes = 29.00%, No = 71.00%).

(10) Itching
Whether or not the individual had an episode of itching is recorded by this feature. The two different participation categories are known as "Positive" (Yes = 48.12% and No = 51.88%) and "Negative" (Yes = 49.50% and No = 50.50%) respectively.

(11) Irritability
This feature determines whether or not the individual had a fit of irritation at any point during their participation [37]. The two distinct groups of people who took part in the study are referred to as "Positive" (where Yes = 34.38% and No = 65.62%) and "Negative" (where Yes = 8% and No = 92.00%) accordingly.

(12) Delayed Healing
This feature records whether or not the subject experienced slowed healing after being injured [38]. Distribution of delayed healing: Positive (Yes = 47.81%, No = 52.19%) and Negative (Yes = 43.00%, No = 57.00%).

(13) Partial Paresis
The disease known as paresis is characterized by a reduction in the patient's ability to move voluntarily [39]. It is possible for it to be a symptom of diabetes. Positive (Yes = 60%, No = 40%), and Negative (Yes = 16%, and No = 84%) are the proportions of people who had, respectively, had an episode of muscular weakness.

(14) Muscle Stiffness
Stiffness in the muscles is characterized by a feeling of constriction in the affected area, which frequently results in discomfort and makes it difficult to move. Muscle stiffness can be brought on by misuse of a particular muscle, or it might be an early warning sign of an underlying health problem. If a person experienced a period of muscle stiffness, this characteristic records it. Below is a breakdown of the participants' proportion who reported having a case of muscle stiffness: Positive (Yes = 42.19%, No = 57.81%) and Negative (Yes = 30%, No = 70%).

(15) Alopecia
Diabetes patients have a higher chance of acquiring alopecia areata. Any area of the body that has alopecia will experience hair loss. This factor determines whether or not the individual had hair loss during their time inside the study. Those who experienced hair loss make up a total of Positive (Yes = 24.38%, No = 75.62%) and Negative (Yes = 50.50%, No = 49.50%).

(16) Obesity
This attribute determines whether or not the individual is deemed obese. Distribution of obesity: Positive (Yes = 19.06%, No = 80.94%) and Negative (Yes = 13.50%, No = 86.50%).

(17) Class
This attribute indicates whether or not a person has type 2 diabetes. 62% of the individuals had type 2 diabetes, the most prevalent form of the illness.
With the exception of age, which is numeric, all the characteristics are nominal. The distributions of all variables against the target are shown in Figure 2 (a) to (q).

Features Correlation
Feature correlation with diabetes is critical for early diabetes prediction. In our data, on a scale from -1 to 1, the correlations with the target are as follows: age 0.10, gender -0.44, polyuria 0.66, polydipsia 0.64, sudden weight loss 0.43, weakness 0.24, polyphagia 0.34, genital thrush 0.11, and visual blurring 0. Figure 3 illustrates the correlations among all features.

Figure 3: Illustration of all Features Correlation
Association rule mining will now be performed on the dataset. Before that step, the string data must first be converted into numerical values.
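As a hedged illustration of the association-rule step, the sketch below computes support and confidence for a single candidate rule on a tiny hypothetical stand-in for the symptom data. The rows and the chosen rule are assumptions for illustration only, not results from this study.

```python
# Sketch: support and confidence of one candidate rule on toy binary data.
# Column names mirror the dataset; the values below are made up.
import pandas as pd

df = pd.DataFrame({
    "Polyuria":   [1, 1, 0, 1, 0, 1, 0, 0],
    "Polydipsia": [1, 1, 0, 1, 0, 0, 0, 1],
    "class":      [1, 1, 0, 1, 0, 1, 0, 0],
})

def rule_metrics(frame, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent,
    where both arguments are lists of binary column names."""
    has_ante = frame[antecedent].all(axis=1)
    has_both = has_ante & frame[consequent].all(axis=1)
    support = has_both.mean()                     # P(antecedent AND consequent)
    confidence = has_both.sum() / has_ante.sum()  # P(consequent | antecedent)
    return support, confidence

sup, conf = rule_metrics(df, ["Polyuria", "Polydipsia"], ["class"])
print(sup, conf)  # 0.375 1.0 on this toy table
```

In practice an Apriori-style search would enumerate many candidate itemsets, but the per-rule metrics are exactly the two quantities computed here.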

C. Data Preprocessing
In this section, data preprocessing methods have been employed in preparation for the subsequent simulation predicting the early onset of diabetes: converting the target "Class" values to their corresponding numerical values, i.e., changing "Positive" to "1" and "Negative" to "0"; separating the target (Class) feature from the remaining 16 characteristics and storing them; and applying data normalization to the continuous feature "Age". The next step, determining the most important features, uses six distinct methods, as described in the following section.
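The preprocessing steps above can be sketched as follows. This is a minimal illustration on a three-row hypothetical frame, assuming Yes/No and Positive/Negative string answers and min-max normalization for Age; it is not the paper's exact pipeline.

```python
# Sketch of the described preprocessing: map strings to 0/1, split off the
# target, and min-max normalize the continuous Age feature. Toy data only.
import pandas as pd

df = pd.DataFrame({
    "Age":      [40, 58, 30],
    "Polyuria": ["Yes", "No", "Yes"],
    "class":    ["Positive", "Negative", "Positive"],
})

df["class"] = df["class"].map({"Positive": 1, "Negative": 0})
df["Polyuria"] = df["Polyuria"].map({"Yes": 1, "No": 0})

y = df.pop("class")   # target stored separately
X = df                # remaining feature columns

# min-max normalization of the single continuous feature
X["Age"] = (X["Age"] - X["Age"].min()) / (X["Age"].max() - X["Age"].min())
print(y.tolist(), X["Age"].tolist())
```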

(1) Important Feature Selection

(a) Pearson
The Pearson correlation method evaluates the linear relationship between two variables and generates a value ranging from -1 to 1 that shows the degree to which they are related. These values are used to construct a correlation matrix. By computing the correlation between each feature and the target variable, this method determines the degree to which each feature and the target are interdependent. When this stage is finished, the next step is to identify the attributes with the greatest influence on the target variable.
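A minimal sketch of this selection step, assuming the features are already numeric 0/1: correlate each feature with the target and rank by absolute Pearson r. The data below are synthetic; the column names only mirror the dataset.

```python
# Sketch: Pearson correlation of each feature with the target, ranked by |r|.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, 200), name="class")

noise = rng.random(200) < 0.1
X = pd.DataFrame({
    "Polyuria": np.where(noise, 1 - y, y),  # copies y, flipped ~10% of the time
    "Itching":  rng.integers(0, 2, 200),    # unrelated noise feature
})

corr = X.apply(lambda col: col.corr(y))     # Pearson r per feature
ranked = corr.abs().sort_values(ascending=False)
print(ranked.index.tolist())                # the related feature ranks first
```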

(b) Chi-2
The chi-2 test is a statistical analysis that compares an investigation's actual results to what was anticipated. The test assesses whether a disparity between actual and expected data can be attributed to random variation or to a relationship between the variables under study. For the selected dataset, the k value (the number of features retained) is set to 10.
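This step can be sketched with scikit-learn's `SelectKBest` and the `chi2` score function. The paper sets k = 10; the toy example below uses k = 2 on four synthetic binary features so that it runs standalone.

```python
# Sketch: chi-squared feature selection with SelectKBest on toy binary data.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 300)
X = np.column_stack([
    y,                                   # perfectly informative feature
    (y + (rng.random(300) < 0.2)) % 2,   # mostly informative (20% flipped)
    rng.integers(0, 2, 300),             # noise
    rng.integers(0, 2, 300),             # noise
])

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
selected = sorted(selector.get_support(indices=True).tolist())
print(selected)  # the two informative columns
```

Note that `chi2` requires non-negative feature values, which binary symptom indicators satisfy.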

(c) Recursive Feature Elimination
Recursive Feature Elimination (RFE) selects features for a model by gradually removing the weakest features until the required number of features remains. In this study the parameters were set as follows: estimator = LogisticRegression(), n_features_to_select = 100, step = 10, and verbose = 5.
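A hedged sketch of RFE with the quoted estimator and step. Because the toy data below has only six features, `n_features_to_select` is capped at 3 here rather than the paper's 100 (RFE cannot select more features than exist); the data and target are assumptions for illustration.

```python
# Sketch: RFE with a LogisticRegression estimator on synthetic data where
# only the first two features drive the label.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.random((200, 6))
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # only features 0 and 1 matter

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, step=10)
rfe.fit(X, y)
print(rfe.support_.tolist())  # mask of kept features; 0 and 1 survive
```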

(d) Logistic Regression L1 (LR L1)
Within machine learning, L1-regularized logistic regression is standard practice. The method applies to a wide range of classification problems, in particular those involving a considerable number of distinct attributes, and requires solving a convex optimization problem. Here the parameters were set to penalty = l2 and threshold = 1.25 × median.
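A minimal sketch of this model-based selection, assuming it is implemented with scikit-learn's `SelectFromModel` using the quoted parameters (penalty = "l2", threshold = 1.25 × the median coefficient magnitude). Data and labels below are synthetic assumptions.

```python
# Sketch: SelectFromModel over a logistic regression; features whose
# |coefficient| exceeds 1.25 x the median magnitude are kept.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.random((300, 5))
# only features 0 and 1 influence the label
y = (2 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(300) > 0).astype(int)

sfm = SelectFromModel(LogisticRegression(penalty="l2"),
                      threshold="1.25*median")
sfm.fit(X, y)
print(sfm.get_support().tolist())  # the two informative features are kept
```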

(e) Random Forest (RF)
A random forest builds each of its decision trees from a random selection of the features and observations in the dataset; a forest commonly contains on the order of hundreds to thousands of trees. Here the parameters were set to n_estimators = 100 and threshold = 1.25 × median.
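The same `SelectFromModel` pattern can be sketched with the quoted random-forest parameters (n_estimators = 100, threshold = 1.25 × the median feature importance). Again, the data are a synthetic assumption, with a single informative feature.

```python
# Sketch: SelectFromModel over a random forest; features whose impurity-based
# importance exceeds 1.25 x the median importance are kept.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(4)
X = rng.random((300, 5))
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 is informative

sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="1.25*median",
)
sfm.fit(X, y)
print(sfm.get_support(indices=True).tolist())  # feature 0 is among those kept
```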

(f) LightGBM
LightGBM is a gradient boosting framework based on tree learning. Unlike most tree-growing algorithms, which grow trees level-wise (horizontally), LightGBM grows them leaf-wise (vertically). The parameters are set to n_estimators = 500, learning_rate = 0.05, num_leaves = 32, colsample_bytree = 0.2, reg_alpha = 3, reg_lambda = 1, min_split_gain = 0.01, and min_child_weight = 40.

As shown in Table 2, the ten features with a total vote of four or more across the selection methods were chosen as essential features to be fed to the models for early diabetes prediction. These features are Polyuria, Polydipsia, Gender, weakness, visual blurring, sudden weight loss, partial paresis, Itching, Irritability, and Age.
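The vote-based aggregation implied by Table 2 can be sketched as follows: each of the six selection methods marks the features it keeps, and features with a total of four or more votes become the final input set. The vote counts below are illustrative placeholders, not the study's actual tallies.

```python
# Hypothetical vote totals across the six feature-selection methods.
feature_votes = {
    'Polyuria': 6, 'Polydipsia': 6, 'Gender': 5, 'weakness': 5,
    'visual blurring': 4, 'sudden weight loss': 4, 'partial paresis': 4,
    'Itching': 4, 'Irritability': 4, 'Age': 4,
    'Obesity': 2, 'Alopecia': 1,
}

# Keep every feature that received four or more votes.
selected = [name for name, votes in feature_votes.items() if votes >= 4]
print(selected)   # the ten features carried forward to the models
```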

(2) Train Test Split
To apply the models and predict early diabetes, the 10 selected significant features were divided into a train set and a test set: 80% of the dataset was used for training and 20% for testing. A summary of the train-test split is shown in Table 3.
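A minimal sketch of the 80/20 hold-out split with scikit-learn's `train_test_split`; the 520-record synthetic array is a stand-in for the real data over the ten selected features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(520, 10))   # 520 records, 10 selected features
y = rng.integers(0, 2, size=520)

# Reserve 20% of the records for testing; the rest are for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))   # 416 104
```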

(3) K-Fold Cross Validation
K-Fold cross-validation minimizes the drawbacks of the hold-out method and avoids its "test just once" bottleneck by offering a fresh approach to dataset segmentation:
1. Choose k, the number of folds (if possible, the dataset should be split into k equal parts).
2. Use k−1 folds as the training set; the remaining fold serves as the test set.
3. Train the model on the training set.
4. For each iteration of cross-validation, train a new model independently of the model trained in the previous iteration.
5. Validate the model on the test set.
6. Record the validation result.
7. Repeat steps 3-6 k times.
In this analysis, k was set to 8 for each of the 10 models.
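The procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score`, using k = 8 as in the study; the classifier and data below are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(160, 10))
y = rng.integers(0, 2, size=160)

# 8-fold CV: each fold is held out once while a fresh model is trained
# on the remaining 7 folds; one accuracy score is recorded per fold.
cv = KFold(n_splits=8, shuffle=True, random_state=0)
scores = cross_val_score(ExtraTreesClassifier(random_state=0), X, y, cv=cv)

print(len(scores), scores.mean())   # 8 fold scores and their average
```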

D. Applied Models
(1) Decision Tree (DT)
A decision tree is a decision support tool that uses a tree-like model to represent decisions and their likely consequences, such as chance-event outcomes, resource costs, and resource utility. It can be used to display an algorithm consisting entirely of conditional control statements. Decision trees, or more precisely decision analysis, are a common tool in operations research for identifying the strategy with the highest likelihood of success, and they are widely used in machine learning. In this case the criterion is set to 'gini'.

(2) Random Forest Classifier (RFC)
A random forest is a classification technique composed of numerous independent decision trees. Using bagging and feature randomness when generating each tree, it attempts to produce an uncorrelated forest of trees whose committee forecast is more accurate than that of any individual tree. The parameters are set as follows: criterion = 'gini' and n_estimators = 100.

(3) Support Vector Machine (SVM)
The support vector machine (SVM) is a widely used and flexible supervised machine learning technique that can handle both classification and regression tasks; here the focus is on classification. It is typically considered well suited to small and medium-sized datasets. The main goal of the SVM is to find the optimal hyperplane that linearly separates the data points into two classes while maximizing the margin. The kernel is set to 'linear' and random_state to 0 for better performance.

(4) XGBoost Classifier (XGBC)
XGBoost is a widely used open-source implementation of the gradient boosted trees technique. Gradient boosting, a supervised learning method, combines the forecasts of a number of weaker and simpler models to produce an accurate forecast of a target variable. When gradient boosting is used for regression, regression trees act as the weak learners; in each tree, every input data point maps to a leaf that stores a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function built from a convex loss on the difference between the predicted and target outputs plus a penalty term for model complexity (the regression tree functions). Training proceeds iteratively, adding new trees that predict the residuals or errors of the prior trees; these new trees are then combined with the existing trees to form the final prediction. The technique is called gradient boosting because it reduces the loss as new models are added. The parameters are set as follows: objective = reg:linear, colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, and n_estimators = 10.

(5) K-Nearest Neighbor (KNN)
The KNN algorithm, a supervised machine learning technique, is straightforward and effective for both classification and regression problems. Although it is easy to build and understand, it has one major drawback: it becomes substantially slower as the amount of data grows. Here n_neighbors is varied from 1 to 10, metric is set to 'minkowski', and p is set to 2.
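A sketch of the KNN configuration described above, sweeping n_neighbors from 1 to 10 with the Minkowski metric and p = 2 (i.e. Euclidean distance); the data is synthetic and the scores below are training accuracies, shown only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# Fit one KNN model per neighborhood size k = 1..10.
accuracies = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(X, y)
    accuracies[k] = knn.score(X, y)   # training accuracy

# With k=1 and distinct points, each sample is its own nearest
# neighbor, so training accuracy is exactly 1.0.
print(accuracies[1])
```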

(6) Gaussian Naïve Bayes (GNB)
Gaussian Naive Bayes is a probabilistic classification method based on Bayes' theorem and strong independence assumptions. When working with continuous data, a typical approach is to assume that the continuous values associated with each class follow a normal (Gaussian) distribution; in the Gaussian Naive Bayes approach, each continuous-valued feature is assumed to independently follow such a distribution.

(7) AdaBoost Classifier (AdaBC)
An AdaBoost [34] classifier is a meta-estimator that first fits a classifier on the initial dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on challenging cases. This process is repeated until the required level of classification accuracy is achieved.

(8) Logistic Regression (LR)
The appropriate regression analysis to use when the dependent variable is dichotomous (binary) is logistic regression. Like all regression analyses, logistic regression is a predictive analysis: it is used to describe the data and to explain the association between one dependent binary variable and one or more independent nominal, ordinal, interval, or ratio-level variables. Here random_state is set to 0 and penalty is set to 'l2'.

(9) Gradient Boosting Classifier (GBC)
Gradient boosting offers a prediction model in the form of an ensemble of simple prediction models, most commonly decision trees. Individually, models of this type are typically considered weak learners [40], [41]. When a decision tree is the weak learner, the resulting technique is known as gradient-boosted trees, which frequently outperforms random forest [40], [41], [42]. It is built in a stage-by-stage fashion like other boosting techniques, but it generalizes them by enabling the optimization of any differentiable loss function.

(10) ExtraTrees Classifier (ETC)
ExtraTrees, short for extremely randomized trees (also called extra trees), is an ensemble supervised machine learning method built on decision trees. The parameters are set as follows: n_estimators = 100 and random_state = 0.
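The scikit-learn models in this section can be tied together in one sketch, using the parameters reported above; XGBoost and LightGBM are omitted here because they are external packages, the KNN neighborhood size is fixed at an illustrative k = 5, and the data is a synthetic stand-in.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One instance per applied model, configured as reported in the text.
models = {
    'DT':    DecisionTreeClassifier(criterion='gini'),
    'RFC':   RandomForestClassifier(criterion='gini', n_estimators=100),
    'SVM':   SVC(kernel='linear', random_state=0),
    'KNN':   KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'GNB':   GaussianNB(),
    'AdaBC': AdaBoostClassifier(),
    'LR':    LogisticRegression(random_state=0, penalty='l2', max_iter=1000),
    'ETC':   ExtraTreesClassifier(n_estimators=100, random_state=0),
}

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(120, 10))
y = rng.integers(0, 2, size=120)

# Fit each model and record its training accuracy.
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(sorted(scores))
```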

E. Performance Metrics
In this study, we use four very common performance metrics: Accuracy, Precision, Recall, and F1-Score, defined in Eqs. (1)-(4):
Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)
Precision = TP / (TP + FP) (2)
Recall = TP / (TP + FN) (3)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) (4)
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively [43].
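The four metrics written out from confusion-matrix counts; the counts below are illustrative, not results from the study.

```python
# Illustrative confusion-matrix counts.
TP, TN, FP, FN = 60, 32, 5, 7

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # Eq. (1)
precision = TP / (TP + FP)                           # Eq. (2)
recall    = TP / (TP + FN)                           # Eq. (3)
f1_score  = 2 * precision * recall / (precision + recall)  # Eq. (4)

print(round(accuracy, 4), round(precision, 4),
      round(recall, 4), round(f1_score, 4))
```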

Results Analysis A. Association Rule Mining (Unsupervised Technique)
Since all categorical features were specified as "Yes" or "No", the applymap function was used to replace "Yes" with 1 and "No" with 0, leaving the dataset ready for association rule mining. Association rule mining identifies interesting connections and correlations among large numbers of data items; a rule reveals how frequently a particular itemset occurs in the dataset. The following metrics are used to measure association: (a) Support: the support of a rule is the prior probability of P and Q [44]: Sup(P → Q) = freq(P ∪ Q) / n, where "Sup" denotes the support and "n" is the total number of transactions.
(b) Confidence: the conditional likelihood that the consequent occurs given the antecedent. The rule's confidence constraint is the conditional probability of Q given P [45]: Conf(P → Q) = Sup(P ∪ Q) / Sup(P).
(c) Lift: the lift value determines a rule's importance, and the rule's filters may be used to specify a lift range. Lift is computed by dividing the rule's actual confidence by its expected confidence [46]: Lift(P → Q) = Conf(P → Q) / Sup(Q).
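The three metrics can be computed by hand for a single candidate rule P → Q over toy one-hot transactions, as a sketch of what an apriori implementation does internally; the transactions below are invented for illustration.

```python
# Toy transactions: each set lists the items (symptoms) present.
transactions = [
    {'Polyuria', 'Polydipsia', 'class'},
    {'Polyuria', 'class'},
    {'Polydipsia'},
    {'Polyuria', 'Polydipsia', 'class'},
    {'class'},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

P, Q = {'Polyuria'}, {'class'}
sup = support(P | Q)         # prior probability of P and Q together
conf = sup / support(P)      # conditional probability of Q given P
lift = conf / support(Q)     # confidence relative to Q's base rate

print(sup, conf, lift)
```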
The rules were applied to the model for diabetes prediction considering all features, with diabetes as the ultimate consequent. The apriori technique was used to mine association rules from our dataset; this study analyzes the association rules among the features both to explore an unsupervised technique on the selected dataset and to open the door to future work. The minimum support was set to 0.1 and the minimum threshold to 0.7. The apriori technique generated a total of 1,150 rules for the dataset; the top 50 rules are displayed in Table 4.

In Figure 4, accuracy is shown in light blue, cross-validation accuracy in teal blue, precision in ocean blue, recall in blue, and the F1 score in deep blue. Figure 4 demonstrates that the best-performing model is the ExtraTreesClassifier, with 100% accuracy, 97.60% cross-validation accuracy, 100% precision, 100% recall, and a 100% F1 score. Random Forest achieved nearly the same performance, with 100% accuracy, 97.36% cross-validation accuracy, 100% precision, 100% recall, and a 100% F1 score. The overall performance of all models is summarized in Table 5. The following part briefly discusses the confusion matrix and the ROC for each applied model. Figure 5 presents the confusion matrices describing the predicted results of the experiment; the ROC-AUC assessments of model performance are also shown in this study.
The receiver operating characteristic (ROC) curve plots the true positive rate on the ordinate against the false positive rate on the abscissa, combining the threshold values of many operating points into a single curve. The area under the ROC curve (AUC) measures the likelihood that a randomly chosen positive sample receives a higher computed score than a randomly chosen negative sample, allowing the strengths and weaknesses of the prediction model to be examined. Figure 7, which depicts the ROC curves for the experiment, reveals that the average AUC value for our ExtraTreesClassifier model is 100%; Figure 6 shows the ROC curve for the Logistic Regression model. In the following section, we compare the existing work with our proposed model.

Compare with the Existing Work
Tripathi and colleagues [13] used a Boosted Regression model and achieved an accuracy of 90.91%. Alaa Khaleel et al. [18] achieved an accuracy of 86.9%. Kavakiotis et al. [20] used the Support Vector Machine and achieved an accuracy of 85%. Yahyaoui et al. [22] used SVM, Random Forest (RF), and deep learning (DL), reaching accuracies of 83.67%, 76.81%, and 65.38%, respectively. Sonar and colleagues [23] employed the Decision Tree, Naive Bayes, and Support Vector Machine models, achieving accuracies of 85%, 77%, and 77.3%, respectively. Sivaranjani et al. [24] used Random Forest (RF) and Support Vector Machines (SVM) to reach accuracies of 83.3% and 81.4%, respectively. Saha et al. [25] used a neural network and achieved an accuracy of 80.4%. Pavani et al. [27] used the Random Forest method in conjunction with Naive Bayes, with both techniques reaching an accuracy of 80%. In contrast, our proposed ExtraTreesClassifier model obtained an accuracy of 100%, performing substantially better than any of the earlier efforts. A summarized comparison of the early-stage diabetes risk prediction models' performance is shown in Table 6.

Conclusions
Data preparation steps such as converting and normalizing the data were performed to accurately represent the dataset used in this investigation. In addition, association rule mining was used to identify the typical manifestations of diabetic symptoms, which also opens a new door for future work. Six distinct methods were then used to determine the dataset's primary attributes, and a total of ten distinct models were applied to the selected, highlighted dataset. We compared the outcomes produced by each model to determine which one gave the best overall results. As shown by the performance metrics, the ExtraTreesClassifier attained the best possible level of performance, achieving a perfect score in every category (100% accuracy, 100% recall, 100% precision, and 100% F1). We can state that our ExtraTreesClassifier model even outperforms the currently available work. This study offers clinical physicians something new that can be of assistance to them. The primary challenge we faced was the lack of availability of larger databases, yet optimizing a model to its fullest potential requires access to a sizeable dataset. In the future, we will continue investigating these open problems.