classification accuracy in data mining


During Data Mining project creation, Create a Testing Data Set is an important option for measuring accuracy. Finally, Section 6 gives conclusions of the work from the observed results. In both cases, testing is performed on the given training set and by using stratified 10-fold cross validation. The risk of overfitting the training data exists because learning algorithms are trained on finite samples: the model may memorize the training samples rather than learn a general rule, i.e., the data-producing model. From the results, it is seen that the J48 algorithm is less accurate and takes moderate training time compared to the other algorithms. It is again clear that the training time taken by the Random Tree (1.88 s) is much lower than that of the J48 and Random Forest algorithms. Let us look at different evaluation parameters for the different algorithms. Since there are several algorithms to choose from, it is essential to determine which one is best. He is a presenter at various user groups and universities. She completed B.Tech. The squared error is the sum of the squared differences between the actual values and the predicted values. Figure 10 shows the pre-processing stage of data mining for seven attributes in WEKA. After the model has been trained, it is used to make predictions on previously unseen data.
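The stratified 10-fold cross validation mentioned above can be illustrated with a minimal sketch. This is not the WEKA implementation: the function name stratified_k_fold, the seed, and the class counts are assumptions made for illustration only; the class names are the four power quality problems discussed in the text.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Return k lists of indices whose class proportions mirror the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin, continuing where the previous class
        # stopped, so every fold receives a similar share of every class.
        for index in indices:
            folds[next_fold % k].append(index)
            next_fold += 1
    return folds


# Hypothetical class counts, chosen only to make the proportions easy to check.
labels = ["sag"] * 40 + ["swell"] * 30 + ["interruption"] * 20 + ["unbalance"] * 10
folds = stratified_k_fold(labels, k=10)

for fold_number, test_indices in enumerate(folds):
    held_out = set(test_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_indices and scored on test_indices;
    # the k scores are then averaged to estimate accuracy.
    print(fold_number, len(train_indices), len(test_indices))
```

Each sample is used for testing exactly once, which is why the averaged score is usually a less optimistic estimate than testing on the training set itself.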

Since Neural Network and Logistic Regression have similar results, it is difficult to distinguish them in the chart. The paper is organized as follows: Section 2 gives definitions and causes of power quality problems like voltage sag, swell, interruption and unbalance along with their typical figures. Table 5 shows the results obtained after testing the algorithms using stratified 10-fold cross validation. for a period of time not exceeding 1 min. Decision trees are expressive enough to model many partitions of the data that are not as easily achieved with classifiers that rely on a single decision boundary, such as logistic regression or SVM. Because a predictive model's accuracy is typically high (over 90%), it is common to summarize a model's performance in terms of the model's error rate. For a 2-class problem, we get a 2 x 2 confusion matrix. Pre-process stage of data mining in WEKA with 7 attributes. In fact, the more diversified the data, the more accurate the result. Data mining technology is an effective tool for dealing with massive data and detecting useful patterns in those data. and 1.8 p.u. Further, the Profit chart helps to find the optimum number of cases to choose. The PQ problems cannot be completely eliminated, but they can be minimized up to a limit through various equipment such as custom power devices, power factor corrector circuits, filters, etc. Her research interests are Neural Networks, Power Systems and Power Quality. If both values are specified in the above screen, both limits are enforced. The data samples obtained from simulations carried out on the system shown in Fig.
Logistic Regression, Neural Network and Naive Bayes models are other models in The circuit shown in Fig. Take, for example, the identification of email spam. He is always available to learn and share his knowledge. In the second data set, three more numeric attributes, namely the minimum, maximum and average voltages, are added along with the 3-phase RMS voltages.

This data is used for classification by data mining algorithms. The information is the same as that shown in Fig. For a marketing campaign, there are four The various differences between the three data mining algorithms are presented in Table 1. It is proposed that data mining can provide answers to the end-users about PQ problems by converting raw data into useful knowledge [28, 29]. Accuracy is the number of correctly predicted cases in the test set divided by the total number of predictions made on the test set. One data set is used to train the model and the other to test it. Random Forest fits many classification trees to a data set and then combines the predictions from all the correlated trees. After the above data was entered, the following Profit chart can be observed. Apart from that ignored variable exception, everything else is the same across all four algorithms. The authors declare that they have no competing interests. Data mining has recently gained popularity over classical techniques in many research fields for the purpose of analyzing data due to (i) a vast increase in the size and number of databases, (ii) the decrease in storage device costs, (iii) an ability to handle data which contains distortion (noise, missing values, etc.). She completed her B.Tech. the Specify a different data set. A confusion matrix is a table that is typically used to describe the performance of a classification model on a set of test data for which the true values are known. This screen was ignored in the previous articles, but it plays an important role when measuring accuracy in data mining. Percentage of the correct cases out of the selected cases. We use data mining to maximize profit. Figure 1 shows a typical waveform of a voltage sag.
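The accuracy definition and the confusion matrix described above can be sketched in a few lines. The label lists below are hypothetical test-set values, and the helper names confusion_counts and accuracy are illustrative rather than taken from WEKA or SQL Server.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Correct predictions divided by the total number of predictions."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical true labels and model outputs for eight test cases.
actual = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "no", "yes", "yes", "no", "yes"]

counts = confusion_counts(actual, predicted)
for a in ("yes", "no"):
    # One row of the 2 x 2 matrix per actual class.
    print(a, [counts[(a, p)] for p in ("yes", "no")])

test_accuracy = accuracy(actual, predicted)
error_rate = 1 - test_accuracy
print(test_accuracy, error_rate)
```

The error rate is simply the complement of accuracy, so either number can be reported once the matrix is known.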
The Classification Matrix, or confusion matrix, is used to derive various classification accuracy metrics. In a Random Tree, each node is split using the best among the subset of randomly chosen attributes at that node. This test data set will be used to measure the accuracy and other metrics. He has been working with SQL Server for more than 15 years, written articles and coauthored books. The existence of PQ problems greatly affects the safe, reliable and economical operation of electric power systems. Then, we will create a mining model using the Decision Tree algorithm and add the remaining three algorithms later. The performances of the J48 decision tree, Multi-Layer Perceptron (MLP) and Naive Bayes classification algorithms were studied with respect to training time and accuracy of prediction [12]. Since you have four models as a solution for the She has 80 International and National journal papers to her credit. Measuring the Accuracy in Data Mining in SQL Server.

Data mining for classification of power quality problems using WEKA and the effect of attributes on classification accuracy, https://doi.org/10.1186/s41601-018-0103-3, Protection and Control of Modern Power Systems, www.nilc.icmc.usp.br/elc-ebralc2012/minicursos/WekaManual-3-6-8.pdf, http://creativecommons.org/licenses/by/4.0/. A longer interruption harms practically all operations of a modern society [1]. Accuracy is tested at the end of the learning process to assess the model's ability to predict unseen data. It is important to choose the correct parameters for the profit chart. Figure 7 shows the tree diagram of a Random Forest. The following screenshot shows the options for selecting the volume of data for the test data set. The following screenshot is the legend for the above chart. For this, instruments should collect a huge amount of data, such as measured currents, voltages and occurrence times. It indicates the total number of instances, the number of attributes and the number of samples under each class of power quality problems along with a bar graph.
\[ \text{Mean Absolute Error} = \frac{|p_1-a_1|+\dots+|p_n-a_n|}{n} \]
However, there are a few other parameters that are derived from the above classification matrix. How about the overall fit of the model, the accuracy of the model? She has 100 International and National papers published in various conferences held in India and abroad. It is a collection of machine learning algorithms for data mining tasks.
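As a small worked example of the mean absolute error formula given above, the sketch below takes p as the predicted values and a as the actual values. The function name and the sample numbers are assumptions chosen for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average of |p_i - a_i| over all n cases."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predicted and actual values for four cases.
predicted = [0.9, 1.1, 0.2, 1.0]
actual = [1.0, 1.0, 0.0, 1.0]

mae = mean_absolute_error(predicted, actual)
print(round(mae, 3))
```

The absolute differences here are 0.1, 0.1, 0.2 and 0, so the mean is 0.4 / 4 = 0.1, up to floating-point rounding.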
The basic premise of the application is to utilize a computer application that can be trained to perform machine learning and derive useful information in the form of trends and patterns. Data mining is a predictive technique that uses existing patterns. Random Forest is used for the classification of PQ disturbances [18] and fault record detection in the data center of a large power grid [19]. This paper presents the classification of power quality problems such as voltage sag, swell, interruption and unbalance using data mining algorithms: J48, Random Tree and Random Forest decision trees. During the training phase, model selection is influenced by measurement accuracy: parameters are chosen to maximize prediction accuracy on training data. The voltage unbalance is created by a 3-phase unbalance fault. As seen in the picture above, there are two possible predicted classes, yes and no, where X is actually no and predicted no (true negative), Y is predicted yes but is actually no (false positive), Z is actually yes but predicted no (false negative), and L is actually yes and predicted yes (true positive). Precision is a good statistic to employ when the cost of false positives is high. The results obtained after testing the algorithms using the training set are indicated in Table 4. The Profit chart is somewhat unique to Microsoft tools. random is the model that will be automatically selected. Section 3 deals with the basics of data mining and explains the J48, Random Tree and Random Forest algorithms.
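The four cells X, Y, Z and L described above are enough to compute the usual summary metrics. The sketch below assumes hypothetical cell counts and uses the standard definitions of precision and recall; the function and variable names are illustrative.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model labelled positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: X = true negatives, Y = false positives,
# Z = false negatives, L = true positives.
x_tn, y_fp, z_fn, l_tp = 50, 10, 5, 35

precision_value = precision(l_tp, y_fp)
recall_value = recall(l_tp, z_fn)
accuracy_value = (l_tp + x_tn) / (x_tn + y_fp + z_fn + l_tp)

print(round(precision_value, 3), recall_value, accuracy_value)
```

With these counts, 10 false positives pull precision down to 35/45, while only 5 false negatives leave recall at 35/40, which shows why the two numbers answer different questions about the same matrix.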
This section also briefly describes the WEKA software used to implement data mining for classification. Recall / Sensitivity measures how many of the actual positives our model captures by labeling them as positive (true positives). Decision trees are more interpretable than other classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM) because they combine simple questions about the data in an understandable way [9]. The circuit consists of a 33/11 kV distribution substation connected to a 2 km distribution line having an 11/0.433 kV distribution transformer supplying a load of 190 kW, 140 kVAr [37]. The effect of data attributes on the classification accuracy and on the time taken for training the decision trees is also discussed. If all attributes have been used, or if an unambiguous result cannot be obtained from the available information, the branch is assigned the target value that the majority of the items under it possess. The header section contains relation declarations mentioning the name of the relation and attribute declarations listing the attributes (the columns in the data) with their types [38].

at the power frequency for durations from 0.5 cycles to 1 min. J48 was compared with Random Forest in the classification of power quality disturbances, and Random Forest was found to be more accurate than J48 [20]. The following are the lift charts for the four models, the random model, and the ideal model. The first task involves categorizing data into groups based on some identifying features. She is presently pursuing Ph.D. in Power Quality at UCE, OU, Hyderabad. We may distinguish between classification and regression based on the nature of the prediction. Classification of data is an important task in the data mining process that extracts models for describing classes and predicts the target class for data instances. Thus, these algorithms use a tree representation, which helps in pattern classification in data sets, being hierarchically structured in a set of interconnected nodes. Pre-process stage of data mining in WEKA with 4 attributes. The accuracy of Naive Bayes reduces as the data size increases. The three phase voltages during an unbalanced fault are as shown in Fig. The only tab we have not discussed so far is the Mining Accuracy Chart tab. The vast and increasing volumes of data obtained from power quality monitoring systems require the use of data mining techniques for analysis. Classification models include Naive Bayes, Decision Trees and Neural Networks. In our case, Y will be FP. False Negatives (FN): these are cases in which we predicted no, but they are actually yes. The simple tree structure of J48 is as shown in Fig. Random Trees were introduced by Leo Breiman and Adele Cutler.

In data mining and machine learning, accuracy is a critical component of model performance: the success of a model depends on its accuracy, which shows how close its predictions are to the true values. She guided 4 Ph.D. scholars. Data mining techniques, instead, can analyze and cope intelligently with records containing missing values, as well as a mixture of qualitative and quantitative data, without tedious manual manipulation [31, 32]. However, you have the option of choosing a different data set for the evaluation purposes by using Prot Control Mod Power Syst 3, 29 (2018). An interruption occurs when the supply voltage or load current decreases to less than 0.1 p.u. The feature that tells us the most about the data instances, so that we can classify them best, is said to have the highest information gain.
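Information gain, as described above, can be computed as the drop in class entropy after splitting on an attribute. The following is a minimal sketch under that textbook definition, not the J48 code; note that C4.5 itself ranks splits by gain ratio, a normalized variant. The attribute names level and phase and the four rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical samples: "level" separates the classes perfectly, "phase" does not.
rows = [
    {"level": "low", "phase": "a"},
    {"level": "low", "phase": "b"},
    {"level": "high", "phase": "a"},
    {"level": "high", "phase": "b"},
]
labels = ["sag", "sag", "swell", "swell"]

gain_level = information_gain(rows, labels, "level")
gain_phase = information_gain(rows, labels, "phase")
best_attribute = max(("level", "phase"), key=lambda name: information_gain(rows, labels, name))
print(gain_level, gain_phase, best_attribute)
```

A tree builder would split the node on the attribute with the highest score and then repeat the calculation within each branch.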

This gives us the error rate. Classification yields a categorical label, whereas regression yields a continuous function. Accuracy is not a reliable metric for the real performance of a classifier when the number of samples in different classes varies greatly (unbalanced target), because it will yield misleading results. The performance of the algorithms is evaluated in both cases to determine the best classification algorithm, and the effect of adding the three attributes in the second case is studied, which shows the advantages in terms of classification accuracy and training time of the decision trees. Precision / Confidence refers to how many of the predicted positives are actually positive. Power quality monitoring requires storing a large amount of data for analysis. S. Asha Kiranmai. The data is sampled at a frequency of 2 kHz. To classify a new item, the algorithm first needs to create a decision tree based on the attribute values of the available training data.

However, since we are using data mining outcomes for better business decisions, So, with the inclusion of these three simple attributes into the data, the data mining algorithms have trained better and their generalization capabilities are enhanced, leading to more accurate results. 792 cases are the other way around. The above values are defined as follows.
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total of all cases to be predicted}} \]
Presently working as Professor in EEE and coordinator in Centre for Energy Studies, Jawaharlal Nehru Technological University Hyderabad College of Engineering, Hyderabad. The trees that make up the Random Forest are built by randomly selecting m attributes (a value fixed for all nodes) at each node of the tree, from which the best attribute is chosen to divide the node. 9, except for the number of attributes taken. Decision trees naturally support classification problems with more than two classes and can be modified to handle regression problems. In the Input Selection, you can choose which models to evaluate. According to the experimental results, the C5.0 model proved to have the best performance. Voltage sags can occur due to short circuits, overloads and starting of large motors.

In a standard tree, each node is split using the best split among all attributes.

the above data set is Decision Trees. J48 is an open source Java implementation of the C4.5 algorithm in the WEKA data mining tool. These values can be arranged in a 2 x 2 matrix called a contingency matrix, where we have the actual classes P and C on the rows, and the predicted classes P and C on the columns. The (error|misclassification) rates are good complementary metrics to overcome this problem.
\[ \text{Relative absolute error} = \frac{|p_1-a_1|+\dots+|p_n-a_n|}{|a_1-\bar{a}|+\dots+|a_n-\bar{a}|} \]
From the results, it is seen that the Random Tree has a higher overall accuracy (99.9943%) and takes less training time (1.86 s) compared to the J48 and Random Forest algorithms. She was awarded Best Technical Paper Award for Electrical Engineering by Institution of Electrical Engineers in the year 2006. decision tree model predicted them as possible bike buyers.
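The relative absolute error formula above compares the model's total absolute error with that of a baseline that always predicts the mean of the actual values. A minimal sketch, with p as predictions and a as actual values; the function name and the sample numbers are assumptions for illustration.

```python
def relative_absolute_error(predicted, actual):
    """Total |p_i - a_i| divided by total |a_i - mean(a)|."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    mean_actual = sum(actual) / len(actual)
    numerator = sum(abs(p - a) for p, a in zip(predicted, actual))
    denominator = sum(abs(a - mean_actual) for a in actual)
    if denominator == 0:
        raise ValueError("all actual values are identical, so the ratio is undefined")
    return numerator / denominator


# Hypothetical values: the mean of actual is 2.5, so the baseline error is 4.0.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

rae = relative_absolute_error(predicted, actual)
print(rae)
```

A value below 1 means the model does better than always predicting the mean; here the model's total error of 1.0 against a baseline of 4.0 gives 0.25.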