DBMD implementation for Big Data Mining in the CEASE method

Matrix decomposition is one of the fundamental tools for discovering knowledge from big data, here generated by modern applications via cloud computing. However, it is still inefficient or infeasible to process very big data with such a method on a single machine or through virtual machines. Moreover, big data are often collected distributedly from various data centers and stored on different machines via scheduling algorithms. Thus, such data generally bear strong heterogeneous noise. It is therefore essential and useful to develop distributed matrix decomposition for big data analytics. Such a method should scale up well, model the heterogeneous noise, and address the communication issue in a distributed system. To this end, we propose a distributed Bayesian matrix decomposition model (DBMD) for big data mining and clustering. Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent (AGD), 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference (CEASE). We investigate the theoretical convergence behaviors of these algorithms. To address the heterogeneity of the noise, we propose an optimal plug-in weighted average that reduces the variance of the estimation. Finally, we compare these algorithms to understand their relative behavior.


INTRODUCTION
Data visualization technology is used in this project to handle the data processing. We employ a data visualization technique that combines graph-based topology representation with dimensionality reduction methods to visualize the intrinsic data structure in a low-dimensional vector space. The application of graphs in clustering and visualization has several advantages. A graph of important edges (where edges characterize relations and weights represent similarities or distances) provides a compact representation of the entire complex data set. This text describes clustering and visualization methods that are able to utilize the information hidden in these graphs, based on the synergistic combination of clustering, graph theory, neural networks, data visualization, dimensionality reduction, fuzzy methods, and topology learning. The work contains numerous examples to aid in the understanding and implementation of the proposed algorithms.
The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (Input) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new, complex data sets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss 1) the integration of data clustering and visualization into one framework, 2) the application of data clustering to Input gene expression data, 3) the evaluation of the number of clusters k in the context of Input gene expression clustering, and 4) the improvement of overall analysis quality via dedicated post-processing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.

RELATED WORK
Matrix decomposition is one of the fundamental tools to discover knowledge from the big data generated by modern applications. However, it is still inefficient or infeasible to process very big data with such a method on a single machine. Moreover, big data are often collected distributedly and stored on different machines. Thus, such data generally bear strong heterogeneous noise. It is essential and useful to develop distributed matrix decomposition for big data analytics. Such a method should scale up well, model the heterogeneous noise, and address the communication issue in a distributed system. To this end, we propose a distributed Bayesian matrix decomposition model (DBMD) for big data mining and clustering. Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference. We investigate the theoretical convergence behaviors of these algorithms. To address the heterogeneity of the noise, we propose an optimal plug-in weighted average that reduces the variance of the estimation. Synthetic experiments validate our theoretical results, and real-world experiments show that our algorithms scale up well to big data and achieve superior or competitive performance compared to two typical distributed methods, Scalable-NMF and scalable k-means++.

Website : www.ijirmps.org Email : editor@ijirmps.org

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results. The better choice is therefore to place them as far away from each other as possible.
The next step is to take each point belonging to the given data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated, and as a result of this loop we may notice that the k centers change their location step by step until no more changes occur; in other words, the centers do not move any more. The matrix decomposition methods mentioned above have little relevance to the underlying computational architecture. They assume that the program is running on a single machine and that an arbitrary number of data points is accessible instantaneously. However, the huge size of data often makes it impossible to handle all of them on a single machine. Many applications collect data distributedly from different sources (e.g., labs, hospitals). The communication between them is expensive due to the limited bandwidth, and direct data sharing also raises privacy concerns. Moreover, data collected from different sources often bear strong heterogeneous noise. Therefore, developing efficient matrix decomposition methods in a distributed system is essential. Many of these methods are sequential, which limits their applicability to big data. Generally, scaling the k-means algorithm to distributed data is relatively easy due to its iterative nature. Here 'c' is the number of cluster centers, and the algorithm begins by randomly selecting 'c' of them.
Calculate the distance between each data point and the cluster centers.
Assign each data point to the cluster center whose distance from it is the minimum among all the cluster centers. Recalculate each new cluster center using v_i = (1/c_i) * Σ_{j=1}^{c_i} x_j, where 'c_i' represents the number of data points in the i-th cluster and x_j denotes the points assigned to it. Recalculate the distance between each data point and the newly obtained cluster centers. If no data point was reassigned then stop; otherwise repeat from step 3).
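The steps above can be sketched in a few lines of Python. This is an illustrative single-machine implementation under our own naming (`kmeans` is a hypothetical helper, not the distributed variant evaluated in this paper):

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """Plain k-means following the steps above: initialize c centers,
    assign each point to its nearest center, recompute each center as
    the barycenter of its assigned points, and stop when no point is
    reassigned."""
    rng = np.random.default_rng(seed)
    # Step 1: pick c distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2-3: distance to every center, assign to the nearest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no data point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: v_i = (1/c_i) * sum of the points in cluster i.
        for i in range(c):
            pts = X[labels == i]
            if len(pts) > 0:
                centers[i] = pts.mean(axis=0)
    return centers, labels
```

Each iteration performs the assignment and re-centering steps described above; convergence follows because neither step can increase the within-cluster sum of squared distances.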

RESULT
Continual testing of individual software components as they were developed involved evaluating and reviewing the software prototype to identify problem situations. With the project methodology using an evolutionary development strategy to progress systematically between iterations, a standard set of small-scale testing procedures was in place to deal with defects in the source code and to filter out unintended functionality. As each feature was continually incorporated and verified, the software prototype deliverable retained and prioritised further development attributes. To ascertain that the prototype produced throughout the development phase was correct, a series of tests was formalised and conducted. The creation of small-scale throwaway prototypes was at the heart of the project, quickly identifying and defining pathways for the development to follow. This method vastly improved and distinguished clear directions to take, with the minimum requirements and possible extensions reflected upon within each small prototype using the preceding design phase. Although the design phase in the initial prototypes did not correlate exactly with what was actually produced in the UML models, it still enabled identification and analysis of the user interface and the output documents that needed to be factored into the prototype.

ISSN: 2349-7300 | IJIRMPS2103032

Fig: Throughput Comparison for Medium Data Requests
The figure above shows the throughput and the data-size utilization at the cloud data center for medium data requests. Our proposed K-means-cluster-based CEASE algorithm achieves higher throughput than AGD and ADMM because we set the dynamic upper threshold value by computing the median absolute deviation and the interquartile range of past data. Both the throughput and the supported data size were high in the data mining and smart clustering tasks. The figure below shows the corresponding throughput and utilization for small data requests; again, the K-means-cluster-based CEASE algorithm achieves higher throughput than AGD and ADMM for the same reason. The throughput remained high even at large data sizes, reaching around 420 Mbps even with 1500 GB of data requested by many users. When the latency between the central machine and the nodes is high, an algorithm generates traffic in the network, which in turn affects the QoS of the entire system. We therefore plotted latency against packets. The implemented results show that CEASE has a very low latency compared to the other algorithms, so overall our algorithm performs well in all respects.
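The dynamic upper threshold above is said to be computed from the median absolute deviation and the interquartile range of past data, but the exact combination is not given in the text. The sketch below is one plausible robust choice under that description; the function name and the coefficients `k_mad` and `k_iqr` are our assumptions:

```python
import numpy as np

def dynamic_upper_threshold(history, k_mad=3.0, k_iqr=1.5):
    """Hypothetical dynamic upper threshold from past utilization data,
    taking the smaller of two common robust upper bounds:
    median + k_mad * MAD  and  Q3 + k_iqr * IQR."""
    x = np.asarray(history, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))   # median absolute deviation
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1                      # interquartile range
    return min(med + k_mad * mad, q3 + k_iqr * iqr)
```

Because both MAD and IQR ignore extreme values, an outlier in the past data (e.g., a single utilization spike) barely moves the threshold, which is the motivation for using them instead of the mean and standard deviation.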

CONCLUSION
We proposed a distributed Bayesian matrix decomposition model for big data mining and clustering. Three distributed strategies (i.e., AGD, ADMM, and CEASE) were adopted to solve it. The convergence rates of AGD and ADMM depend on different structural parameters and thus exhibit different behaviors. In short, as the number of instances on each node machine increases, the convergence rate of CEASE does not change much, whereas the convergence rates of AGD and ADMM change considerably. Empirically, CEASE also converges faster as the number of instances grows. To tackle the heterogeneous noise in the data, we proposed an optimal plug-in weighted average scheme that significantly reduces the variance of the estimation. The proposed algorithms scale up well, and the real-world experiments demonstrate that they achieve superior or competitive performance. Both the Bayesian prior and the weighted average strategy reduce the influence of highly noisy data.
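As a closing illustration, the variance-reduction idea behind a plug-in weighted average can be sketched as inverse-variance weighting. This is the generic minimum-variance linear combination under independent noise, not the paper's exact scheme; the function name and its inputs are our assumptions:

```python
import numpy as np

def plugin_weighted_average(estimates, noise_vars):
    """Combine per-node estimates with weights inversely proportional to
    plug-in estimates of each node's noise variance. Nodes with noisier
    data get smaller weights, which reduces the variance of the combined
    estimate relative to a plain average."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(noise_vars, dtype=float)
    w = 1.0 / var
    w = w / w.sum()  # normalize the weights to sum to one
    if est.ndim > 1:
        return (w[:, None] * est).sum(axis=0)
    return (w * est).sum()
```

With equal noise variances this reduces to the ordinary average; as one node's noise variance grows, its estimate is progressively discounted.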