Data Mining and the Grid (Knowledge Discovery from Data)
NASA's Earth-observing spacecraft produce large amounts of data. For instance, the instruments on NASA's Terra mission yield over 850 Gb of processed data daily, which is stored at eight distinct sites, called Distributed Active Archive Centers, that are connected only through a wideband network or grid. Analyzing these and other large data repositories and extracting the useful information concealed in them is a fundamental NASA task that creates "data products". These application data products cannot be obtained with traditional techniques and require the use of advanced Data Mining and Grid technologies.
Grid Miner is an integrated data mining software system developed as one of the early applications on NASA's Information Power Grid (IPG). Students will participate in ongoing projects to implement truly distributed algorithms and add them to this system, and to develop Earth Science data products from remote sensing instrument data using the knowledge discovery from data (KDD) methods available with Grid Miner or other data mining systems. We shall also consider porting the programmed algorithms to the National Science Foundation's TeraGrid. Desirable skills are programming experience (C/C++, Perl, Fortran, etc.) and familiarity with, or a willingness to learn, MPI, Globus, MATLAB and Star*P.
The image to the right shows an example of synthetic data randomly generated by programs written by participants in SPHERE 2006. Synthetic data is generated so that the correct result is known in advance, and it is used to test the clustering software students are writing. In this example the four groups of data are well captured by the software, as can be seen in the corresponding dendrogram. The dendrogram was created from the clustering results by a recursive program written by participants.
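The testing idea described above can be illustrated with a minimal Python sketch. It is not the participants' code or part of Grid Miner: the function names, the choice of single-linkage agglomerative clustering, the group centers, and the text rendering of the dendrogram are all assumptions made for illustration. The sketch generates four well-separated groups of points, clusters them, cuts the merge tree into four clusters, and walks the tree recursively to print it.

```python
import random


def make_synthetic(centers, points_per_group, spread, seed=0):
    """Generate Gaussian blobs around known centers (hypothetical helper).

    Returns the points and the group label each point was drawn from,
    so the expected clustering is known in advance.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(points_per_group):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def agglomerate(points):
    """Single-linkage agglomerative clustering (an assumed method).

    Returns the root of the merge tree. A leaf is ("leaf", index);
    an internal node is ("node", merge_distance, left, right).
    """
    clusters = [(("leaf", i), [i]) for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = (("node", d, clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def leaves(tree):
    """Recursively collect the point indices under a tree node."""
    if tree[0] == "leaf":
        return [tree[1]]
    return leaves(tree[2]) + leaves(tree[3])


def cut(tree, k):
    """Split the tree into k clusters by undoing the largest merges first."""
    parts = [tree]
    while len(parts) < k:
        top = max((p for p in parts if p[0] == "node"), key=lambda p: p[1])
        parts.remove(top)
        parts += [top[2], top[3]]
    return [sorted(leaves(p)) for p in parts]


def render(tree, depth=0):
    """Recursively render the dendrogram as indented text lines."""
    pad = "  " * depth
    if tree[0] == "leaf":
        return [f"{pad}point {tree[1]}"]
    head = [f"{pad}merge at distance {tree[1]:.2f}"]
    return head + render(tree[2], depth + 1) + render(tree[3], depth + 1)


# Four groups placed far apart relative to their spread (assumed values).
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points, labels = make_synthetic(centers, points_per_group=6, spread=0.5)
tree = agglomerate(points)
found = cut(tree, 4)

# Compare the recovered clusters with the groups the data was built from.
expected = [[i for i, lab in enumerate(labels) if lab == g] for g in range(4)]
print("recovered all four groups:", sorted(found) == expected)
print("\n".join(render(tree)[:8]))
```

Because the groups are built in, a mismatch between the recovered clusters and the known labels points to a bug in the clustering code rather than to an ambiguity in the data. For real workloads an established library routine would normally replace the quadratic search in `agglomerate`.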