MLCS 2020

2nd Workshop on
Machine Learning for Computing Systems

Friday, November 13, 2020; 2:30pm-6:00pm EST
Virtual
Hosted at SC '20

Motivation

Large computing facilities and data centers already produce terabytes of monitoring-related data each day, ranging from low-level hardware telemetry and error data, to job scheduling and system logs, to natural-language text from administrator troubleshooting tickets and notes. Meanwhile, we are on track to build computing facilities in the next five years that will be at least 25x larger than the largest facilities today. This growth will sharply increase design and management complexity for the human operators who currently perform these processes manually. Current state-of-the-art tools for operators tend to focus on hand-designed filters that specify the "interesting" behaviors humans already care about [1-3]. Such tools are highly restrictive: they can only detect previously known behaviors and will never surface new or unknown behavioral modes. Furthermore, they must be constantly updated, and will likely break whenever the system is upgraded or changed in a way that no longer matches the carefully crafted filters.

Machine learning and data science offer alternative techniques that have already demonstrated effectiveness at characterizing and extracting knowledge from large, complex datasets across a wide variety of domains. For these reasons, we are organizing the 2nd Workshop on Machine Learning for Computing Systems (MLCS). MLCS 2020 will provide a much-needed opportunity not only to share cutting-edge research ideas, but also to bring together researchers across the disciplines of data science, machine learning, statistics, applied mathematics, systems design, systems monitoring, systems resilience, and hardware architecture around a shared goal: better and more efficient monitoring and use of large-scale computing machines and facilities.

Theme and Audience

There has recently been rising interest in using machine learning techniques to better understand, analyze, manage, and design large-scale computing facilities [4-6]. Interdisciplinary research at the intersection of machine learning, data science, and systems has already produced advances in memory error mitigation [7-8], datacenter cooling [9], system log analysis [10], job scheduling [11], and database indexing [12], among others. Especially as the machine learning community builds a focus on human-understandable models [13-15], learned models become extremely attractive for HPC-related decision support and for developing data-driven tools to assist human experts. Additionally, HPC-related problems frequently overlap with open machine learning research areas, such as anomaly detection in near-natural-language text (e.g., system and console logs), so there is a definite need for collaboration between HPC domain experts and statistical modeling and machine learning experts. In fact, organizers of this workshop are experiencing continuing success with the ATLAS project [16,17], which brings together national laboratories and academia in a healthy collaboration that has enabled the public release of new datasets that are already in use by a variety of researchers [18-21].

While ML-for-systems workshops are on the rise [22-23], the community remains fragmented along the lines of industry, academia, and national laboratories. The audience we target through MLCS '20 is intentionally broad and inclusive, ranging from seasoned machine learning and systems experts to students new to the field, and spanning industry, academia, and government. While we aim to be a conduit for productive conversations between professional experts who may not otherwise connect, we also welcome and encourage students and newcomers who may not have previously considered our interdisciplinary field. To that end, we explicitly solicit not only mature research results, but also works-in-progress, extended abstracts, and position papers (detailed in our CFP).

References

[1] “Event Log File Analysis.” https://www.solarwinds.com/topics/event-log-analyzer
[2] “Logalyze.” http://www.logalyze.com/
[3] “Log MX.” http://www.logmx.com/
[4] Evans, R., and Gao, J. (2016). "DeepMind AI reduces Google data centre cooling bill by 40%." https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
[5] Carastan-Santos, D., and De Camargo, R. (2017). Obtaining dynamic scheduling policies with simulation and machine learning. The International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing).
[6] Marathe, A., Anirudh, R., Jain, N., Bhatele, A., Thiagarajan, J., Kailkhura, B., ... & Gamblin, T. (2017). Performance modeling under resource constraints using deep transfer learning. (No. LLNL-CONF-736726). Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
[7] Baseman, E., DeBardeleben, N., Blanchard, S., Moore, J., Tkachenko, O., Ferreira, K., Siddiqua, T., and Sridharan, V. (2018). Physics-informed machine learning for DRAM error modeling.  32nd IEEE Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems.
[8] Baseman, E., DeBardeleben, N., Ferreira, K., Sridharan, V., Siddiqua, T., and Tkachenko, O. (2017). Automating DRAM fault mitigation by learning from experience. The 47th IEEE/IFIP International Conference on Dependable Systems and Networks.
[9] Evans, R. A., Gao, J., Ryan, M. C., Dulac-Arnold, G., Scholz, J. K., & Hester, T. A. (2018). U.S. Patent Application No. 15/410,547.
[10] Baseman, E., Blanchard, S., Li, Z., and Fu, S. (2016).  Relational synthesis of text and numeric data for anomaly detection on computing system logs. 15th IEEE International Conference on Machine Learning with Applications.
[11] Gaussier, E., Glesser, D., Reis, V., & Trystram, D. (2015, November). Improving backfilling by using machine learning to predict running times. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (p. 64). ACM.
[12] Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018, May). The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (pp. 489-504). ACM.
[13] Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
[14] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.
[15] Lundberg, S., & Lee, S. I. (2017). A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.
[16] Amvrosiadis, G., Park, J. W., Ganger, G. R., Gibson, G. A., Baseman, E., & DeBardeleben, N. (2018). On the diversity of cluster workloads and its impact on research results. In 2018 USENIX Annual Technical Conference (USENIX ATC 18) (pp. 533-546).
[17] Amvrosiadis, G., Kuchnik, M., Park, J.W., Cranor, C., Ganger, G.R., Moore, E., and DeBardeleben, N. (2018). The ATLAS cluster trace repository. ;login: The USENIX Magazine, Winter 2018, Vol. 43, No. 4.
[18] Chung, A., Park, J. W., & Ganger, G. R. (2018, October). Stratus: cost-aware container scheduling in the public cloud. In Proceedings of the ACM Symposium on Cloud Computing (pp. 121-134). ACM.
[19] Ghandeharizadeh, S., & Huang, H. (2018, December). Hoagie: A database and workload generator using published specifications. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 3847-3852). IEEE.
[20] Agarwal, N., Greenberg, H., Blanchard, S., & DeBardeleben, N. (2018, November). SaNSA - the Supercomputer and Node State Architecture. In 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) (pp. 69-78). IEEE.
[21] Souza, A. (2018). Application-aware resource management for datacenters (Doctoral dissertation, Department of Computing Science, Umeå University).
[22] Goldie, A., Mirhoseini, A., Raiman, J., Swersky, K., and Hashemi, M. (2018). Workshop on ML for systems at NeurIPS 2018.
http://mlforsystems.org/
[23] Young, S., Patton, R., Keuper, J., and Houston, M. (2018) Machine learning in HPC environments workshop at SC 18. https://ornlcda.github.io/MLHPC2018/index.html
[24] "SIAM International Conference on Data Mining." (2019). https://www.siam.org/conferences/cm/conference/sdm20
