Christian Engelmann
Senior Scientist and Group Leader, Intelligent Systems and Facilities, Oak Ridge National Laboratory
Verified email at ornl.gov
Title
Cited by
Year
Proactive fault tolerance for HPC with Xen virtualization
AB Nagarajan, F Mueller, C Engelmann, SL Scott
Proceedings of the 21st annual international conference on Supercomputing, 23-32, 2007
Cited by: 527; Year: 2007
Addressing failures in exascale computing
M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ...
The International Journal of High Performance Computing Applications 28 (2 …, 2014
Cited by: 514; Year: 2014
Detection and correction of silent data corruption for large-scale high-performance computing
D Fiala, F Mueller, C Engelmann, R Riesen, K Ferreira, R Brightwell
SC'12: Proceedings of the International Conference on High Performance …, 2012
Cited by: 379; Year: 2012
Proactive process-level live migration in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 1-12, 2008
Cited by: 247; Year: 2008
Combining partial redundancy and checkpointing for HPC
J Elliott, K Kharbas, D Fiala, F Mueller, K Ferreira, C Engelmann
2012 IEEE 32nd International Conference on Distributed Computing Systems …, 2012
Cited by: 202; Year: 2012
Failures in large scale systems: Long-term measurement, analysis, and implications
S Gupta, T Patel, C Engelmann, D Tiwari
Proceedings of the International Conference for High Performance Computing …, 2017
Cited by: 161; Year: 2017
Proactive fault tolerance using preemptive migration
C Engelmann, GR Vallee, T Naughton, SL Scott
2009 17th Euromicro International Conference on Parallel, Distributed and …, 2009
Cited by: 150; Year: 2009
Functional partitioning to optimize end-to-end performance on many-core architectures
M Li, SS Vazhkudai, AR Butt, F Meng, X Ma, Y Kim, C Engelmann, ...
SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High …, 2010
Cited by: 117; Year: 2010
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
C Wang, F Mueller, C Engelmann, SL Scott
2007 IEEE International Parallel and Distributed Processing Symposium, 1-10, 2007
Cited by: 114; Year: 2007
The case for modular redundancy in large-scale high performance computing systems
C Engelmann, HH Ong, SL Scott
Proceedings of the 8th IASTED international conference on parallel and …, 2009
Cited by: 111; Year: 2009
NVMalloc: Exposing an aggregate SSD store as a memory partition in extreme-scale machines
C Wang, SS Vazhkudai, X Ma, F Meng, Y Kim, C Engelmann
2012 IEEE 26th International Parallel and Distributed Processing Symposium …, 2012
Cited by: 99; Year: 2012
A framework for proactive fault tolerance
G Vallee, K Charoenpornwattana, C Engelmann, A Tikotekar, ...
2008 Third International Conference on Availability, Reliability and …, 2008
Cited by: 92; Year: 2008
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod
Whitepaper, Dec, 2009
Cited by: 87; Year: 2009
System-level virtualization for high performance computing
G Vallee, T Naughton, C Engelmann, H Ong, SL Scott
16th Euromicro Conference on Parallel, Distributed and Network-Based …, 2008
Cited by: 85; Year: 2008
Super-scalable algorithms for computing on 100,000 processors
C Engelmann, A Geist
International Conference on Computational Science, 313-321, 2005
Cited by: 81; Year: 2005
Machine learning models for GPU error prediction in a large scale HPC system
B Nie, J Xue, S Gupta, T Patel, C Engelmann, E Smirni, D Tiwari
2018 48th Annual IEEE/IFIP International Conference on Dependable Systems …, 2018
Cited by: 80; Year: 2018
Hybrid checkpointing for MPI jobs in HPC environments
C Wang, F Mueller, C Engelmann, SL Scott
2010 IEEE 16th International Conference on Parallel and Distributed Systems …, 2010
Cited by: 80; Year: 2010
Redundant execution of HPC applications with MR-MPI
C Engelmann, S Böhm
Proceedings of the 10th IASTED International Conference on Parallel and …, 2011
Cited by: 77; Year: 2011
Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale
C Engelmann
Future Generation Computer Systems 30, 59-65, 2014
Cited by: 70; Year: 2014
xSim: The extreme-scale simulator
S Böhm, C Engelmann
2011 International Conference on High Performance Computing & Simulation …, 2011
Cited by: 66; Year: 2011
Articles 1–20