Performance Technology for Complex Parallel Systems Part 3 – Alternative Tools and Frameworks Bernd Mohr
Goals Learn about commercial performance analysis products for complex parallel systems - Vampir event trace visualization and analysis tool
- Vampirtrace event trace recording library
- GuideView OpenMP performance analysis tool
- VGV (integrated Vampir / GuideView environment)
Learn about future advanced components for automatic performance analysis and guidance - EXPERT automatic event trace analyzer
Discuss plans for performance tool integration
Vampir
Visualization and Analysis of MPI PRograms
Originally developed by Forschungszentrum Jülich
Current development by Technical University Dresden
Distributed by Pallas, Germany
Vampir: General Description
Offline trace analysis for message passing trace files
Convenient user–interface / easy customization
Scalability in time and processor–space
Excellent zooming and filtering
Display and analysis of MPI and application events:
- User subroutines
- Point–to–point communication
- Collective communication
- MPI–2 I/O operations
Large variety of customizable (via context menus) displays for ANY part of the trace
Vampir: Main Window
Trace file loading can be:
- Interrupted at any time
- Resumed
- Started at a specified time offset
Provides main menu:
- Access to global and process local displays
- Preferences
- Help
Trace file can be re–written (re–grouped symbols)
Vampir: Timeline Diagram
Coloring by group
Message lines can be colored by tag or size
Vampir: Timeline Diagram (Message Info)
Source–code references are displayed if recorded in the trace
Vampir: Support for Collective Communication
For each process: locally mark the operation
Connect start/stop points by lines
Vampir: Collective Communication Display
Vampir: MPI-I/O Support
MPI I/O operations are shown as message lines to a separate I/O system timeline
Vampir: Execution Statistics Displays
Aggregated profiling information: execution time, # calls, inclusive/exclusive
- Available for all/any group (activity)
- Available for all routines (symbols)
- Available for any trace part (select in timeline diagram)
Vampir: Communication Statistics Displays
Bytes sent/received for collective operations
Message length statistics
Vampir: Other Features
Parallelism display
Powerful filtering and trace comparison features
All diagrams highly customizable (through context menus)
Vampir: Process Displays
Vampir: New Features
New Vampir versions (3 and 4)
- New core (dramatic timeline speedup, significantly reduced memory footprint)
- Load–balance analysis display
- Hardware counter value displays
- Thread analysis
  - Shows hardware and grouping structure
- Improved statistics displays
- Raised scalability limits: can now analyse 100s of processes/threads
Vampir: Load Balance Analysis
State Chart display
Aggregated profiling information: execution time, # calls, inclusive/exclusive
- For all/any group (activity)
- For all routines (symbols)
- For any trace part
Vampir: HPM Counter
Vampir: Cluster Timeline
Vampir: Cluster Timeline SMP or Grid Nodes Display
Vampir: Cluster Timeline (2)
Display of messages between nodes enabled
Vampir: Improved Message Statistics Display
Release Schedule
Vampir/SX and Vampirtrace/SX:
- Version 1 available via NEC Japan
- Version 2 is ready for release
Vampir/SC and Vampirtrace/SC:
- Version 3 is available from Pallas
- Version 4 scheduled for Q4/2001
Vampir and Vampirtrace:
- Version 3 is scheduled for Q4/2001
- Version 4 will follow in 2002
Vampir Feature Matrix
Vampirtrace
Commercial product of Pallas, Germany
Library for tracing of MPI and application events:
- Records MPI point-to-point communication
- Records MPI collective communication
- Records MPI–2 I/O operations
- Records user subroutines (on request)
- Records source–code information (some platforms)
- Support for shmem (Cray T3E)
Uses the PMPI profiling interface (see the sketch below)
http://www.pallas.de/pages/vampirt.htm
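For illustration, interception via PMPI works roughly as follows (a minimal sketch, not the actual Vampirtrace source: vt_log_send is a hypothetical helper, and the const-qualified MPI-3 style signature differs in older MPI versions):

  #include <mpi.h>
  #include <stdio.h>

  /* Hypothetical trace helper; stands in for the Vampirtrace internals. */
  static void vt_log_send(int dest, int tag, int bytes)
  {
      fprintf(stderr, "send event: dest=%d tag=%d bytes=%d\n", dest, tag, bytes);
  }

  /* Linked ahead of the MPI library, this wrapper catches all application
     calls to MPI_Send; PMPI_Send invokes the real implementation. */
  int MPI_Send(const void* buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      int size;
      MPI_Type_size(type, &size);
      vt_log_send(dest, tag, count * size);   /* record the event */
      return PMPI_Send(buf, count, type, dest, tag, comm);
  }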
Vampirtrace: Usage
Record MPI–related information
- Re–link a compiled MPI application (no re-compilation):
  {f90,cc,CC} *.o -o myprog -L$(VTHOME)/lib -lVT -lpmpi -lmpi
- Or re–link with the -vt option to the MPICH compiler scripts:
  {mpif90,mpicc,mpiCC} -vt *.o -o myprog
- Execute the MPI binary as usual
Record user subroutines
- Insert calls to the Vampirtrace API (portable, but inconvenient)
- Use automatic instrumentation (NEC SX, Fujitsu VPP, Hitachi SR)
- Use instrumentation tool (Cray PAT, dyninst, ...)
Vampirtrace Instrumentation API (C / C++)
Calls for recording user subroutines
VT calls can only be used between MPI_Init and MPI_Finalize!
Event numbers used must be globally unique
Selective tracing: VT_traceoff(), VT_traceon()
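A minimal usage sketch (assuming the classic VT API names VT_symdef, VT_begin, VT_end; the header name and exact signatures may differ by version, and do_work/setup are hypothetical application routines):

  #include <mpi.h>
  #include <VT.h>

  #define EV_WORK 1000            /* event number: must be globally unique */

  void do_work(void) { /* ... application code ... */ }
  void setup(void)   { /* ... uninteresting phase ... */ }

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);     /* VT calls are valid only after this... */
      VT_symdef(EV_WORK, "work", "Application");  /* number, name, group */

      VT_begin(EV_WORK);          /* record entry into the user routine */
      do_work();
      VT_end(EV_WORK);            /* record exit */

      VT_traceoff();              /* selective tracing: suspend recording */
      setup();
      VT_traceon();               /* resume recording */

      MPI_Finalize();             /* ...and only before this */
      return 0;
  }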
VT++.h – C++ Class Wrapper for Vampirtrace
Same tricks can be used to wrap other C++ tracing APIs
Usage:
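The trick is a small wrapper class whose constructor and destructor bracket a scope (a sketch of the idea, not the literal VT++.h contents; EV_SOLVER stands for a previously defined, globally unique event number):

  // VT_begin on construction, VT_end on destruction: one declaration
  // instruments a whole scope, and the exit event is emitted on every
  // return path, including exceptions.
  class VT_Trace {
  public:
      explicit VT_Trace(int code) : code_(code) { VT_begin(code_); }
      ~VT_Trace()                               { VT_end(code_); }
  private:
      int code_;
  };

  void solver()
  {
      VT_Trace trace(EV_SOLVER);  // hypothetical event number
      // ... body of the routine; no explicit VT_end needed ...
  }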
Vampirtrace Instrumentation API (Fortran)
Calls for recording user subroutines
Selective tracing: VTTRACEOFF(), VTTRACEON()
Vampirtrace: Runtime Configuration
Trace file collection and generation can be controlled by a configuration file:
- Trace file name, location, size, flush behavior
- Activation/deactivation of trace recording for specific processes, activities (groups of symbols), and symbols
Activate a configuration file with environment variables:
- VT_CONFIG: name of the configuration file (use an absolute pathname if possible)
- VT_CONFIG_RANK: MPI rank of the process that reads and processes the configuration file
Reduce trace file sizes:
- Restrict event collection in a configuration file
- Use selective tracing functions
Vampirtrace: Configuration File Example
Be careful to record complete message transfers!
See the Vampirtrace User's Guide for the complete description
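A sketch of such a configuration file (the directive spellings follow the pattern of the User's Guide but vary by version; treat them as illustrative):

  # trace file name and location
  LOGFILE-NAME    myprog.bpv
  LOGFILE-PREFIX  /scratch/traces

  # activate/deactivate recording per activity (group) and symbol
  ACTIVITY  Application  OFF      # suppress user routines
  SYMBOL    MPI_*        ON       # keep all MPI calls: disabling only sends
                                  # or only receives would leave incomplete
                                  # message transfers in the trace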
New Features – Tracing
New Vampirtrace versions (3 and 4)
- New core (significantly reduced memory and runtime overhead)
- Better control of trace buffering and flush files
- New filtering options
- Event recording by thread
- Support of MPI–I/O
- Hardware counter data recording (PAPI)
- Support of process/thread groups
Vampirtrace Feature Matrix
GuideView
Commercial product of KAI
OpenMP performance analysis tool
Looks for OpenMP performance problems:
- Load imbalance, synchronization, false sharing
Works from execution trace(s)
Compile with Guide, link with the instrumented library:
- guidec++ -WGstats myprog.cpp -o myprog
- guidef90 -WGstats myprog.f90 -o myprog
- Run with real input data sets
- View traces with guideview
http://www.kai.com/parallel/kappro/
GuideView: Whole Application View
Compare runs with different:
- Number of processors
- Datasets
- Platforms
GuideView: Per Thread View
GuideView: Per Section View
GuideView: Analysis of Hybrid Applications
Generate a separate Guide execution trace for each node:
- Run with a node-local file system as the current directory
- Set the trace file name with the environment variable KMP_STATSFILE
  - Point to a file in the node-local file system: KMP_STATSFILE=/node-local/guide.gvs
  - Or use special meta-character sequences (%H: hostname, %I: pid, %P: number of threads used): KMP_STATSFILE=guide-%H.gvs
Use "compare-multiple-run" feature to display together Just a hack, better: use VGV!
VGV – Architecture
Combines well–established tools:
- Guide and GuideView from KAI/Intel
- Vampir/Vampirtrace from Pallas
Guide compiler inserts instrumentation
Guide runtime system collects thread statistics
PAPI is used to collect HPM data
Vampirtrace handles event–based performance data acquisition and storage
Vampir is extended by GuideView–style displays
VGV – Architecture
VGV – Usage
Use the Guide compilers by KAI:
- guidef77, guidef90
- guidec, guidec++
Include instrumentation flags (links with Guide RTS and Vampirtrace)
Instrumentation can record:
- Parallel regions
- MPI activity
- Application routine calls
- HPM data
Trace file collection and generation controlled by configuration file
Vampir: MPI Performance Analysis
GuideView: OpenMP Performance Analysis
Vampir: Detailed Thread Analysis
Availability and Roadmap
Beta version available (register with Pallas or KAI/Intel):
- IBM SP running AIX
- IA 32 running Linux
- Compaq Alpha running Tru64
General release scheduled for Q1/2002
Improvements in the pipeline:
- Scalability enhancements
- Ports to other platforms
KOJAK Overview
Kit for Objective Judgement and Automatic Knowledge-based detection of bottlenecks
Long-term goal: design and implementation of a portable, generic, automatic performance analysis environment
Current focus:
- Event tracing
- Clusters of SMPs
- MPI, OpenMP, and hybrid programming models
http://www.fz-juelich.de/zam/kojak/
Motivation: Automatic Performance Analysis
Motivation: Automatic Performance Analysis (2)
Automatic Analysis Example: Late Sender
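The pattern: the receiver enters MPI_Recv before the matching MPI_Send has started, so the time between the two enter events is lost as idle time; this is exactly the quantity computed by the EXPERT pattern script shown later.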
Automatic Analysis Example (2): Wait at NxN
EXPERT: Current Architecture
Event Tracing
Event Processing, Investigation, and LOGging (EPILOG)
Open (public) event trace format and API for reading/writing trace records
Event types: region enter and exit, collective region enter and exit, message send and receive, parallel region fork and join, and lock acquire and release
Supports:
- Hierarchical cluster hardware
- Source code information
- Performance counter values
Thread-safe implementation
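For illustration only, a region-enter record carries information along these lines (a hypothetical sketch; field names and layout are not the actual EPILOG format):

  #include <stdint.h>

  /* Hypothetical sketch of an EPILOG-style "region enter" record. */
  typedef struct {
      double   time;        /* timestamp */
      uint32_t location;    /* machine/node/process/thread id */
      uint32_t region;      /* id of the region being entered */
      uint8_t  ncounters;   /* number of counter values that follow */
      uint64_t counters[8]; /* optional hardware performance counter values */
  } EnterRecord;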
Instrumentation
Instrument user application with EPILOG calls
Done: basic instrumentation
- User functions and regions:
  - undocumented PGI compiler (and manual) instrumentation
- MPI calls:
  - wrapper library utilizing PMPI
- OpenMP:
  - source-to-source instrumentation
Future work:
- Tools for Fortran, C, C++ user function instrumentation
- Object code and dynamic instrumentation
Instrumentation of OpenMP Constructs
OPARI: OpenMP Pragma And Region Instrumentor
Source-to-source translator that inserts POMP calls around OpenMP constructs and API functions
Done: supports
- Fortran77 and Fortran90, OpenMP 2.0
- C and C++, OpenMP 1.0
- POMP Extensions
- EPILOG and TAU POMP implementations
- Preserves source code information (#line <line> "<file>" directives)
Work in Progress: Investigating standardization through OpenMP Forum
POMP OpenMP Performance Tool Interface
OpenMP Instrumentation:
- OpenMP Directive Instrumentation
- OpenMP Runtime Library Routine Instrumentation
POMP Extensions - Runtime Library Control (init, finalize, on, off)
- (Manual) User Code Instrumentation (begin, end)
- Conditional Compilation (#ifdef _POMP, !$P)
- Conditional / Selective Transformations ([no]instrument)
Example: !$OMP PARALLEL DO
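The same transformation applies to the C binding (#pragma omp parallel for); the sketch below follows the style of the TAU example later in this section, with the pomp_* call names, the generated region descriptor omp_rd_1, and the routine daxpy all being illustrative assumptions:

  /* What an OPARI-style transformation makes of "#pragma omp parallel for". */
  extern OMPRegDescr omp_rd_1;        /* region descriptor generated by OPARI */

  void daxpy(double* a, const double* b, int n)
  {
      int i;
      pomp_parallel_fork(&omp_rd_1);      /* master is about to fork the team */
      #pragma omp parallel
      {
          pomp_parallel_begin(&omp_rd_1); /* each thread enters the region */
          pomp_for_enter(&omp_rd_1);
          #pragma omp for nowait          /* implicit barrier removed here... */
          for (i = 0; i < n; ++i)
              a[i] += b[i];
          pomp_barrier_enter(&omp_rd_1);
          #pragma omp barrier             /* ...and made explicit, so waiting */
          pomp_barrier_exit(&omp_rd_1);   /* time becomes measurable          */
          pomp_for_exit(&omp_rd_1);
          pomp_parallel_end(&omp_rd_1);
      }
      pomp_parallel_join(&omp_rd_1);      /* team joined again */
  }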
OpenMP API Instrumentation
Transform:
- omp_#_lock() → pomp_#_lock()
- omp_#_nest_lock() → pomp_#_nest_lock()
  [ # = init | destroy | set | unset | test ]
POMP version:
- Calls the omp version internally
- Can do extra work before and after the call (see the sketch below)
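A sketch of such a wrapper (pomp_lock_event is a hypothetical trace helper; the real POMP runtime records proper enter/exit events):

  #include <omp.h>

  static void pomp_lock_event(const char* what) { /* write trace record */ }

  /* Same signature as the omp routine: record an event, forward to
     omp_set_lock, record completion (i.e., the lock was acquired). */
  void pomp_set_lock(omp_lock_t* lock)
  {
      pomp_lock_event("set_lock enter");
      omp_set_lock(lock);                 /* calls the omp version internally */
      pomp_lock_event("set_lock exit");
  }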
Example: TAU POMP Implementation

  TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

  void pomp_for_enter(OMPRegDescr* r) {
  #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_START(tfor)
  #endif
  #ifdef TAU_OPENMP_REGION_VIEW
      TauStartOpenMPRegionTimer(r);
  #endif
  }

  void pomp_for_exit(OMPRegDescr* r) {
  #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_STOP(tfor)
  #endif
  #ifdef TAU_OPENMP_REGION_VIEW
      TauStopOpenMPRegionTimer(r);
  #endif
  }
OPARI: Basic Usage (f90)
Reset OPARI state information
Call OPARI for each input source file:
  opari file1.f90
  ...
  opari fileN.f90
Generate the OPARI runtime table and compile it with an ANSI C compiler:
  opari -table opari.tab.c
  cc -c opari.tab.c
Compile the modified files *.mod.f90 using OpenMP
Link the resulting object files, the OPARI runtime table opari.tab.o, and the TAU POMP RTL
OPARI: Makefile Template (C/C++)
OPARI: Makefile Template (Fortran)
Automatic Analysis
EXtensible PERformance Tool (EXPERT)
Programmable, extensible, flexible performance property specification
Based on event patterns
Analyzes along three hierarchical dimensions:
- Performance properties (general → specific)
- Dynamic call tree position
- Location (machine → node → process → thread)
Done: fully functional demonstration prototype
Example: Late Sender (blocked receiver)
Example: Late Sender (2)

  class LateSender(Pattern):                  # derived from class Pattern

      def parent(self):                       # "logical" parent at property level
          return "P2P"

      def recv(self, recv):                   # callback for recv events
          recv_start = self._trace.event(recv['enterptr'])
          if (self._trace.region(recv_start['regid'])['name'] == "MPI_Recv"):
              send = self._trace.event(recv['sendptr'])
              send_start = self._trace.event(send['enterptr'])
              if (self._trace.region(send_start['regid'])['name'] == "MPI_Send"):
                  idle_time = send_start['time'] - recv_start['time']
                  if idle_time > 0:
                      locid = recv_start['locid']
                      cnode = recv_start['cnodeptr']
                      self._severity.add(cnode, locid, idle_time)
Performance Properties (1)
[100% = (time of last event - time of first event) * number of locations]
Total = Execution + Idle Threads time
- Execution: sum of exclusive time spent in each region
- Idle Threads: time wasted in idle threads while executing "sequential" code
Execution:
- MPI: time spent in MPI functions
- OpenMP: time spent in OpenMP regions and API functions
- I/O: time spent in (sequential) I/O
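Worked example (illustrative numbers): a trace spanning 50 s across 4 locations defines 100% = 200 s; if the summed exclusive execution time over all locations is 150 s, the remaining 50 s (25%) is attributed to Idle Threads.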
Performance Properties (2)
MPI:
- Communication: sum of Collective, P2P, 1-sided
- Collective: time spent in MPI collective communication operations
- P2P: time spent in MPI point-to-point communication operations
- 1-sided: time spent in MPI one-sided communication operations
- I/O: time spent in MPI parallel I/O functions (MPI_File*)
- Synchronization: time spent in MPI_Barrier
Performance Properties (3)
Collective:
- Early Reduce: time wasted in the root of an N-to-1 operation by waiting for the first sender (MPI_Gather, MPI_Gatherv, MPI_Reduce)
- Late Broadcast: time wasted by waiting for the root sender in a 1-to-N operation (MPI_Scatter, MPI_Scatterv, MPI_Bcast)
- Wait at N x N: time spent waiting for the last participant at an N-to-N operation (MPI_All*, MPI_Scan, MPI_Reduce_scatter)
Performance Properties (4)
P2P:
- Late Receiver: blocked sender
  - Messages in Wrong Order: receiver too late because it is waiting for another message from the same sender
- Late Sender: blocked receiver
  - Messages in Wrong Order: receiver blocked because it is waiting for another message from the same sender
- Patterns related to non-blocking communication
- Too many small messages
Performance Properties (5)
OpenMP:
- Synchronization: time spent in OpenMP barrier and lock operations
  - Barrier: time spent in OpenMP barrier operations
    - Implicit
      - Load Imbalance at Parallel Do, Single, Workshare
      - Not Enough Sections
    - Explicit
  - Lock Competition: time wasted in omp_set_lock by waiting for lock release
  - Flush
Expert Result Presentation
Interconnected weighted tree browser
Scalable, yet still accurate
Each node has a weight:
- Percentage of CPU allocation time
  - i.e., time spent in the subtree of the call tree
- Collapsed (including weight of descendants)
- Expanded (without weight of descendants)
Displayed using:
- Color: allows easy identification of hot spots (bottlenecks)
- Numerical value: detailed comparison
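For example (illustrative numbers): a node contributing 10% itself, with descendants accounting for another 30%, displays 40% when collapsed but only 10% when expanded.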
Performance Properties View
Dynamic Call Tree View
Property “Idle Threads”:
- Mapped to the call graph location of the master thread
- Highlights phases of “sequential” execution
Locations View
Supports locations up to Grid scale
Easily allows exploration of load balance problems on different levels
[ Of course, the Idle Threads problem only applies to slave threads ]
Performance Properties View (2)
Interconnected weighted trees: selecting another node in one tree affects the tree display to the right of it
Dynamic Call Tree View
Locations View (2): Relative View
Automatic Performance Analysis
APART: Automatic Performance Analysis: Resources and Tools
http://www.fz-juelich.de/apart/
ESPRIT Working Group 1999–2000
IST Working Group 2001–2004
16 members worldwide
Prototype tools (Paradyn, Kappa-PI, Aurora, Peridot, KOJAK/EXPERT, TAU)
Performance Analysis Tool Integration
Complex systems pose challenging performance analysis problems that require robust methodologies and tools
New performance problems will arise:
- Instrumentation and measurement
- Data analysis and presentation
- Diagnosis and tuning
No one performance tool can address all concerns
Look towards an integration of performance technologies:
- Evolution of technology frameworks to address problems
- Integration support to link technologies to create performance problem solving environments