United States Patent Application20020129339
Kind CodeA1
Callahan, Charles David II ; et al.September 12, 2002

Parallelism performance analysis based on execution trace information
Abstract
A system for conducting performance analysis for executing tasks. The analysis involves generating a variety of trace information related to performance measures, including parallelism-related information, during execution of the task. In order to generate the trace information, target source code of interest is compiled in such a manner that executing the resulting executable code will generate execution trace information composed of a series of events. Each event stores trace information related to a variety of performance measures for the one or more processors and protection domains used. After the execution trace information has been generated, the system can use that trace information and a trace information description file to produce useful performance measure information. The trace information description file contains information that describes the types of execution events as well as the structure of the stored information. The system uses the trace information description file to organize the information in the trace information file, extracts a variety of types of performance measure information from the organized trace information, and formats the extracted information for display. The system can use default or user-defined functions to extract and format trace information for display. After the system displays one or more types of performance measure information, a user of the system can then interact with the system in a variety of ways to obtain other useful performance analysis information.

Inventors:Callahan; Charles David II (Seattle, WA), Shields; Keith Arnett  (Seattle, WA), Briggs; Preston Pengra III  (Seattle, WA)
Correspondence Name and Address:PATENT-SEA P.O. BOX 1247
PERKINS COIE LLP
SEATTLE
WA
98111-1247
US
Series Code:825434
Filed:April 3, 2001
U.S. Current Class:717/127
U.S. Class at Publication:717/127
Intern'l Class:G06F 009/44

Claims


1. A computer-implemented method for analyzing trace information generated during execution of multiple threads of a software program on a first computer, the first computer having multiple processors that each have multiple protection domains that are each able to execute at least one of the multiple threads, each processor having a counter indicating a number of instruction holes during which an instruction is not executed by the processor, each protection domain having a counter indicating a number of instructions issued in the protection domain by all executing threads, the method comprising: receiving an indication of trace information reflecting a series of events that occurred during the execution, each event associated with execution of one of the multiple threads by one of the protection domains of one of the processors and each event having associated values in the trace information of variables maintained by the executing software program, by the one protection domain, and/or by the one processor; for each of a plurality of periods of time during which the execution was occurring, determining from the trace information a number of instructions executed for the software program during the period of time by identifying multiple protection domains that each executed at least one of the multiple threads during at least a portion of the period of time; for each of the identified protection domains, determining a change in the value of the issued instructions counter of the protection domain during the period of time; determining if all of the instructions issued in the protection domain during the period of time were for one of the multiple threads; when it is determined that all of the instructions issued in the protection domain during the period of time were for one of the multiple threads, calculating a value for the number of instructions executed for the software program during the period of time by the protection domain to be the determined change; and when it is determined that all of the instructions issued in the protection domain during the period of time were not for one of the multiple threads, calculating a value for the number of instructions executed for the software program during the period of time by the protection domain to be a portion of the determined change that corresponds to a portion of the period of time during which at least one thread for the software program was executing in the protection domain; and determining the number of instructions executed for the software program during the period of time to be a sum of the calculated values for each of the identified protection domains; and determining from the trace information a number of instruction slots available for execution of the instructions of software program during the period of time by identifying processors that each executed at least one of the multiple threads during the period of time; for each of the identified processors, determining a change in the value of the instruction holes counter of the processor during the period of time; and if all of the instruction holes that occurred during the period of time were attributable to the software program, calculating a value for the number of instruction holes for the processor that are attributable to the software program during the period of time to be the determined change in the value of the instruction holes counter; calculating a value for the number of instruction holes that are attributable to the software program during the period of time by all of the identified processors to be a sum of the calculated values for each of the identified processors; and determining the number of instruction slots available for execution of the instructions of software program during the period of time to be a sum of the determined number of instructions executed for the software program during the period of time and of the calculated value for the number of instruction holes that are attributable to the software program during the period of time; and presenting to a user an indication of the determined number of executed instructions for each of the periods of time and an indication of the determined number of available instruction slots for each of the periods of time.

2. The method of claim 1 wherein only one software program can execute in a protection domain at any point in time, and wherein when it is determined that all of the instructions issued in a protection domain during a period of time were not for one of the multiple threads, the calculating of a value for the number of instructions executed for the software program during the period of time by the protection domain includes: determining from the trace information at least one swap event that occurred in the protection domain during the period of time such that the software program is swapped into the protection domain so as to commence execution of the software program or such that the software program is swapped out of the protection domain so as to suspend execution of the software program; retrieving for each of the determined swap events an associated value in the trace information of the issued instructions counter of the protection domain; and using the retrieved associated values to calculate the value for the number of instructions executed for the software program during the period of time by the protection domain to include only increments to the issued instructions counter that occurred while the software program is swapped into the protection domain.

3. The method of claim 1 wherein, for at least one of the identified protection domains for at least one of the periods of time, there are no variable values in the trace information indicating a value for the issued instructions counter of that protection domain at an end of that period of time, and wherein the determining of a second value for the issued instructions counter of that protection domain at the end of that period of time includes estimating the second value based on an extrapolation between earlier and later values for that issued instructions counter.

4. The method of claim 1 wherein, for at least one of the identified processors for at least one of the periods of time, there are no variable values in the trace information indicating a value for the instruction holes counter of that processor at an end of that period of time, and wherein the determining of a second value for the instruction holes counter of that processor at the end of that period of time includes estimating the second value based on an extrapolation between earlier and later values for that instruction holes counter.

5. The method of claim 1 wherein, for at least one of the identified protection domains for at least one of the periods of time, no event occurred during the execution of the software program for that protection domain for that period of time, and including estimating the first and second values for the issued instructions counter of that protection domain.

6. The method of claim 1 wherein, for at least one of the identified processors for at least one period of time, at least one other software program is executing during that period of time, and including, for each of the at least one identified processors: when it is determined that all of the instruction holes that occurred during that period of time were not attributable to the software program, calculating a value for the number of instruction holes for that processor that are attributable to the at least one other software programs during that period of time; and calculating a value for the number of instruction holes for that processor that are attributable to the software program during that period of time to be a difference of a total number of instruction holes for that processor during that period of time and the calculated value for the number of instruction holes for that processor that are attributable to the at least one other software programs.

7. The method of claim 6 wherein the at least one other software programs include only an operating system program, and wherein the calculated value for the number of instruction holes for each processor that are attributable to the operating system are zero.

8. The method of claim 1 wherein at least one of processors performing the execution has multiple streams performing the execution such that each of the multiple streams executes at least one of the threads, and including displaying information about the streams.

9. The method of claim 1 wherein the presenting to the user of the indication of the determined number of executed instructions for each of the periods of time and of the indication of the determined number of available instruction slots for each of the periods of time includes displaying a graph including the indications.

10. The method of claim 9 wherein the displayed indication of the determined number of available instruction slots for each period of time includes a displayed indication of the calculated number of instruction holes that are attributable to the software program during the period of time, with the calculated number of instruction holes displayed in such a manner that a user can visually aggregate the displayed indication of the determined number of executed instructions for that period of time with the displayed indication of calculated number of instruction holes for that period of time.

11. The method of claim 9 wherein the displayed graph includes a time-based axis, and wherein the displayed indications of the determined number of executed instructions and the determined number of available instruction slots for each of the periods of time are points on the graph.

12. The method of claim 11 wherein the displayed graph includes an origin with at least two axes, and including, after the displaying of the indications of the determined number of executed instructions and of the determined number of available instruction slots, redefining at least one of the axes based on a new indicated displayed location.

13. The method of claim 1 including, for at least one of the periods of time, presenting an indication of a logical code block of the software program that was executing during that period of time.

14. The method of claim 13 wherein the presented indication of the logical code block is a name of the logical code block.

15. The method of claim 13 wherein the presented indication of the logical code block is source code of the logical code block.

16. The method of claim 13 wherein the logical code block is a function.

17. The method of claim 1 including presenting at least some of the variable values from the trace information in a tabular format.

18. The method of claim 1 wherein the number of processors identified during each of the periods of time is greater than one, and wherein the determined number of available instruction slots for each of the periods of time is the identified number of processors for that period of time.

19. The method of claim 1 wherein a first number of processors identified during a first period of time is distinct from a second number of processors identified during a second period of time.

20. The method of claim 1 wherein the identified number of processors for at least one of the periods of time is greater than one, and wherein information for each of the processors is aggregated during the presenting of information for those periods of time.

21. The method of claim 20 including normalizing the information for each of the processors prior to the presenting of the information by determining a value for the instruction holes counter of each of the processors at a beginning of the execution of the software program and by subtracting the determined value for each instruction hole counter from each later value of that instruction hole counter.

22. The method of claim 1 wherein information to be presented is determined based on a specification provided by a user.

23. The method of claim 22 wherein the specification is used at the time of the presenting to determine the information to be presented.

24. The method of claim 1 including presenting information about memory references performed during at least one of the periods of time.

25. The method of claim 1 including presenting information about FLOPS performed during at least one of the periods of time.

26. The method of claim 1 including presenting information about a number of the threads in existence during at least one of the periods of time.

27. The method of claim 1 including presenting information about a number of the threads that are blocked during at least one of the periods of time.

28. The method of claim 1 including presenting information about a number of the threads that are ready for execution during at least one of the periods of time.

29. The method of claim 1 including presenting information about contention for locks during at least one of the periods of time.

30. The method of claim 1 wherein the presenting of information is in response to a request by a user.

31. The method of claim 1 wherein the presenting of information is performed automatically without user intervention.

32. The method of claim 1 including generating the indicated trace information.

33. The method of claim 1 wherein the software program is an executable version of a source code program such that execution of the executable version will generate the indicated trace information.

34. The method of claim 33 including generating the executable version by compiling the source code program such that a group of instructions is added to the source code program at a location specified by the user, the added instructions when executed to generate trace information for a type of event specified by the user.

35. The method of claim 33 including generating the executable version by compiling the source code program such that groups of instructions are added to the source code program at a plurality of automatically identified locations of the software, the added instructions when executed to generate trace information of a specified type.

36. The method of claim 35 wherein the automatically identified locations include beginnings of functions having more lines than a threshold.

37. The method of claim 35 wherein the automatically identified locations include ends of functions having more lines than a threshold.

38. The method of claim 35 wherein the automatically identified locations include locations where a compiler adds instructions related to parallelizing the execution of the executable version.

39. The method of claim 38 wherein the instructions related to parallelizing the execution of the executable version are instructions related to a fork.

40. The method of claim 38 wherein the instructions related to parallelizing the execution of the executable version are instructions related to a join.

41. The method of claim 34 wherein, for at least some of the groups of instructions to be added to the executable program to generate trace information of a specified type, an additional group of instructions is added to the source code program that when executed create a descriptor object for that group of instructions.

42. The method of claim 41 wherein the additional groups of instructions are added to the source code program in such a manner that the additional groups of instructions will be executed before any of the groups of instructions.

43. The method of claim 1 including generating the indicated trace information by executing the executable version.

44. The method of claim 43 wherein, during the executing of the executable version, execution of each of the added groups of instructions adds current values of one or more of the variables to the generated trace information.

45. The method of claim 43 wherein, during the executing of the executable version and before the execution of an added group of instructions, an additional group of added instructions is executed and creates a descriptor object for that added group of instructions.

46. The method of claims 43 wherein information is added to the generated indicated trace information by an executing program other than the software program.

47. The method of claim 46 wherein the other executing program is an operating system.

48. The method of claim 46 wherein the other executing program is a background daemon.

49. The method of claim 1 including analyzing the trace information to extract execution information corresponding to the events that occurred during the execution.

50. The method of claim 49 wherein the analyzing of the trace information includes using a description of specified types of events to identify groups of trace information each reflecting execution of a group of added instructions.

51. The method of claim 50 wherein the analyzing of the trace information includes, when information for an event in the trace information is separated in multiple non-contiguous portions, creating a mapping to assist in retrieving the information for the event.

52. The method of claim 51 including using the identified groups of trace information and the created mapping to extract current values for execution information.

53. The method of claim 49 wherein the analyzing of the trace information includes modifying current values of events so as to be normalized with respect to beginning of the execution.

54. The method of claim 49 wherein the analyzing of the trace information includes modifying current values of events so that the values reflect only the execution of the multiple threads.

55. The method of claim 49 including using information retrieved from created descriptor objects in the trace information.

56. A computer system for analyzing execution of multiple threads of a software program on a first computer, the first computer having multiple processors that each have multiple protection domains that are each able to execute at least one of the multiple threads, each processor having a counter indicating a number of instruction slot holes during which an instruction is not executed by the processor, each protection domain having a counter indicating a number of instructions executed in the protection domain by all executing threads, comprising: means for analyzing that receives an indication of trace information reflecting a series of events that occurred during the execution, each event associated with execution of one of the multiple threads by one of the protection domains of one of the processors and each event having associated values in the trace information of variables maintained by the executing software program, by the one protection domain, and/or by the one processor; for each of a plurality of periods of time during which the execution was occurring, determines from the trace information a number of instructions executed for the software program during the period of time by identifying multiple protection domains that each executed at least one of the multiple threads during at least a portion of the period of time; calculating for each of the identified protection domains a value for the number of instructions executed for the software program during the period of time by the protection domain to be a portion of a change in the executed instructions counter of the protection domain during the period of time such that the portion of the change corresponds to a portion of the period of time during which at least one thread for the software program was executing in the protection domain; and determining the number of instructions executed for the software program during the period of time to be a sum of the calculated values for each of the identified protection domains; and for each of the plurality of periods of time during which the execution was occurring, determines from the trace information a number of instruction slots available for execution of the instructions of software program during the period of time by identifying processors that each executed at least one of the multiple threads during the period of time; calculating for each of the identified processors a value for the number of instruction slot holes for the processor that are attributable to the software program during the period of time to be a portion of a change in the instruction slot holes counter of the processor during the period of time such that other portions of the change corresponding to execution of other software programs by the processor are not included in the portion; and determining the number of instruction slots available for execution of the instructions of software program during the period of time to be a sum of the determined number of instructions executed for the software program during the period of time and of a sum of the calculated values for each of the identified processors for the period of time; and means for displaying an indication of the determined number of executed instructions for each of the periods of time and an indication of the determined number of available instruction slots for each of the periods of time.

57. A computing device for analyzing execution of multiple threads of a software program on a first computer, the first computer having multiple processors that each have multiple protection domains that are each able to execute at least one of the multiple threads, each processor having a counter indicating a number of instruction slot holes during which an instruction is not executed by the processor, each protection domain having a counter indicating a number of instructions executed in the protection domain by all executing threads, comprising: a trace information receiver component capable of receiving an indication of trace information reflecting a series of events that occurred during the execution, each event associated with execution of one of the multiple threads by one of the protection domains of one of the processors and each event having associated values in the trace information of variables maintained by the executing software program, by the one protection domain, and/or by the one processor; a trace information analysis component capable of, for each of a plurality of periods of time during which the execution was occurring, determining from the trace information a number of instructions executed for the software program during the period of time by identifying multiple protection domains that each executed at least one of the multiple threads during at least a portion of the period of time; for each of the identified protection domains, determining a change in the value of the issued instructions counter of the protection domain during the period of time; determining if all of the instructions issued in the protection domain during the period of time were for one of the multiple threads; when it is determined that all of the instructions issued in the protection domain during the period of time were for one of the multiple threads, calculating a value for the number of instructions executed for the software program during the period of time by the protection domain to be the determined change; and when it is determined that all of the instructions issued in the protection domain during the period of time were not for one of the multiple threads, calculating a value for the number of instructions executed for the software program during the period of time by the protection domain to be a portion of the determined change that corresponds to a portion of the period of time during which at least one thread for the software program was executing in the protection domain; and determining the number of instructions executed for the software program during the period of time to be a sum of the calculated values for each of the identified protection domains; and determining from the trace information a number of instruction slots available for execution of the instructions of software program during the period of time by identifying processors that each executed at least one of the multiple threads during the period of time; for each of the identified processors, determining a change in the value of the instruction holes counter of the processor during the period of time; and if all of the instruction holes that occurred during the period of time were attributable to the software program, calculating a value for the number of instruction holes for the processor that are attributable to the software program during the period of time to be the determined change in the value of the instruction holes counter; calculating a value for the number of instruction holes that are attributable to the software program during the period of time by all of the identified processors to be a sum of the calculated values for each of the identified processors; and determining the number of instruction slots available for execution of the instructions of software program during the period of time to be a sum of the determined number of instructions executed for the software program during the period of time and of the calculated value for the number of instruction holes that are attributable to the software program during the period of time; and a presentation component capable of presenting to a user an indication of the determined number of executed instructions for each of the periods of time and an indication of the determined number of available instruction slots for each of the periods of time.

58. The computing device of claim 57 wherein the trace information receiver component, trace information analysis component and presentation component are executing in memory of the computing device.

59. A computer-readable medium whose contents cause a computing device to analyze execution of multiple threads of a software program on a first computer, the first computer having multiple processors that each have multiple protection domains that are each able to execute at least one of the multiple threads, each processor having a counter indicating a number of instruction holes during which an instruction is not executed by the processor, each protection domain having a counter indicating a number of instructions issued in the protection domain by all executing threads, by: receiving an indication of trace information reflecting a series of events that occurred during the execution, each event associated with execution of one of the multiple threads by one of the protection domains of one of the processors and each event having associated values in the trace information of variables maintained by the executing software program, by the one protection domain, and/or by the one processor; for each of a plurality of periods of time during which the execution was occurring, determining from the trace information a number of instructions executed for the software program during the period of time by identifying multiple protection domains that each executed at least one of the multiple threads during at least a portion of the period of time; for each of the identified protection domains, determining a change in the value of the issued instructions counter of the protection domain during the period of time; determining if all of the instructions issued in the protection domain during the period of time were for one of the multiple threads; when it is determined that all of the instructions issued in the protection domain during the period of time were for one of the multiple threads, calculating a value for the number of instructions executed for the software program during the period of time by the protection domain to be the determined change; and when it is determined that all of the instructions issued in the protection domain during the period of time were not for one of the multiple threads, calculating a value for the number of instructions executed for the software program during the period of time by the protection domain to be a portion of the determined change that corresponds to a portion of the period of time during which at least one thread for the software program was executing in the protection domain; and determining the number of instructions executed for the software program during the period of time to be a sum of the calculated values for each of the identified protection domains; and determining from the trace information a number of instruction slots available for execution of the instructions of software program during the period of time by identifying processors that each executed at least one of the multiple threads during the period of time; for each of the identified processors, determining a change in the value of the instruction holes counter of the processor during the period of time; and if all of the instruction holes that occurred during the period of time were attributable to the software program, calculating a value for the number of instruction holes for the processor that are attributable to the software program during the period of time to be the determined change in the value of the instruction holes counter; calculating a value for the number of instruction holes that are attributable to the software program during the period of time by all of the identified processors to be a sum of the calculated values for each of the identified processors; and determining the number of instruction slots available for execution of the instructions of software program during the period of time to be a sum of the determined number of instructions executed for the software program during the period of time and of the calculated value for the number of instruction holes that are attributable to the software program during the period of time; and presenting to a user an indication of the determined number of executed instructions for each of the periods of time and an indication of the determined number of available instruction slots for each of the periods of time.

60. The computer-readable medium of claim 59 wherein the computer-readable medium is a memory of a computer.

61. The computer-readable medium of claim 59 wherein the computer-readable medium is a transmission medium containing generated data that includes the contents.

62. A method for generating trace information reflecting a series of events that occurred during execution of multiple software threads on a first computer, the method comprising: receiving an indication of a software program for which trace information is to be generated; receiving indications from a user of one or more locations within the software program at which trace information is to be generated and of one or more types of event each indicating distinct types of trace information; and automatically producing an executable version corresponding to the software program such that when executed the produced executable version will generate trace information corresponding to multiple events of the types indicated by the user, the producing of the executable version including adding multiple groups of instructions to the software program at the locations specified by the user, each of the added groups of instructions corresponding to one of the types of events indicated by the user such that when executed that added group of instructions will generate the distinct types of trace information for that type of event, the produced executable version having one or more portions which can be executed in parallel with multiple software threads, so that execution of the produced executable version will generate trace information reflecting a series of events that occurred during execution.

63. The method of claim 62 including executing the produced executable version to generate the trace information.

64. The method of claim 62 wherein the software program is a source code program, and wherein the producing of the executable version includes compiling the source code program.

65. The method of claim 64 wherein the multiple groups of instructions are added to the software program before the compiling.

66. The method of claim 62 wherein the producing of the executable version includes adding groups of instructions at a plurality of automatically identified locations of the software, the added groups of instructions to each generate trace information of a specified type when executed.

67. The method of claim 66 wherein the automatically identified locations include beginnings of functions having more lines than a threshold.

68. The method of claim 66 wherein the automatically identified locations include ends of functions having more lines than a threshold.

69. The method of claim 66 wherein the automatically identified locations include locations where a compiler adds instructions related to parallelizing the execution of the executable version.

70. The method of claim 69 wherein the instructions related to parallelizing the execution of the executable version are instructions related to a fork.

71. The method of claim 69 wherein the instructions related to parallelizing the execution of the executable version are instructions related to a join.

72. The method of claim 62 wherein, for at least some of the groups of instructions added to the executable program, additional instructions are added to the executable program that when executed creates a descriptor object for that group of instructions.

73. The method of claim 72 wherein the additional groups of instructions are added in such a manner as to be executed before any of the added groups of instructions.

74. The method of claim 62 wherein the first computer has multiple processors that each have multiple protection domains each able to execute at least one of the multiple threads.

75. The method of claim 62 wherein the first computer provides multiple sources of information related to hardware aspects of the first computer that vary with execution of programs on the first computer, and wherein the distinct types of trace information for some of the types of events includes at least one of the sources of information related to the hardware aspects.

76. The method of claim 62 wherein the first computer provides multiple sources of information related to software properties of the first computer that vary with execution of programs on the first computer, and wherein the distinct types of trace information for some of the types of events includes at least one of the sources of information related to the software properties.

77. A computer-readable medium containing instructions that when executed cause a computing device to generate trace information reflecting a series of events that occurred during execution of multiple software threads on a first computer, by: receiving an indication of a software program for which trace information is to be generated; receiving indications of one or more locations within the software program at which trace information is to be generated and of one or more types of event each indicating distinct types of trace information to be generated; and automatically producing an executable version corresponding to the software program such that when executed the produced executable version will generate trace information corresponding to multiple events of the indicated types, the producing of the executable version including adding multiple groups of instructions to the software program at the locations specified by the user, each of the added groups of instructions corresponding to one of the indicated types of events such that when executed that added group of instructions will generate the distinct types of trace information for that type of event, the produced executable version having one or more portions which can be executed in parallel with multiple software threads.

78. A computing device for generating trace information reflecting a series of events that occurred during execution of multiple software threads on a first computer, comprising: an input receiver component capable of receiving an indication of a software program for which trace information is to be generated, and of receiving indications from a user of one or more locations within the software program at which trace information is to be generated and of one or more types of event each indicating distinct types of trace information; and an executable generator component capable of producing an executable version corresponding to the software program such that when executed the produced executable version will generate trace information corresponding to multiple events of the types indicated by the user, the producing of the executable version including adding multiple groups of instructions to the software program at the locations specified by the user, each of the added groups of instructions corresponding to one of the types of events indicated by the user such that when executed that added group of instructions will generate the distinct types of trace information for that type of event, the produced executable version having one or more portions which can be executed in parallel with multiple software threads.

79. A method for generating trace information reflecting a series of events that occurred during execution of multiple software threads on a first computer, the method comprising: receiving an indication of an executable software program that when executed will generate trace information corresponding to multiple events of specified types, the executable software program including multiple groups of instructions added at specified locations and each corresponding to one of the specified types of events; and generating the trace information by executing the executable software program using multiple software threads on the first computer in such a manner that each of the added group of instructions are executed at least once, each execution of an added group of instructions corresponding to a specified type of event generating trace information of the type for that type of event.

80. The method of claim 79 including, before the generating of the trace information by the executing of the executable software, producing the executable software program from software source code in such a manner that the multiple groups of instructions are added at the specified locations.

81. The method of claim 79 wherein the specifications of the locations and of the types of events for the multiple added groups of instructions are received from a user.

82. The method of claim 79 wherein, during the executing of the executable software program, the execution of each of the added groups of instructions adds current values of one or more variables to the generated trace information, the one or more variables maintained by the executing software program and/or by the first computer.

83. The method of claim 79 wherein the first computer has multiple processors that each have multiple protection domains each able to execute at least one of the multiple threads.

84. The method of claim 79 wherein, during the executing of the executable software program and before the execution of an added group of instructions, an additional group of added instructions is executed and creates a descriptor object for that added group of instructions.

85. The method of claims 79 wherein information is added to the generated indicated trace information by an executing program other than the executable software program.

86. The method of claim 85 wherein the other executing program is an operating system.

87. The method of claim 85 wherein the other executing program is a background daemon.

88. A computer-readable medium whose contents cause a computing device to generate trace information reflecting a series of events that occurred during execution of multiple software threads on a first computer, by: receiving an indication of an executable software program that when executed will generate trace information corresponding to multiple events of specified types, the executable software program including multiple groups of instructions added at specified locations and each corresponding to one of the specified types of events; and generating the trace information by executing the executable software program using multiple software threads on the first computer in such a manner that each of the added group of instructions are executed at least once, each execution of an added group of instructions corresponding to a specified type of event generating trace information of the type for that type of event.

89. A computing device for generating trace information reflecting a series of events that occurred during execution of multiple software threads on a first computer, comprising: an input receiver component capable of receiving an indication of an executable software program that when executed will generate trace information corresponding to multiple events of specified types, the executable software program including multiple groups of instructions added at specified locations and each corresponding to one of the specified types of events; and a trace information generator component capable of generating the trace information by executing the executable software program using multiple software threads on the first computer in such a manner that each of the added group of instructions are executed at least once, each execution of an added group of instructions corresponding to a specified type of event generating trace information of the type for that type of event.

90. A method for analyzing trace information generated during execution of multiple threads of a software program on a first computer, the generated trace information reflecting a series of events that occurred during the execution, the method comprising: receiving an indication of trace information generated during execution of an executable software program using multiple software threads on the first computer, the executable software program including multiple groups of instructions each corresponding to a specified type of event, the execution such that each of the multiple group of instructions are executed at least once and such that each execution of an added group of instructions generates trace information of a type corresponding to the specified type of event for that added group of instructions; and analyzing the generated trace information to extract execution information corresponding to the events that occurred during the execution

91. The method of claim 90 including producing the executable software program from software source code in such a manner that the multiple groups of instructions are added at the specified locations.

92. The method of claim 90 including generating the indicated trace information by executing the executable software program using multiple software threads on the first computer.

93. The method of claim 90 wherein, during the executing of the executable software program, the execution of each of the added groups of instructions adds current values of one or more variables to the generated trace information, the one or more variables maintained by the executing software program and/or by the first computer.

94. The method of claim 90 wherein the first computer has multiple processors that each have multiple protection domains each able to execute at least one of the multiple threads.

95. The method of claim 90 wherein the analyzing of the generated trace information includes using a description of the specified types of events to identify groups of trace information each reflecting execution of a group of added instructions corresponding to those specified types of events.

96. The method of claim 95 wherein the analyzing of the trace information includes, when information for an event in the generated trace information is separated in multiple non-contiguous portions, creating a mapping to assist in retrieving the information for the event.

97. The method of claim 96 including using the identified groups of trace information and the created mapping to extract current values for execution information.

98. The method of claim 90 wherein the analyzing of the trace information includes modifying current values of events so as to be normalized with respect to beginning of the execution.

99. The method of claim 90 wherein the analyzing of the trace information includes modifying current values of events so that the values reflect only the execution of the multiple threads.
100. The method of claim 90 including using information retrieved from created descriptor objects in the generated trace information.
101. A computer-readable medium whose contents cause a computing device to analyze trace information generated during execution of multiple threads of a software program on a first computer, the generated trace information reflecting a series of events that occurred during the execution, by: receiving an indication of trace information generated during execution of an executable software program using multiple software threads on the first computer, the executable software program including multiple groups of instructions each corresponding to a specified type of event, the execution such that each of the multiple group of instructions are executed at least once and such that each execution of an added group of instructions generates trace information of a type corresponding to the specified type of event for that added group of instructions; and analyzing the generated trace information to extract execution information corresponding to the events that occurred during the execution
102. A computing device for analyzing trace information generated during execution of multiple threads of a software program on a first computer, the generated trace information reflecting a series of events that occurred during the execution, comprising: an input receiver component capable of receiving an indication of trace information generated during execution of an executable software program using multiple software threads on the first computer, the executable software program including multiple groups of instructions each corresponding to a specified type of event, the execution such that each of the multiple group of instructions are executed at least once and such that each execution of an added group of instructions generates trace information of a type corresponding to the specified type of event for that added group of instructions; and a trace information analysis component capable of analyzing the generated trace information to extract execution information corresponding to the events that occurred during the execution.
103. A method for analyzing trace information generated during execution of multiple threads of a software program on a first computer, the generated trace information reflecting a series of events that occurred during the execution, the method comprising: receiving an indication of execution information extracted from trace information generated during execution of an executable software program using multiple software threads on the first computer, the execution information corresponding to events that occurred during the execution, the executable software program including multiple groups of instructions each corresponding to a specified type of event, the execution such that each of the multiple group of instructions are executed at least once and such that each execution of an added group of instructions generates trace information of a type corresponding to the specified type of event for that added group of instructions; and for each of multiple periods of time, presenting to a user an indication of the extracted execution information for that period of time.
104. The method of claim 103 including analyzing the generated trace information to extract execution information.
105. The method of claim 103 wherein the presenting to the user includes displaying a graph including the indications.
106. The method of claim 103 wherein the indicated extracted execution information includes an indication of a determined number of executed instructions for each of the periods of time and an indication of a determined number of available instruction slots for each of the periods of time.
107. The method of claim 106 wherein the displayed indication of the determined number of available instruction slots for each period of time includes a displayed indication of a calculated number of instruction holes that are attributable to the executing software program during the period of time, with the calculated number of instruction holes displayed in such a manner that a user can visually aggregate the displayed indication of the determined number of executed instructions for that period of time with the displayed indication of calculated number of instruction holes for that period of time.
108. The method of claim 106 wherein the displayed graph includes a time-based axis, and wherein the displayed indications of the determined number of executed instructions and the determined number of available instruction slots for each of the periods of time are points on the graph.
109. The method of claim 108 wherein the displayed graph includes an origin with at least two axes, and including, after the displaying of the indications of the determined number of executed instructions and of the determined number of available instruction slots, redefining at least one of the axes based on a new indicated displayed location.
110. The method of claim 103 including, for at least one of the periods of time, presenting an indication of a logical code block of the executable software program that was executing during that period of time.
111. The method of claim 110 wherein the presented indication of the logical code block is a name of the logical code block.
112. The method of claim 110 wherein the presented indication of the logical code block is source code of the logical code block.
113. The method of claim 110 wherein the logical code block is a function.
114. The method of claim 103 wherein the extracted execution information includes values of one or more variables maintained by the executing software program and/or by the first computer, and including presenting at least some of the included variable values in a tabular format.
115. The method of claim 103 wherein a number of processors of the first computer that are identified during each of the periods of time is greater than one, and wherein a determined number of available instruction slots for each of the periods of time is the identified number of processors for that period of time.
116. The method of claim 103 wherein an identified number of processors of the first computer for at least one of the periods of time is greater than one, and wherein information for each of the processors is aggregated during the presenting of information for those periods of time.
117. The method of claim 116 including normalizing the information for each of the processors prior to the presenting of the information by determining a value for an instruction holes counter of each of the processors at a beginning of the execution of the software program and by subtracting the determined value for each instruction hole counter from each later value of that instruction hole counter.
118. The method of claim 103 wherein information to be presented is determined based on a specification provided by a user.
119. The method of claim 118 wherein the specification is used at the time of the presenting to determine the information to be presented.
120. The method of claim 103 including presenting information about memory references performed during at least one of the periods of time.
121. The method of claim 103 including presenting information about FLOPS performed during at least one of the periods of time.
122. The method of claim 103 including presenting information about a number of the threads in existence during at least one of the periods of time.
123. The method of claim 103 including presenting information about a number of the threads that are blocked during at least one of the periods of time.
124. The method of claim 103 including presenting information about a number of the threads that are ready for execution during at least one of the periods of time.
125. The method of claim 103 including presenting information about contention for locks during at least one of the periods of time.
126. The method of claim 103 wherein the presenting of information is in response to a request by a user.
127. The method of claim 103 wherein the presenting of information is performed automatically without user intervention.
128. A computer-readable medium whose contents cause a computing device to analyze trace information generated during execution of multiple threads of a software program on a first computer, the generated trace information reflecting a series of events that occurred during the execution, by: receiving an indication of execution information extracted from trace information generated during execution of an executable software program using multiple software threads on the first computer, the execution information corresponding to events that occurred during the execution, the executable software program including multiple groups of instructions each corresponding to a specified type of event, the execution such that each of the multiple group of instructions are executed at least once and such that each execution of an added group of instructions generates trace information of a type corresponding to the specified type of event for that added group of instructions; and for each of multiple periods of time, presenting to a user an indication of the extracted execution information for that period of time.
129. A computing device for analyzing trace information generated during execution of multiple threads of a software program on a first computer, the generated trace information reflecting a series of events that occurred during the execution, comprising: an input receiver component capable of receiving an indication of execution information extracted from trace information generated during execution of an executable software program using multiple software threads on the first computer, the execution information corresponding to events that occurred during the execution, the executable software program including multiple groups of instructions each corresponding to a specified type of event, the execution such that each of the multiple group of instructions are executed at least once and such that each execution of an added group of instructions generates trace information of a type corresponding to the specified type of event for that added group of instructions; and a trace information presenter component capable of, for each of multiple periods of time, presenting to a user an indication of the extracted execution information for that period of time.

Description



CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application is a continuation of U.S. patent application Ser. No. 09/221,005 filed Dec. 23, 1998, entitled "PARALLELISM PERFORMANCE ANALYSIS BASED ON EXECUTION TRACE INFORMATION," which has been allowed. The application is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

[0002] The present invention relates generally to analyzing the performance of the execution of a program, and more particularly to analyzing the degree and efficiency of the parallelism during the execution.

BACKGROUND OF THE INVENTION

[0003] Parallel computer architectures generally provide multiple processors that can each be executing different tasks simultaneously. One such parallel computer architecture is referred to as a multithreaded architecture (MTA). The MTA supports not only multiple processors but also multiple streams executing simultaneously in each processor. The processors of an MTA computer are interconnected via an interconnection network. Each processor can communicate with every other processor through the interconnection network. FIG. 1 provides a high-level overview of an MTA computer system 100. Each processor 101 is connected to the interconnection network and memory 102. Each processor contains a complete set of registers 101a for each stream such that the register values at any given time indicate the current stream state. In addition, each processor also supports multiple protection domains, each with counters reflecting the current protection domain state 101b, so that multiple user programs can be executing simultaneously within that processor. Each processor may also have processor-specific counters reflecting the current processor state 101c. The computer system also includes various input devices 105, a display device 110, and a permanent storage device 120.

[0004] Each MTA processor can execute multiple threads of execution simultaneously. Each thread of execution executes on one of the 128
streams supported by an MTA processor. Every clock cycle, the processor selects a stream that is ready to execute and allows it to issue its next instruction. Instruction interpretation is pipelined by the processor, the network, and the memory. Thus, a new instruction from a different stream may be issued in each cycle time period without interfering with other instructions that are in the pipeline. When an instruction finishes, the stream to which it belongs becomes ready to execute the next instruction. Each instruction may contain up to three operations (i.e., a memory reference operation, an arithmetic operation, and a control operation) that are executed simultaneously.

[0005] The state of a stream includes one 64-bit Stream Status Word ("SSW"), 32 64-bit General Registers ("R0-R31"), and eight 32-bit Target Registers ("T0-T7"). Each MTA processor has 128 sets of SSWs, of general registers, and of target registers. Thus, the state of each stream is immediately accessible by the processor without the need to reload registers when an instruction of a stream is to be executed.

[0006] The MTA uses program addresses that are 32 bits long. The lower half of an SSW contains the program counter ("PC") for the stream. The upper half of the SSW contains various mode flags (e.g., floating point rounding, lookahead disable), a trap disable mask (e.g., data alignment and floating point overflow), and the four most recently generated condition codes. The 32 general registers are available for general-purpose computations. Register R0 is special, however, in that it always contains a 0. The loading of register R0 has no effect on its contents. The instruction set of the MTA processor uses the eight target registers as branch targets. However, most control transfer operations only use the low 32 bits to determine a new PC. One target register (T0) points to the trap handler, which may be an unprivileged routine. When the trap handler is invoked, the trapping stream starts executing instructions at the program location indicated by register T0. Trap handling is thus lightweight and independent of the operating system ("OS") and other streams, allowing the processing of traps to occur without OS interaction.

[0007] Each MTA processor supports as many as 16 active protection domains that define the program memory, data memory, and number of streams allocated to the computations using that processor. The operating system typically executes in one of the domains, and one or more user programs can execute in the other domains. Each executing stream is assigned to a protection domain, but which domain (or which processor, for that matter) need not be known by the user program. Each task (i.e., an executing user program) may have one or more threads simultaneously executing on streams assigned to a protection domain in which the task is executing.

[0008] The MTA divides memory into program memory, which contains the instructions that form the program, and data memory, which contains the data of the program. The MTA uses a program mapping system and a data mapping system to map addresses used by the program to physical addresses in memory. The mapping systems use a program page map and a data segment map. The entries of the data segment map and program page map specify the location of the segment in physical memory along with the level of privilege needed to access the segment.

[0009] The number of streams available to a program is regulated by three quantities slim, scur, and sres associated with each protection domain. The current numbers of streams executing in the protection domain is indicated by scur; it is incremented when a stream is created and decremented when a stream quits. A create can only succeed when the incremented scur does not exceed sres, the number of streams reserved in the protection domain. The operations for creating, quitting, and reserving streams are unprivileged. Several streams can be reserved simultaneously. The stream limit slim is an operating system limit on the number of streams the protection domain can reserve.

[0010] When a stream executes a CREATE operation to create a new stream, the operation increments scur, initializes the SSW for the new stream based on the SSW of the creating stream and an offset in the CREATE operation, loads register (T0), and loads three registers of the new stream from general purpose registers of the creating stream. The MTA processor can then start executing the newly created stream. A QUIT operation terminates the stream that executes it and decrements both sres and scur. A QUIT_PRESERVE operation only decrements scur, which gives up a stream without surrendering its reservation.

[0011] The MTA supports four levels of privilege: user, supervisor, kernel, and IPL. The IPL level is the highest privilege level. All levels use the program page and data segment maps for address translation, and represent increasing levels of privilege. The data segment map entries define the minimum levels needed to read and write each segment, and the program page map entries define the exact level needed to execute from each page. Each stream in a protection domain may be executing at a different privileged level.

[0012] Two operations are provided to allow an executing stream to change its privilege level. A "LEVEL_ENTER lev" operation sets the current privilege level to the program page map level if the current level is equal to lev. The LEVEL_ENTER operation is located at every entry point that can accept a call from a different privilege level. A trap occurs if the current level is not equal to lev. The "LEVEL_RETURN lev" operation is used to return to the original privilege level. A trap occurs if lev is greater than the current privilege level.

[0013] An exception is an unexpected condition raised by an event that occurs in a user program, the operating system, or the hardware. These unexpected conditions include various floating point conditions (e.g., divide by zero), the execution of a privileged operation by a non-privileged stream, and the failure of a stream create operation. Each stream has an exception register. When an exception is detected, then a bit in the exception register corresponding to that exception is set. If a trap for that exception is enabled, then control is transferred to the trap handler whose address is stored in register T0. If the trap is currently disabled, then control is transferred to the trap handler when the trap is eventually enabled, assuming that the bit is still set in the exception register. The operating system can execute an operation to raise a domain_signal exception in all streams of a protection domain. If the trap for the domain_signal is enabled, then each stream will transfer control to its trap handler.

[0014] Each memory location in an MTA computer has four access state bits in addition to a 64-bit value. These access state bits allow the hardware to implement several useful modifications to the usual semantics of memory reference. These access state bits are two data trap bits, one full/empty bit, and one forward bit. The two data trap bits allow for application-specific lightweight traps, the forward bit implements invisible indirect addressing, and the full/empty bit is used for lightweight synchronization. The behavior of these access state bits can be overridden by a corresponding set of bits in the pointer value used to access the memory. The two data trap bits in the access state are independent of each other and are available for use, for example, by a language implementer. If a trap bit is set in a memory location, then an exception will be raised whenever that location is accessed if the trap bit is not disabled in the pointer. If the corresponding trap bit in the pointer is not disabled, then a trap will occur.

[0015] The forward bit implements a kind of "invisible indirection." Unlike normal indirection, forwarding is controlled by both the pointer and the location pointed to. If the forward bit is set in the memory location and forwarding is not disabled in the pointer, the value found in the location is interpreted as a pointer to the target of the memory reference rather than the target itself. Dereferencing continues until either the pointer found in the memory location disables forwarding or the addressed location has its forward bit cleared.

[0016] The full/empty bit supports synchronization behavior of memory references. The synchronization behavior can be controlled by the full/empty control bits of a pointer or of a load or store operation. The four values for the full/empty control bits are shown below.

1
VALUE MODE LOAD STORE 0 normal read regardless write regardless and set full 1 reserved reserved 2 future wait for full wait for full and leave full and leave full 3 sync wait for full wait for empty and set empty and set full

[0017] When the access control mode (i.e., synchronization mode) is future, loads and stores wait for the full/empty bit of the memory location to be accessed to be set to full before the memory location can be accessed. When the access control mode is sync, loads are treated as "consume" operations and stores are treated as "produce" operations. A load waits for the full/empty bit to be set to full and then sets the full/empty bit to empty as it reads, and a store waits for the full/empty bit to be set to empty and then sets the full/empty bit to full as it writes. A forwarded location (i.e., its forward bit is set) that is not disabled (i.e., by the access control of a pointer) and that is empty (i.e., full/empty bit is set to empty) is treated as "unavailable " until its full/empty bit is set to full, irrespective of access control.

[0018] The full/empty bit may be used to implement arbitrary indivisible memory operations. The MTA also provides a single operation that supports extremely brief mutual exclusion during "integer add to memory." The FETCH_ADD operation loads the value from a memory location, returns the loaded value as the result of the operation, and stores the sum of that value and another value back into the memory location.

[0019] Each protection domain has a retry limit that specifies how many times a memory access can fail in testing full/empty bit before a data blocked exception is raised. If the trap for the data blocked exception is enabled, then a trap occurs. The trap handler can determine whether to continue to retry the memory access or to perform some other action. If the trap is not enabled, then the next instruction after the instruction that caused the data blocked exception is executed.

[0020] A speculative load occurs typically when a compiler generates code to issue a load operation for a data value before it is known whether the data value will actually be accessed by the program. The use of speculative loads helps reduce the memory latency that would result if the load operation was only issued when it was known for sure whether the program actually was going to access the data value. Because a load is speculative in the sense that the data value may not actually be needed by the program, it is possible that a speculative load will load a data value that the program does not actually use. The following statements indicate program statement for which a compiler may generate a speculative load:

[0021] if i<N

[0022] x--buffer[i]

[0023] endif

[0024] The following statement illustrates the speculative load that is placed before the "if" statement.

[0025] r=buffer[i]

[0026] if i<N

[0027] x=r

[0028] endif

[0029] The compiler has generated code to load the data value for buffer[i] into a general register "r" and placed it before the code generated for the "if" statement condition. The load of the data value could cause an exception, such as if the index i was so large that an invalid memory location was being accessed. However, the necessity of this exception is uncertain at that time since the invalid memory location will not be accessed by the original code unless the "if" statement condition is satisfied (i.e., i<N). Even if the "if" statement condition is satisfied, the exception would not have occurred until a later time. To prevent a speculative load from causing an incorrect exception to occur or occur too early, the MTA has a "poison" bit for each general register. Whenever a load occurs, the poison bit is set or cleared depending on whether an exception would have been raised. If the data in a general register is then used while the corresponding poison bit is set, then an exception is raised at the time of use. In the above example, the "r=buffer[i]" statement would not raise an exception, but would set the corresponding poison bit if the address is invalid. An exception, however, would be raised when the "x=r" statement is executed accessing that general register because its poison bit is set. The deferring of the exceptions and setting of the poison bits can be disabled by a speculative load flag in the SSW.

[0030] The upper 32-bits of the 64-bit exception register contain the exception flags, and the lower 32 bits contain the poison bits. Bits 40-44 contain the flags for the user exceptions, which include a create stream exception, a privileged instruction exception, a data alignment exception, and a data blocked exception. A data blocked exception is raised when a data memory retry exception, a trap 0 exception, or a trap 1 exception is generated. The routine that is handling a data blocked exception is responsible for determining the cause of the data blocked exception. The exception register contains one poison bit for each of the 32 general registers. If the poison bit is set, then an attempt to access the content of the corresponding register will raise an exception.

[0031] The lower 32 bits of the 64-bit SSW contain the PC, bits 32-39
contain mode bits, bits 40-51 contain a trap mask, and bits 52-63 contain the condition codes of the last four instructions executed. Bit 37 within the mode bits indicates whether speculative loads are enabled or disabled. Bit 48 within the trap mask indicates whether a trap on a user exception is enabled (corresponding to bits 40-44 of the exception register). Thus, traps for the user exceptions are enabled or disabled as a group.

[0032] Each word of memory contains a 64-bit value and a 4-bit access state. The 4-bit access state is described above. When the 64-bit value is used to point to a location in memory, it is referred to as a "pointer." The lower 48 bits of the pointer contains the address of the memory location to be accessed, and the upper 16 bits of the pointer contain access control bits. The access control bits indicate how to process the access state bits of the addressed memory location. One forward disable bit indicates whether forwarding is disabled, two full/empty control bits indicate the synchronization mode; and four trap 0 and 1 disable bits indicate whether traps are disabled for stores and loads, separately. If the forward disable bit is set, then no forwarding occurs regardless of the setting of the forward enable bit in the access state of the addressed memory location. If the trap 1 store disable bit is set, then a trap will not occur on a store operation, regardless of the setting of the trap 1 enable bit of the access state of the addressed memory location. The trap 1 load disable, trap 0 store disable, and trap 0 load disable bits operate in an analogous manner. Certain operations include a 5-bit access control operation field that supersedes the access control field of a pointer. The 5-bit access control field of an operation includes a forward disable bit, two full/empty control bits, a trap 1 disable bit, and a trap 0 disable bit. The bits effect the same behavior as described for the access control pointer field, except that each trap disable bit disables or enables traps on any access and does not distinguish load operations from store operations.

[0033] When a memory operation fails (e.g., a synchronized access failure), an MTA processor saves the state of the operation. A trap handler can access that state. That memory operation can be redone by executing a redo operation (i.e., DATA_OP_REDO) passing the saved state as parameters of the operation. After the memory operation is redone (assuming it does not fail again), the trapping stream can continue its execution at the instruction after the trapping instruction.

[0034] The appendix contains the "Principles of Operation" of the MTA, which provides a more detailed description of the MTA.

[0035] While the use of a multithreaded architecture provides various benefits, the architecture also adds various complexities to conducting performance analysis of executing tasks. Such performance analysis attempts to quantify various performance measures that indicate how efficiently computer system resources are utilized during execution (e.g., processor utilization) as well as other measures related to the execution (e.g., memory latency, total execution time, or the number and rate of executed FLOPS, memory references, or invocations of a particular function).

[0036] When a task executes on a multithreaded architecture, a variety of additional parallelism performance measures are available to be measured and tracked. For example, it may be of interest to have information related to the threads for the task, such as the number of task threads executing, the number of task threads blocked, the number of task threads ready and waiting to be executed, and the number of threads contending for a lock. Similarly, it may be of interest to track information related to the one or more protection domains in which the task is executing (e.g., the total number of instructions issued in each protection domain), to the streams allocated to the one or more protection domains (e.g., the number of streams allocated to the protection domain), and to the one or more processors executing the task (e.g., the number of streams ready to be executed at each cycle). In addition, parallelism information about which regions of the task source code were parallelized (i.e., executed by different simultaneously executing threads) during execution and the degree of parallelism (i.e., how many different threads were simultaneously executing in how many different protection domains) for those regions may be of interest.

[0037] Various techniques have been used to assist in performance analysis. One such technique, referred to as profiling, attempts to determine how many times each source code statement is executed. Such information allows user attention to be directed to manually optimizing the portions of the source code that are most often executed. However, such analysis is typically concerned only with minimizing the total execution time of the task, and does not address any of the performance analysis issues related specifically to multithreaded architectures and parallelism.

[0038] Another technique useful for performance analysis involves generating during execution of the task various execution trace information that is related to different performance measures, referred to as tracing the task or as tracing the source code for the task. One method of generating such trace information is to have instructions in the source code that when executed will output information to a trace information file. This trace information file can then be examined after execution of the task has completed. For example, to estimate the amount of time spent executing a function, instructions before and after invocations of the function can write out the current time to the trace information file.

[0039] One factor complicating performance analysis is that many computer systems do not directly provide information about many types of performance measures, such as the number of phantoms for a processor (i.e., a hole in the instruction pipeline such that an instruction is not executed during a processor cycle) or the number of memory references that occur. It is even less likely for computer systems to directly provide execution information about parallelism performance measures such as parallelized regions and the degree of parallelism. Thus, generating accurate performance measure information is problematic, particularly with respect to parallelism such as that present on multithreaded architectures.

SUMMARY OF THE INVENTION

[0040] Some embodiments of the present invention provide a method and system for conducting performance analysis for task execution. The analysis involves generating a variety of trace information related to performance measures, including parallelism-related information, during execution of the task. In order to generate the trace information, target source code of interest is compiled in such a manner that executing the resulting executable code will generate execution trace information composed of a series of events. Each event stores trace information related to a variety of performance measures for the one or more processors and protection domains used. After the execution trace information has been generated, the system can use that trace information and a trace information description file to produce useful performance measure information. The trace information description file contains information that describes the types of execution events as well as the structure of the stored information. The system uses the trace information description file to organize the information in the trace information file, extracts a variety of types of performance measure information from the organized trace information, and formats the extracted information for display. The system can use default or user-defined functions to extract and format trace information for display. After the system displays one or more types of performance measure information, a user of the system can then interact with the system in a variety of ways to obtain other useful performance analysis information.

BRIEF DESCRIPTION OF THE DRAWINGS

[0041] FIG. 1 provides a high-level overview of an MTA computer, with each processor 101 connected to the interconnection network and memory 102.

[0042] FIG. 2 illustrates an embodiment of the system of the present invention for generating and displaying execution performance measure information.

[0043] FIG. 3 is a block diagram illustrating hardware support for gathering performance measure information for each processor on which a task is executing.

[0044] FIGS. 4A, 4B, and 4C illustrate examples of a Trace Information Description File, an Execution Trace Information File, and a Trace Information Display Functions File respectively.

[0045] FIGS. 5A-5I are example user interface screens displaying various parallelism performance analysis data.

[0046] FIG. 6 is a flow diagram of an embodiment of the Create Traceable Executable Code routine.

[0047] FIGS. 7A and 7B are flow diagrams of an embodiment of a Compiler subroutine supporting a trace option.

[0048] FIG. 8 is a flow diagram of an embodiment of the Display Trace Information From Execution Trace Information File routine.

[0049] FIG. 9 is a flow diagram of an embodiment of the Process Execution Trace Information File To Extract Trace Information To Be Displayed subroutine.

[0050] FIGS. 10A and 10B are flow diagrams of an embodiment of the Display Trace Information subroutine.

DETAILED DESCRIPTION OF THE INVENTION

[0051] An embodiment of the present invention provides a method and system for conducting performance analysis for executing tasks. The analysis is made possible by generating a variety of performance measure information, including parallelism-related information, during execution of the task. In particular, target source code of interest is compiled in such a manner that executing the resulting target executable code will generate execution trace information related to a variety of performance measures for the one or more processors and protection domains used. The compiler will add a variety of instructions to the code such that execution of the added instructions will generate the trace information, including using any hardware-supported mechanisms (e.g., counters and registers) for retrieving trace information.

[0052] After the execution trace information has been generated, the Trace Information Displayer (TID) system can use that trace information, along with a trace information description file, to produce useful performance measure information. The trace information description file contains information that describes the types of execution events that store trace information (e.g., an entry point into a function or the beginning of a parallelized region) as well as the structure of the information that will be stored for such events (e.g., the first 4 bytes are the value of hardware counter X, the next 4 bytes are the ID for the thread that generated the event, etc.). This description information allows the TID system to assign conceptual meaning to the raw trace information, and to extract performance measure information of interest.

[0053] After the TID system uses the trace information description file to define the types of information present in the trace information file, the system divides the raw trace information into groups corresponding to the events (e.g., an event indicating an entry point into a particular function) and assigns meaning to the raw trace information generated for each such event (e.g., the first 8 bytes of this event hold the name of the particular function). The TID system can also organize the trace information, such as by ensuring that the events are in chronological order (e.g., to adjust for some event trace information being buffered longer than other event trace information before being written to a trace information file).

[0054] The TID system next extracts a variety of types of performance measure information from the organized trace information, and formats the extracted information for display. For example, it may be of interest to graph the increasing values over the course of the task execution for a particular cumulative hardware counter, or to instead graph the corresponding rate of change for the cumulative counter. In either case, the TID system may produce a series of X-Y points for the hardware counter value, using time (or processor clock cycles) for the X axis and using the value of the counter for the Y axis. The TID system can use default or user-defined functions to extract and format trace information for display.

[0055] After the TID system displays one or more types of performance measure information, a user of the system can then interact with the system in a variety of ways to obtain other useful performance analysis information. For example, the user can select a displayed point corresponding to an event, and have the TID system display the raw trace data for the event. Alternately, the TID system can annotate one or more displayed events with relevant information, such as the name of the corresponding source code function being executed. The TID system can also map trace information events back to the source code that was executing when the event occurred, and can display the corresponding source code to the user. The displayed source code can graphically illustrate parallelism information such as regions of the source code that were automatically parallelized by the compiler as well as the reason the compiler was unable to parallelize other regions.

[0056] In addition to providing event definitions which enable conceptual meaning to be assigned to the raw trace information, the trace information description file can also provide a variety of other types of information for the defined events. For example, the trace information description file can specify print formats and tabular display formats for each event type to be used when printing or displaying the raw data for an event of that type. In addition, the trace information description file can indicate relationships between different types of generated trace information to indicate that some trace information may be related to other trace information. For example, information that is common to multiple events may be stored in the trace information file separately from those events (to avoid the need to replicate the same information a number of times), and those events may reference the separately stored common information.

[0057] FIG. 2 illustrates a computer system 100 suitable for generating execution trace information for an executing task, and a client computer system 250 suitable for executing the TID system so as to generate and display performance measure information from the execution trace information. In the illustrated embodiment, a software developer has created the source code file 222 stored on the permanent storage device 120 of the computer system 100, and wishes to conduct performance analysis for the execution of the source code. In order to conduct performance analysis using the technique of the present invention, an executable version of the source code is generated that will produce appropriate trace information when executed. Those skilled in the art will appreciate that in alternate embodiments, a single computer system can be used to both generate execution trace information and to generate and display resulting performance measure information. In addition, those skilled in the art will appreciate that source code can be created in a variety of ways, such as via a basic text editor or via one of a variety of application development environments.

[0058] In order for the executing code to generate trace information, a variety of sample points will be added at various locations of interest in the source code. Each sample point will be of a specified type designed to sample the current values of a particular set of performance measure values. The sample points can be added to the source code in a variety of ways. For example, the developer can manually insert a sample point at any location of interest in the source code by adding an appropriate compiler directive. When the source code is compiled, the compiler directive will instruct the compiler to add the appropriate code for the sample point at that location.

[0059] In addition to any manually specified sample points, the compiler can also insert a variety of types of sample points automatically. For example, tracking when a function is entered or exited is often of interest, so the compiler can automatically add a sample point at the beginning of each function indicating the entry and at the end of each function indicating the exit. Alternately, only certain functions of interest may have sample points added, such as the top-level function (e.g., `main` for the C language) or all functions that contain at least a specified minimum number of source code statements.

[0060] In addition, compilers for multithreaded architectures will often attempt to identify and mark regions of code that can be parallelized. For example, a region of code that loops multiple times with sequential values of a variable (e.g., `loop for x from 1 to 10`) may be able to be divided among multiple processors so that each instance of the loop (with a different value of the variable) can be executed by a different thread simultaneously. After the compiler has identified parallelizable regions, the compiler can add appropriate sample points at locations of interest within the parallelizable regions, such as at the beginning of the region (i.e., the fork), at the end of the region (i.e., the join), and at any sections which serve as synchronization points such that multiple threads executing the code will wait at the synchronization point for other threads to reach the same point (i.e., barriers).

[0061] Thus, when the developer wishes to create a traceable version of the executable code that generates trace information corresponding to a variety of sample points, the developer supplies the source code file 222
to the compiler 230 executing in memory 102 and indicates a trace option to the compiler to instruct the compiler to generate traceable executable code. The developer can also supply to the compiler a variety of other compiling instructions via one or more input devices 105. These other compiling instructions can instruct the compiler as to the types of sample points to be added (e.g., function entry and exit sample points, but not parallelizable region sample points) as well as indications of particular portions of interest in the source code (e.g., only functions with at least 50 statements).

[0062] After receiving the source code and any compiling instructions, the compiler generates the traceable executable code file 224 stored on the permanent storage. In the illustrated embodiment, the various sample points are implemented by having the compiler add appropriate instructions to the source code at each sample point location before the source code is compiled. When executed, these added instructions will determine the hardware-related and software-related values of interest, and will write an event containing those values to an execution trace information file. Thus, each event in the execution trace information file will be of a specified type that corresponds to the sample point that creates the event. As described previously, in the illustrated embodiment a trace information description file describes the types of execution events and their corresponding structure. Those skilled in the art will appreciate that in some embodiments the definitions of the events from the trace information description file merely reflect pre-defined event types that are provided by the compiler, while in other embodiments the compiler can use the information in the trace information description file to define the possible event types and to determine the appropriate instructions to be added to generate such events.

[0063] Those skilled in the art will appreciate that the instructions could alternately be added after the source code had been compiled, and that the added instructions could produce trace information other than by writing the information to a file (e.g., by supplying the information to another process via inter-process communication). In addition, those skilled in the art will appreciate that the compiler will, at compile time, have access to a variety of information which can be specified in the instructions added to the source code, such as the symbolic names of functions and variables related to the added instructions. In addition, the compiler can add information to the generated executable code to assist in creating a mapping between the compiled instructions and their corresponding source code. This mapping can assist in determining the source code that corresponds to a particular value of the PC during execution.

[0064] Those skilled in the art will also appreciate that the instructions added for a single sample point may create multiple events if the source code corresponding to the sample point is executed multiple times (e.g., if it is within an invokable function or within a loop). In such situations, some of the information for the multiple events may be common to all of the events, such as static information that does not change for different executions of the sample point (e.g., the name of the function in which the sample point is located and the PC corresponding to the sample point). In other situations, events that are generated from different sample points may also share common information.

[0065] In some embodiments, such common information is redundantly generated for each such event and is separately written to the execution trace information file. In alternate embodiments, the common information is written to the execution trace information file separately from the specific events, and thus need be written only once. Each such group of separately stored common information is referred to as a descriptor object, and is referred to by the various events which share the information. In these embodiments, the descriptor objects are treated as if they are part of the structure of the various events. When the common information is written to the execution trace information file separately from the specific events, the instructions added by the compiler at each sample point generate the event-specific information that is not common to multiple events.

[0066] Thus, other instructions must be added to the executable code to generate the descriptor objects. Moreover, it is useful if the descriptor objects are stored earlier in the execution trace information file than the event-specific information, so that when a event-specific reference to a descriptor object is located the TID system can immediately calculate the offset from the reference to the earlier descriptor object to facilitate retrieving information from the descriptor object for the event. The compiler therefore additionally generates a series of instructions corresponding to each set of common information, and adds those series of instructions to the source code before the source code corresponding to the initial execution of the top-level function. Those instructions will be executed before any event-specific instructions within the source code, and the common information will thus be present at the beginning of the trace information file. After the compiler has added the various appropriate instructions for the sample points, the compiler then compiles the source code with the added instructions in the normal manner.

[0067] In the illustrated embodiment, each processor 101 has multiple protection domains 240 in which different processes can simultaneously execute, and a process executing in a protection domain can have multiple threads simultaneously executing on multiple streams assigned to the protection domain. If the compiler identified at least one parallelizable region for the traceable executable code file, then portions of the traceable executable code can be loaded into multiple protection domains on different processors when the code is executed. If so, the traceable executable code will execute as a task spread across multiple protection domains. Each group of threads for a process in a protection domain is referred to as a team, with the executing task thus capable of being composed of teams of threads executing in multiple protection domains on multiple processors.

[0068] When an executing thread executes the instructions added for a sample point, the execution of the instructions will cause an event to be added to an execution trace information file. As previously described, each event can consist of multiple pieces of trace information of a variety of types. In the illustrated embodiment, the trace information for each event is written out atomically such that all trace information for an event will be added to the execution trace information before trace information for another event is added. Those skilled in the art will appreciate that this can be accomplished in a variety of ways, such as buffering output trace information before it is written to the file and waiting to write the buffer until all trace information for each event in the buffer is available. Thus, as the traceable executable code file 224 executes in the multiple protection domains 240, corresponding event information is written to the execution trace information file 225
on the permanent storage. An exemplary execution trace information file is discussed in greater detail with respect to FIG. 4A.

[0069] In addition to the trace information produced by the execution of the traceable executable code, it is also possible for other executing processes or tasks to generate execution trace information related to the execution of the traceable executable code. For example, the operating system for one or more of the processors may have information relevant to the parallelism performance analysis of the traceable executable code that is not available to the traceable executable code itself. For example, creating additional threads for an executing process, swapping a process in and out of a protection domain, or changing the security level for a process may all be occurrences of interest. In some embodiments, the user-level traceable executable code may perform some or all of these activities directly, and if so can have sample points to record trace information related to those occurrences. In alternative embodiments, the operating system may perform some or all of these types of tasks, and if so the operating system can add corresponding trace information to the execution trace information file 225 at the time of these occurrences. In the illustrated embodiment, a protection domain 245 on each processor is devoted to providing operating system functionality to other processes executing on the processor.

[0070] It is also possible for processes other than the operating system to generate execution trace information for the traceable executable code. For example, it may be desirable to ensure that trace information is generated for the traceable executable code at regular time intervals (e.g., every 256 cycles). However, the execution time required for various portions of the traceable executable code can be unpredictable, so merely adding sample points to the traceable executable code may not be sufficient to ensure that the trace information is generated at the desired regular intervals. Instead, it is possible to create a background daemon that executes as a thread (not shown) and that monitors the protection domains 240 in which the task is executing. Such a daemon can periodically sample values of relevant performance measures and write a corresponding event to the execution trace information file 225.

[0071] After the execution trace information file 225 has been created, the trace information can be analyzed and displayed to the developer or another user to provide feedback related to the performance of the task execution. The TID system performs this analysis and provides an interactive interface which can display a variety of types of information to the developer. Thus, in the illustrated embodiment the execution trace information file 225 and the trace information description file 227 are supplied to the TID system 260 executing in memory 252 of the client computer system 250.

[0072] The TID system first uses the trace information description file 227 to define the types of events which may be present in the execution trace information file. Those skilled in the art will appreciate that a variety of other alternate mechanisms are available to provide such information to the TID system, such as explicitly adding such information to each event in the execution trace information file (e.g., with a header) or by hard-coding the types of events and their structures in the TID system. If event descriptor objects are present in the trace information file, the events referencing the objects are also mapped to their corresponding objects. An exemplary trace information description file is discussed in greater detail with respect to FIG. 4B.

[0073] In order to provide more useful execution trace information, the TID system also performs a variety of normalization operations to ensure that the values for a particular performance measure are consistent with each other. For example, if hardware counters values are included in the trace information, such counters often begin incrementing upon the boot of the processor. Thus, the counters may already have large values when execution of the task begins. The TID system therefore normalizes all such counter values so that the beginning of the task execution corresponds to a value of zero, by subtracting the value of each counter at the beginning of the task execution from all later values for that counter. In addition, many counters will continue to increment even during periods when the task is not executing, such as when the task is swapped out of the protection domain (e.g., a counter of the number of instructions issued in the protection domain). Thus, the TID system normalizes such counter values so that they reflect only the execution of the task. Another normalization situation arises when information from multiple protection domains is present in the trace information file and when the information to be extracted for each event requires the current values of the hardware counters for all protection domains. Thus, if an event that occurred at one time in one protection domain does not have corresponding events from the other protection domains at the same time, the values for the hardware-supported counters in those other protection domains will be estimated based on available events for those protection domains.

[0074] After the execution trace information is analyzed by the TID system, the system can extract a variety of types of performance measure information and format the extracted information in an appropriate manner for display. In order to determine appropriate information to extract, the system uses one or more display functions which define the types of event information to extract from the raw trace information. Exemplary trace information display functions are discussed in greater detail with respect to FIG. 4C.

[0075] In the illustrated embodiment, the TID system can optionally receive a set of user-defined trace information display functions 229
that are stored on the permanent storage. If such display functions are received, they will be used to extract the information that will be initially displayed by the TID system. Alternately, if user-defined trace information display functions are not supplied, default display functions can be used. After the information of interest has been extracted and formatted for display, the information is graphically displayed on the display 254 of the client computer system 250. The developer or another user can then use the input/output devices 258 to interact with the TID system and to supply requests to modify the display and to display alternate types of information.

[0076] In the illustrated embodiment, the user of the TID system can request that a variety of types of trace information be displayed. For example, time-based graphs of counter values over the period of task execution or of the rate of change of such counters over the period may be displayed. The user can also request to see raw data for a selected event, to print a displayed graph or the trace information corresponding to a selected event, to redefine the origin of the x-axis for a displayed graph, to do source mapping for a selected event and show the source code executing when the event occurred, and to automatically add annotations related to events such as the names of functions corresponding to displayed sets of events.

[0077] Those skilled in the art will appreciate that the displayed computer systems are merely illustrative and are not intended to limit the scope of the present invention. The computer systems may contain additional components or may lack some illustrated components. For example, the TID system could execute without the use of the client computer system 250 by having the TID system execute in memory 102 and display results on display 110. Alternately, components currently shown as executing in memory 102 could instead execute on the client computer system 250, such as the compiler 230. Parallelism performance analysis can also be performed when the traceable executable code file executes in a single protection domain on a single processor if the task uses multiple streams and multiple threads within that protection domain. In addition, information could be provided to the client computer system 250
in a variety of ways other than from files on permanent storage 120, such as via inter-process communication. Accordingly, the present invention may be practiced with other computer system configurations.

[0078] FIG. 3 is a block diagram illustrating an embodiment of hardware components related to a processor on which a program to be traced is executing. In the illustrated embodiment, each processor provides hardware support to record a variety of types of execution information of interest, and to provide that information to an executing program. Thus, the instructions added to the traceable executable code can access current values for various hardware-supported information sources and can output that retrieved information as trace information.

[0079] In the illustrated embodiment, each processor 101 maintains a processor state 101c which includes a variety of current values related to the processor. The processor state includes a clock counter 312 that contains the current value of the clock (which is updated each cycle), a phantom counter 314 that records the number of phantoms that have occurred for the processor since some given start point (such as the last boot of the computer system), and a ready stream counter 316 that sums the total number of streams ready to be executed at each cycle of the processor. The values for each of the counters can be stored in a variety of ways, such as by dedicating a separate 64-bit counter for each value. In the illustrated embodiment, when multiple processors are being used the clocks on all processors are synchronized so that they agree in value. User-level instructions can be provided so that the traceable executable code can access the current values in any of the processor state counters.

[0080] Various information is also maintained for each stream in a stream state 101a for that stream. As previously described, one such type of stream information includes a stream status word 302 for each stream, with the lower 32