Table 19.2. Profile Diagnostics

| Group | Flag | Benefit | Cost | Freq. | Implemented |
|---|---|---|---|---|---|
| CONTAINERS | HASHTABLE_TOO_SMALL | 10 | 1 | 10 | yes |
| | HASHTABLE_TOO_LARGE | 5 | 1 | 10 | yes |
| | INEFFICIENT_HASH | 7 | 3 | 10 | yes |
| | VECTOR_TOO_SMALL | 8 | 1 | 10 | yes |
| | VECTOR_TOO_LARGE | 5 | 1 | 10 | yes |
| | VECTOR_TO_HASHTABLE | 7 | 7 | 10 | no |
| | HASHTABLE_TO_VECTOR | 7 | 7 | 10 | no |
| | VECTOR_TO_LIST | 8 | 5 | 10 | yes |
| | LIST_TO_VECTOR | 10 | 5 | 10 | no |
| | ORDERED_TO_UNORDERED | 10 | 5 | 10 | only map/unordered_map |
| ALGORITHMS | SORT | 7 | 8 | 7 | no |
| LOCALITY | SOFTWARE_PREFETCH | 8 | 8 | 5 | no |
| | RBTREE_LOCALITY | 4 | 8 | 5 | no |
| MULTITHREADED | FALSE_SHARING | 8 | 10 | 10 | no |
Switch: `_GLIBCXX_PROFILE_CONTAINERS`.
Diagnostic: HASHTABLE_TOO_SMALL. Sample runtime reduction: 36%. Code similar to the example below.
Recommendation: Set initial size to N at construction site S.
To instrument: unordered_set and unordered_map constructor, destructor, rehash.
Cost model: Number of individual rehash operations * cost per rehash.
Example:

```
1 unordered_set<int> us;
2 for (int k = 0; k < 1000000; ++k) {
3   us.insert(k);
4 }

foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
```
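Applying this advice amounts to sizing the container once, up front, so the insertion loop triggers no rehashes. A minimal sketch:

```cpp
#include <unordered_set>

// Pre-size the hashtable so the insertion loop never rehashes.
std::unordered_set<int> build_presized(int n) {
  std::unordered_set<int> us;
  us.reserve(n);  // one bucket allocation instead of repeated rehashing
  for (int k = 0; k < n; ++k)
    us.insert(k);
  return us;
}
```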
Diagnostic: HASHTABLE_TOO_LARGE. Recommendation: Set initial size to N at construction site S.
To instrument: unordered_set and unordered_map constructor, destructor, rehash.
Example:

```
1 vector<unordered_set<int>> v(100000, unordered_set<int>(100));
2 for (int k = 0; k < 100000; ++k) {
3   for (int j = 0; j < 10; ++j) {
4     v[k].insert(k + j);
5   }
6 }

foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N bytes of memory and M iteration steps.
```
Diagnostic: VECTOR_TOO_SMALL. Recommendation: Set initial size to N at construction site S.
Cost model: Total number of words copied * time to copy a word.
Example:

```
1 vector<int> v;
2 for (int k = 0; k < 1000000; ++k) {
3   v.push_back(k);
4 }

foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves copying 4000000 bytes and 20 memory allocations and deallocations.
```
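Applying this advice means reserving the final size so push_back never has to grow and copy the buffer. A minimal sketch:

```cpp
#include <vector>

// Reserve the final size up front: one allocation, no copying during the loop.
std::vector<int> build_presized(int n) {
  std::vector<int> v;
  v.reserve(n);  // push_back below never reallocates
  for (int k = 0; k < n; ++k)
    v.push_back(k);
  return v;
}
```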
Diagnostic: VECTOR_TOO_LARGE. Recommendation: Set initial size to N at construction site S.
Example:

```
1 vector<vector<int>> v(100000, vector<int>(100));
2 for (int k = 0; k < 100000; ++k) {
3   for (int j = 0; j < 10; ++j) {
4     v[k].push_back(k + j);
5   }
6 }

foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N bytes of memory and may reduce the number of cache and TLB misses.
```
Diagnostic: HASHTABLE_TO_VECTOR. Goal: Detect uses of unordered_set that can be substituted with vector to reduce execution time.
Fundamentals: Hashtable iterators are slower than vector iterators.
Example:

```
1 unordered_set<int> us;
  ...
2 int s = 0;
3 for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
4   s += *it;
5 }

foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N indirections and may achieve better data locality.
```
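After the substitution, the traversal-only workload becomes a linear scan over contiguous storage, with no bucket indirections. A minimal sketch:

```cpp
#include <vector>

// Same summation as the unordered_set loop, over contiguous storage.
long sum_elements(const std::vector<int>& v) {
  long s = 0;
  for (int x : v)
    s += x;  // no pointer chasing between elements
  return s;
}
```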
Diagnostic: VECTOR_TO_LIST. Goal: Detect cases where vector could be substituted with list for better performance.
Fundamentals: Inserting in the middle of a vector is expensive compared to inserting in a list.
Example:

```
1 vector<int> v;
2 for (int i = 0; i < 10000; ++i) {
3   v.insert(v.begin(), i);
4 }

foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000 operations.
```
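With a list, each front insertion is constant time, with no shifting of existing elements. A minimal sketch of the rewritten loop:

```cpp
#include <list>

// Front insertion into a list is O(1) per element.
std::list<int> build_front_inserted(int n) {
  std::list<int> l;
  for (int i = 0; i < n; ++i)
    l.insert(l.begin(), i);  // constant time; nothing is shifted
  return l;
}
```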
Diagnostic: LIST_TO_VECTOR. Goal: Detect cases where list could be substituted with vector for better performance.
Fundamentals: Iterating through a vector is faster than through a list.
Analysis: Issue the advice if there are no insert operations.
Example:

```
1 list<int> l;
  ...
2 int sum = 0;
3 for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
4   sum += *it;
5 }

foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect memory references.
```
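After the substitution, the traversal walks contiguous memory instead of chasing per-node pointers. A minimal sketch:

```cpp
#include <numeric>
#include <vector>

// Same summation, over contiguous storage rather than linked nodes.
long sum_contiguous(const std::vector<int>& v) {
  return std::accumulate(v.begin(), v.end(), 0L);
}
```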
Diagnostic: LIST_TO_FORWARD_LIST. Goal: Detect cases where list could be substituted with forward_list for better performance.
Analysis: Issue the advice if there are no backwards traversals or insertions before a given node.
Example:

```
1 list<int> l;
  ...
2 int sum = 0;
3 for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
4   sum += *it;
5 }

foo.cc:1: advice: Change "list" to "forward_list".
```
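forward_list stores one link per node instead of two, which is all a forward-only traversal needs. A minimal sketch of the rewritten loop:

```cpp
#include <forward_list>

// Forward-only traversal: forward_list suffices and saves one pointer per node.
int sum_forward(const std::forward_list<int>& l) {
  int sum = 0;
  for (int x : l)
    sum += x;
  return sum;
}
```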
Diagnostic: ORDERED_TO_UNORDERED. Goal: Detect cases where ordered associative containers can be replaced with unordered ones.
Fundamentals: Insert and search are quicker in a hashtable than in a red-black tree.
Example:

```
1 set<int> s;
2 for (int i = 0; i < 100000; ++i) {
3   s.insert(i);
4 }
5 int sum = 0;
6 for (int i = 0; i < 100000; ++i) {
7   sum += *s.find(i);
8 }
```
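Since this code never relies on element order, an unordered_set gives expected O(1) insert and find instead of O(log n). A minimal sketch of the rewritten workload:

```cpp
#include <unordered_set>

// Same insert-then-find workload, using a hashtable instead of a tree.
long insert_and_sum(int n) {
  std::unordered_set<int> s;
  for (int i = 0; i < n; ++i)
    s.insert(i);
  long sum = 0;
  for (int i = 0; i < n; ++i)
    sum += *s.find(i);  // expected O(1) lookup
  return sum;
}
```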
Switch: `_GLIBCXX_PROFILE_ALGORITHMS`.
Diagnostic: SORT. Fundamentals: See the papers "A framework for adaptive algorithm selection in STAPL" and "Optimizing Sorting with Machine Learning Algorithms".
Sample runtime reduction: 60%.
Recommendation: Change sort algorithm at site S from X Sort to Y Sort.
To instrument: sort algorithm.
Analysis: Issue the advice if the cost model tells us that another sort algorithm would do better on this input. This requires knowing which algorithm the release-mode sort implementation uses.
Cost model: Runtime(algo) for algo in [radix, quick, merge, ...]
Example:
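A minimal sketch of cost-model-driven algorithm selection, using a hypothetical size cutoff in place of measured runtimes (the real mechanism would compare the measured cost of each candidate algorithm):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cutoff below which insertion sort tends to beat the
// general-purpose sort; a real system would derive it from measurement.
const std::size_t kSmallInput = 16;

// Pick a sort based on a (trivial) cost model: input size only.
void adaptive_sort(std::vector<int>& v) {
  if (v.size() <= kSmallInput) {
    // Insertion sort: cheap on small inputs.
    for (std::size_t i = 1; i < v.size(); ++i) {
      int key = v[i];
      std::size_t j = i;
      for (; j > 0 && v[j - 1] > key; --j)
        v[j] = v[j - 1];
      v[j] = key;
    }
  } else {
    std::sort(v.begin(), v.end());  // introsort in practice
  }
}
```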
Switch: `_GLIBCXX_PROFILE_LOCALITY`.
Diagnostic: SOFTWARE_PREFETCH. Cost model: Total distance between pointer values of successive elements in vectors of pointers.
Example:

```
1 int zero = 0;
2 vector<int*> v(10000000, &zero);
3 for (int k = 0; k < 10000000; ++k) {
4   v[random() % 10000000] = new int(k);
5 }
6 for (int j = 0; j < 10000000; ++j) {
7   count += (*v[j] == 0 ? 0 : 1);
8 }

foo.cc:7: advice: Insert prefetch instruction.
```
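Applying the advice by hand with GCC's `__builtin_prefetch` might look like the sketch below; the prefetch distance of 8 is a guess that would need tuning for the target machine:

```cpp
#include <cstddef>
#include <vector>

// Prefetch the target of a pointer a few iterations ahead of its
// dereference, so randomly scattered targets arrive in cache in time.
long count_nonzero(const std::vector<int*>& v) {
  const std::size_t kAhead = 8;  // hypothetical prefetch distance
  long count = 0;
  for (std::size_t j = 0; j < v.size(); ++j) {
    if (j + kAhead < v.size())
      __builtin_prefetch(v[j + kAhead]);  // GCC/Clang builtin
    count += (*v[j] == 0 ? 0 : 1);
  }
  return count;
}
```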
Diagnostic: RBTREE_LOCALITY. Fundamentals: Allocation can be tuned to a specific traversal pattern to produce better data locality. See the paper "Custom Memory Allocation for Free".
Sample runtime reduction: 30%.
Recommendation: High scatter score N for container built at site S. Consider changing the allocation sequence or choosing a structure-conscious allocator.
To instrument: Methods of all containers using linked structures.
Analysis: First, get the cache line size and page size from the system. Then, for each traversal method such as find, record the number of successive elements that are on a different line or page. Give advice only if the ratio between this number and the total number of node hops is above a threshold.
Cost model: Sum(same_cache_line(this,previous))
Example:

```
1 set<int> s;
2 for (int i = 0; i < 10000000; ++i) {
3   s.insert(i);
4 }
5 set<int> s1, s2;
6 for (int i = 0; i < 10000000; ++i) {
7   s1.insert(i);
8   s2.insert(i);
9 }
...
// Fast, better locality.
10 for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
11   sum += *it;
12 }
// Slow, elements are further apart.
13 for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
14   sum += *it;
15 }

foo.cc:5: advice: High scatter score NNN for set built here. Consider changing the allocation sequence or switching to a structure conscious allocator.
```
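The scatter measurement above can be sketched from user code by inspecting element addresses during an in-order traversal; the 64-byte cache line size is an assumption that the real instrumentation would query from the system:

```cpp
#include <cstdint>
#include <set>

// Count successive elements of an in-order traversal whose nodes land on
// different cache lines (64 bytes assumed).
long scatter_score(const std::set<int>& s) {
  const std::uintptr_t kLineSize = 64;  // assumed cache line size
  long score = 0;
  const int* prev = nullptr;
  for (const int& x : s) {
    if (prev != nullptr &&
        reinterpret_cast<std::uintptr_t>(&x) / kLineSize !=
            reinterpret_cast<std::uintptr_t>(prev) / kLineSize)
      ++score;  // line changed between successive elements
    prev = &x;
  }
  return score;
}
```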
Switch: `_GLIBCXX_PROFILE_MULTITHREADED`.
Recommendation: Change data distribution or parallel algorithm.
Analysis: Keep a shadow for each container. Record iterator dereferences and container member accesses. Issue advice for elements referenced by multiple threads. See paper: The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization.
Cost model: Number of accesses to elements referenced from multiple threads.
Example:
Diagnostic: FALSE_SHARING. Recommendation: Reorganize container or use padding to avoid false sharing.
Cost model: Number of accesses to the same cache line from different threads.
Example:

```
1 vector<int> v(2, 0);
2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3 for (i = 0; i < SIZE; ++i) {
4   v[i % 2] += i;
5 }

OMP_NUM_THREADS=2 ./a.out

foo.cc:1: advice: Change container structure or padding to avoid false sharing in multithreaded access at foo.cc:4. Detected N shared cache lines.
```
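One way to apply the padding advice is to align each slot to its own cache line, so the two threads in the example stop invalidating each other's line; the 64-byte line size is an assumption:

```cpp
#include <vector>

// Force each slot onto its own (assumed 64-byte) cache line.
struct alignas(64) PaddedInt {
  int value = 0;
};

// Same computation as the OpenMP loop, with padded slots; in the parallel
// version each thread would touch only one slot, on its own line.
long accumulate_padded(int size) {
  std::vector<PaddedInt> v(2);
  for (int i = 0; i < size; ++i)
    v[i % 2].value += i;
  return static_cast<long>(v[0].value) + v[1].value;
}
```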