Description
CS610 – Assignment 2
Shivam Sharma (210983, sshivam21@iitk.ac.in)
1 Problem 1
NOTE – The orignal testcase given was too small for any meaningful benchmarking and perf c2c could not sample enough times from it, so every benchmark here is on a file split into 5 files of 1MB each, the input is biginput. The mentioned file is also submitted.
The given code runs in about 145ms on my machine. Running perf c2c gives the following output –
This indicates that 3 cache lines may be false shared. Descibed below are the performance bugs and optimizations,
1.1 False Sharing of tracker.word_count
Fix – Seperate the cache lines by adding 56 byte padding and aligning the start address to 64. This is implemented in problem1_pad_shared.cpp and reduced the runtime to 110ms. Running perf now
gives,
We have successfully avoided one false sharing.
1.2 Unnecessary Contention of Locks
The threads continuously try to accquire tracker.word_count_mutex and line_count_mutex, but the critical section just contains of incrementing an integer. This leads to unnecessary contention of these locks and false sharing of the cache lines containing tracker.total_words_processed and tracker.total_lines_processed.
Fix – Maintain the counts in a local variable for each thread. When the thread is about to exit, acquire the locks and update the global tracker. This is implemented in problem1_less_contention.cpp and reduced the runtime to 39ms. Note that this run only contains of this optimization, not the previous one. Running perf now gives,
Finally, both of these are implemented in problem1_opt.cpp and the runtime is 18ms. The speedup from the original code is 8.05 times. Running perf now gives,
2 Problem 2
The following points describe the various synchronization issues in the problem and the approach taken to solve them,
2.1 Starting of Threads
It is mentioned in the problem that the threads should contend for the file after all of them are created. A pthread_barrier_t reference is passed to all threads and they call wait on it.
2.2 Atomic Read of L Lines
The threads are required to read L lines atomically. For this a pointer to AtomicFile is passed to all threads, which is a wrapper around a file. A thread has to accquire a mutex before reading and releases it when the read of L lines is complete.
2.3 Atomic Write of L lines into the Buffer
After reading, the threads have to write these L lines into a FIFO buffer. For this, all threads including the consumer are passed a pointer to FIFOBuffer which is a wrapper around std::queue. The producer has to acquire a writer lock(to make sure there is only one writer at a time) before pushing and another lock to protect the queue and make sure that either the producer or the consumer have access to it at a time. It may be the case that while a producer thread has both the locks the buffer gets full and it is no longer able to write. In this case, the thread will wait on a condition variable. When the consumer finally pops something, it would sigal the producer. The case when there is nothing to pop in the consumer thread is handled in a similar way by another condition variable. The FIFOBuffer also maintains a reference count of the producer threads which may write more to the queue. When a producer thread has no more to read, it decrements this ref count. When this count falls to 0, the consumer can exit. Naturally, this ref count also needs to be protected by a lock.
3 Problem 3
For every pair of addresses, that contain at least one write, let’s perform the delta test,
3.1 A(i,j − i) and A(i,j − i − 1)
Checking for flow dependence,
2
i = i + ∆i =⇒ ∆i = 0
j − i = (j + ∆j) − (i + ∆i) − 1 =⇒ ∆j = 1
The direction vector is (o,+), which is valid and the Flow Dependence is carried by loop j. Checking for anti dependence,
i + ∆i = i =⇒ ∆i = 0
(j + ∆j) − (i + ∆i) = j − i − 1 =⇒ ∆j = −1
The direction vector is (o,−), which is invalid, so No anti dependence.
3.2 A(i,j − i) and A(i + 1,j − i)
Checking for flow dependence,
i = i + ∆i + 1 =⇒ ∆i = −1
j − i = (j + ∆j) − (i + ∆i) =⇒ ∆j = −1
The direction vector is (−,−), which is invalid, so No Flow dependence. Checking for anti dependence,
i + ∆i = i + 1 =⇒ ∆i = 1
(j + ∆j) − (i + ∆i) = j − i =⇒ ∆j = 1
The direction vector is (+,+), which is valid, so Anti Dependence is carried by loop i.
3.3 A(i,j − i) and A(i − 1,j + i − 1)
Checking for flow dependence,
i = i + ∆i − 1 =⇒ ∆i = 1
j − i = (j + ∆j) + (i + ∆i) − 1 =⇒ ∆j = −2 ∗ i
The direction vector is (+,−), which is valid, so Flow dependence is carried by loop i. Checking for anti dependence,
i + ∆i = i − 1 =⇒ ∆i = −1
(j + ∆j) − (i + ∆i) = j + i − 1 =⇒ ∆j = 2 ∗ i − 2
The direction vector is (−,+), which is invalid, so No Anti Dependence.
3.4 A(i,j − i) and A(i,j − i)
Checking for output dependence,
i = i + ∆i =⇒ ∆i = 0
j − i = (j + ∆j) + (i + ∆i) =⇒ ∆j = 0
So, No output dependence.
3
Reviews
There are no reviews yet.