<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>rhl</title>
	<atom:link href="http://ryanlewis.net/feed" rel="self" type="application/rss+xml" />
	<link>http://ryanlewis.net</link>
	<description>math nerd on the internet</description>
	<lastBuildDate>Tue, 17 Jan 2012 17:08:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>matrix-vector multiply with openmp</title>
		<link>http://ryanlewis.net/p/matrix-vector-omp</link>
		<comments>http://ryanlewis.net/p/matrix-vector-omp#comments</comments>
		<pubDate>Mon, 25 Jul 2011 16:07:22 +0000</pubDate>
		<dc:creator>rhl</dc:creator>
				<category><![CDATA[c++]]></category>
		<category><![CDATA[code]]></category>

		<guid isPermaLink="false">http://ryanlewis.net/?p=291</guid>
		<description><![CDATA[For this post, let&#8217;s take an &#8220;embarrassingly&#8221; parallel problem and solve it in parallel using OpenMP. One such example is matrix-vector multiplication. For this problem I will use the Boost uBLAS library. Let&#8217;s first write our own matrix-vector product, which runs in serial: which produces: Great, we now have a properly [...]]]></description>
			<content:encoded><![CDATA[<p>For this post, let&#8217;s take an &#8220;embarrassingly&#8221; parallel problem and solve it in parallel using OpenMP. One such example is <a href="http://en.wikipedia.org/wiki/Matrix_multiplication">matrix-vector multiplication</a>.</p>
<p>For this problem I will use the <a href="http://www.boost.org/doc/libs/1_38_0/libs/numeric/ublas/doc/index.htm">Boost uBLAS library</a>. </p>
<p>Let&#8217;s first write our own matrix-vector product, which runs in serial: </p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
//STL
#include &lt;cstdlib&gt;   // rand, srand, atoi, exit
#include &lt;ctime&gt;     // time
#include &lt;iostream&gt;
//Boost
#include &lt;boost/numeric/ublas/matrix.hpp&gt;
#include &lt;boost/numeric/ublas/io.hpp&gt;
//Types
using namespace boost::numeric::ublas;
typedef matrix&lt;double&gt; matrix_t;
typedef vector&lt;double&gt; vector_t;

double random_double(){
      return (double)rand()/(double)RAND_MAX;
}
template&lt;typename Matrix, typename Vector&gt;
void init_matrix_vector(Matrix &amp; m, Vector&amp; v){
	//fill m and v with random values between 0 and 10
	srand((unsigned)time(0));
	for (int i = 0; i &lt; m.size1(); ++i){
		v(i) = 10*random_double();
		for (int j = 0; j &lt; m.size2(); ++j)
			m(i,j) = 10*random_double();
	}
}

template&lt;typename Matrix, typename Vector&gt;
void product(Matrix &amp; m, Vector &amp; v, Vector &amp; r){
	for(int i = 0; i &lt; m.size1(); ++i){
		r(i) = 0.0;
		for(int j = 0; j &lt; m.size2(); ++j){
			r(i) += m(i,j)*v(j);
		}
	}
}

int main(int argc,char * argv[]){
	if (argc != 2){
		std::cout &lt;&lt; &quot;usage: &quot;
		          &lt;&lt; argv[0]
		          &lt;&lt; &quot; N&quot;
		          &lt;&lt; std::endl;
		std::exit(-1);
	}
	int N = atoi(argv[1]);
        //create matrix/vector
	matrix_t m (N, N);
	vector_t v(N);
	init_matrix_vector(m,v);
	vector_t r(N);
	product(m,v,r);
	std::cout &lt;&lt; &quot;matrix: &quot; &lt;&lt; m &lt;&lt; std::endl;
	std::cout &lt;&lt; &quot;vector: &quot; &lt;&lt; v &lt;&lt; std::endl;
	std::cout &lt;&lt; &quot;boost product: &quot; &lt;&lt; prod(m,v) &lt;&lt; std::endl;
	std::cout &lt;&lt; &quot;our (serial) product: &quot; &lt;&lt; r &lt;&lt; std::endl;
}
</pre>
<p>which produces:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
./a.out 2
matrix: [2,2]((3.69343,5.94287),(7.27436,3.44084))
vector: [2](0.72093,0.580349)
boost product: [2](6.11165,7.24119)
our (serial) product: [2](6.11165,7.24119)
</pre>
<p>Great, we now have a properly working product()<br />
function to which we can add parallelism.</p>
<p><b>Note</b> that adding parallelism boils down to threading the outer for loop in our<br />
product() function, since each r(i) is computed independently of the others.</p>
<p>Luckily, OpenMP provides a compiler directive just for this: the &#8216;parallel for&#8217;<br />
construct. Consider the following modified product:</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
template&lt;typename Matrix, typename Vector&gt;
void parallel_product(Matrix &amp; m, Vector &amp; v, Vector &amp; r){
	#pragma omp parallel for num_threads(2)
        for(int i = 0; i &lt; m.size1(); ++i){
                r(i) = 0.0;
                for(int j = 0; j &lt; m.size2(); ++j){
                        r(i) += m(i,j)*v(j);
                }
        }
}
</pre>
<p>Statements which begin with #pragma are directives to the compiler. A compiler with OpenMP
support uses them to transform the annotated code; a compiler without it typically ignores them.</p>
<p>This directive in particular tells the compiler to split the iterations of the
following for loop into chunks of work and run them on a team of threads.
As one would expect, the num_threads() clause requests how many threads
that team should have. </p>
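<p>To make the idea of &#8216;chunks of work&#8217; concrete, here is a rough sketch of how the rows might be divided among the threads. The helper chunk_bounds() is hypothetical and not part of OpenMP; it assumes an even, contiguous split, whereas the actual division is chosen by the OpenMP runtime and its schedule:</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
#include &lt;utility&gt;

//Hypothetical helper (not an OpenMP function): returns the half-open
//range [begin, end) of loop iterations handed to thread tid when n
//iterations are split as evenly as possible among num_threads threads.
std::pair&lt;int,int&gt; chunk_bounds(int n, int num_threads, int tid){
	int base  = n / num_threads;  //iterations every thread gets
	int extra = n % num_threads;  //the first 'extra' threads get one more
	int begin = tid*base + (tid &lt; extra ? tid : extra);
	int end   = begin + base + (tid &lt; extra ? 1 : 0);
	return std::make_pair(begin, end);
}
</pre>
<p>Under this assumption, with N = 5 rows and two threads, thread 0 would compute rows 0 to 2 and thread 1 rows 3 to 4.</p>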
<p>Also, the for() loop could have iterated over a range, as long as the iterator type is a <a href="http://www.cplusplus.com/reference/std/iterator/RandomAccessIterator/">random access iterator</a>.  </p>
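<p>As a small sketch of the iterator form, assuming a compiler that supports OpenMP 3.0 or later, the function below (scale() is a made-up name for illustration) walks a std::vector with its random access iterator:</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
#include &lt;vector&gt;

//Multiplies every element of xs by factor. The iterations are shared
//among threads when compiled with -fopenmp; otherwise the pragma is
//ignored and the loop runs serially.
void scale(std::vector&lt;double&gt; &amp; xs, double factor){
	#pragma omp parallel for
	for(std::vector&lt;double&gt;::iterator it = xs.begin(); it &lt; xs.end(); ++it){
		*it *= factor;
	}
}
</pre>
<p>The loop test is written with &lt; rather than the more usual != because earlier OpenMP versions only accept relational comparisons in the loop condition.</p>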
<p>Putting this all together we have:</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
//STL
#include &lt;cstdlib&gt;   // rand, srand, atoi
#include &lt;ctime&gt;     // time
#include &lt;iostream&gt;

//OMP
#include &lt;omp.h&gt;

//Boost
#include &lt;boost/numeric/ublas/matrix.hpp&gt;
#include &lt;boost/numeric/ublas/io.hpp&gt;

//Types
using namespace boost::numeric::ublas;
typedef matrix&lt;double&gt; matrix_t;
typedef vector&lt;double&gt; vector_t;

double random_double(){
	return (double)rand()/(double)RAND_MAX;
}

template&lt;typename Matrix, typename Vector&gt;
void init_matrix_vector(Matrix &amp; m, Vector&amp; v){
	srand((unsigned)time(0));
	for (int i = 0; i &lt; m.size1(); ++i){
                v(i) = 10*random_double();
		for (int j = 0; j &lt; m.size2(); ++j)
			m(i,j) = 10*random_double();
	}
}

template&lt;typename Matrix, typename Vector&gt;
void product(Matrix &amp; m, Vector &amp; v, Vector &amp; r){
	for(int i = 0; i &lt; m.size1(); ++i){
		r(i) = 0.0;
		for(int j = 0; j &lt; m.size2(); ++j){
			r(i) += m(i,j)*v(j);
		}
	}
}

template&lt;typename Matrix, typename Vector&gt;
void parallel_product(Matrix &amp; m, Vector &amp; v, Vector &amp; r){
	#pragma omp parallel for num_threads(2)
        for(int i = 0; i &lt; m.size1(); ++i){
                r(i) = 0.0;
                for(int j = 0; j &lt; m.size2(); ++j){
                        r(i) += m(i,j)*v(j);
                }
        }
}

int main(int argc,char * argv[]){
	if (argc != 2){
		std::cout &lt;&lt; &quot;usage: &quot;
		          &lt;&lt; argv[0]
		          &lt;&lt; &quot; N&quot;
		          &lt;&lt; std::endl;
		//stop here: without an argument argv[1] must not be read
		return -1;
	}
	int N = atoi(argv[1]);
	matrix_t m (N, N);
	vector_t v(N);
	init_matrix_vector(m,v);
	vector_t r(N);
	vector_t r1(N);
	double serial_time = omp_get_wtime();
	product(m,v,r);
	serial_time = omp_get_wtime() - serial_time;
	double parallel_time = omp_get_wtime();
	parallel_product(m,v,r1);
	parallel_time = omp_get_wtime() - parallel_time;
	for(int i = 0; i &lt; N; ++i){
		if (r(i) != r1(i)){
			std::cerr &lt;&lt; &quot;results differ @&quot;
				  &lt;&lt; i
				  &lt;&lt; std::endl;
			std::cerr &lt;&lt; r(i)
				  &lt;&lt; &quot; &quot;
				  &lt;&lt; r1(i)
				  &lt;&lt; std::endl;
			return -1;
		}
	}
	std::cout &lt;&lt; &quot;the result is the same&quot;
		  &lt;&lt; std::endl;
	std::cout &lt;&lt; &quot;serial time: &quot;
		  &lt;&lt; serial_time &lt;&lt; std::endl;
	std::cout &lt;&lt; &quot;parallel time: &quot;
		  &lt;&lt; parallel_time &lt;&lt; std::endl;
	return 0;
}
</pre>
<p>When I run this on my dual-core laptop, it produces:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
$ ./a.out 1000
the result is the same
serial time: 0.094412
parallel time: 0.048241

$ ./a.out 2000
the result is the same
serial time: 0.376123
parallel time: 0.191999

$ ./a.out 5000
the result is the same
serial time: 2.3542
parallel time: 1.21085

$ ./a.out 10000
the result is the same
serial time: 9.41049
parallel time: 4.84689
</pre>
<p>We can see that with two threads we nicely get ~2x speedup. </p>
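<p>To put a number on it, speedup is the serial time divided by the parallel time, and dividing that by the number of threads gives the efficiency. A small sketch (the function names are just for illustration):</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
//speedup = serial time / parallel time
double speedup(double serial_time, double parallel_time){
	return serial_time / parallel_time;
}

//efficiency = speedup / number of threads (1.0 would be perfect scaling)
double efficiency(double serial_time, double parallel_time, int threads){
	return speedup(serial_time, parallel_time) / threads;
}
</pre>
<p>For N = 1000 the timings above give 0.094412/0.048241, about 1.96, and for N = 10000 they give 9.41049/4.84689, about 1.94, so the efficiency with two threads is roughly 97 to 98 percent.</p>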
<p>Notice the use of omp_get_wtime() as the timer for this code. </p>
<p>What happens if we time the code with std::clock()?</p>
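<p>One way to find out is to measure both clocks around the same parallel loop. The sketch below is not the program above: square_all() and time_both() are made-up names, and std::chrono::steady_clock (C++11) stands in for omp_get_wtime() as the wall-clock timer. On systems where std::clock() reports processor time added up over all threads of the process (Linux with glibc behaves this way, as far as I know), the CPU time with two busy threads should come out at roughly twice the wall-clock time, so timing with std::clock() would hide the speedup:</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
#include &lt;chrono&gt;
#include &lt;ctime&gt;
#include &lt;vector&gt;

struct timings {
	double cpu_seconds;   //measured with std::clock()
	double wall_seconds;  //measured with std::chrono::steady_clock
};

//ys[i] = xs[i]^2; the loop is threaded when compiled with -fopenmp
void square_all(const std::vector&lt;double&gt; &amp; xs, std::vector&lt;double&gt; &amp; ys){
	#pragma omp parallel for
	for(int i = 0; i &lt; (int)xs.size(); ++i){
		ys[i] = xs[i]*xs[i];
	}
}

//Runs square_all() once and reports the elapsed time on both clocks.
//Assumes ys already has the same size as xs.
timings time_both(const std::vector&lt;double&gt; &amp; xs, std::vector&lt;double&gt; &amp; ys){
	std::clock_t c0 = std::clock();
	std::chrono::steady_clock::time_point w0 = std::chrono::steady_clock::now();
	square_all(xs, ys);
	std::clock_t c1 = std::clock();
	std::chrono::steady_clock::time_point w1 = std::chrono::steady_clock::now();
	timings t;
	t.cpu_seconds  = double(c1 - c0)/CLOCKS_PER_SEC;
	t.wall_seconds = std::chrono::duration&lt;double&gt;(w1 - w0).count();
	return t;
}
</pre>
<p>Comparing cpu_seconds with wall_seconds for a large vector, with and without -fopenmp, shows which of the two clocks reflects the time we actually waited.</p>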
]]></content:encoded>
			<wfw:commentRss>http://ryanlewis.net/p/matrix-vector-omp/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>multi-threading with openmp</title>
		<link>http://ryanlewis.net/p/multi-threading-with-openmp</link>
		<comments>http://ryanlewis.net/p/multi-threading-with-openmp#comments</comments>
		<pubDate>Sun, 17 Jul 2011 12:27:43 +0000</pubDate>
		<dc:creator>rhl</dc:creator>
				<category><![CDATA[c++]]></category>
		<category><![CDATA[fedora]]></category>

		<guid isPermaLink="false">http://ryanlewis.net/?p=259</guid>
		<description><![CDATA[<a title="moore's law" href="http://en.wikipedia.org/wiki/Moore%27s_law">moore's law</a> is dead. at least that is what we keep hearing. In the meantime we programmers need something to speed up our codes. that something is parallel programming.]]></description>
			<content:encoded><![CDATA[<p><a title="moore's law" href="http://en.wikipedia.org/wiki/Moore%27s_law">moore&#8217;s law</a> is dead. at least that is what we keep hearing. personally, i think that while it might take time there will be talented engineers  who will figure out a way to keep moore&#8217;s law going. in the meantime we programmers need something to speed up our codes. that something is parallel programming.</p>
<p>most likely the machine you are using to read this blog post has a <a title="multi-core processor " href="http://en.wikipedia.org/wiki/Multi-core_processor">multi-core processor</a> in it. the idea is that your machine can execute several streams of instructions at the same time, one per core. i.e., if you have a total of four cores, then in theory, the programmer could make some computer programs run four times faster.</p>
<p>one way a programmer can take advantage of a multi-core machine is an API known as <a href="http://en.wikipedia.org/wiki/OpenMP" title="OpenMP">OpenMP</a> or &#8216;open multi-processing.&#8217;</p>
<p>let&#8217;s start with a simple familiar example: `hello world.`</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
$ cat main.cpp

#include&lt;iostream&gt;
int main(int argc, char** argv)
{
std::cout &lt;&lt; &quot;Hello, World!&quot; &lt;&lt; std::endl;
}

$ gcc main.cpp -lstdc++

$ ./a.out
Hello, World!
</pre>
<p>The OpenMP API provides a number of compiler directives for creating threads, written using the <a href="http://en.wikipedia.org/wiki/Preprocessor" title="preprocessor">preprocessor</a>&#8217;s #pragma syntax. A directive is an action and the Greek word for action is &#8216;pragma.&#8217; Thus to give a directive to the compiler we write:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">#pragma omp [directive content here]</pre>
<p>The first directive we will demonstrate is `parallel`. This directive creates a team of threads, each of which executes the statement or block that follows it (the parallel region):</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
#include&lt;iostream&gt;
int main(int argc, char** argv)
{
	#pragma omp parallel
	std::cout &lt;&lt; &quot;Hello, World!&quot; &lt;&lt; std::endl;
}
</pre>
<p>Now in order to have the compiler actually process OpenMP directives, we need to enable the openmp flag in gcc: `-fopenmp`  </p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
$ gcc main.cpp -fopenmp -lstdc++
</pre>
<p>And now when we run our binary we see something like:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
$ ./a.out
Hello, World!
Hello, World!
Hello, World!
</pre>
<p>And now we have written our first OpenMP program!<br />
However, if you run the program a few more times you will see:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
$ ./a.out
Hello, World!Hello, World!

$ ./a.out
Hello, World!Hello, World!

$ ./a.out
Hello, World!
Hello, World!
$ ./a.out
Hello, World!Hello, World!
</pre>
<p>Notice how the output of our print statements is interleaved in an arbitrary way! This is because the threads write to std::cout at the same time, and since nothing controls when each thread executes this code, we will sometimes get unexpected results. </p>
<p>To correct this we need another directive, the &#8216;critical section.&#8217; This directive tells the compiler that the code which follows may only be executed by one thread at a time.</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
#include&lt;iostream&gt;
int main(int argc, char** argv)
{
  #pragma omp parallel
  {
     //All code in this block is in the parallel region
     #pragma omp critical
     {
        //only one thread is able to execute code
        //in a 'critical' section at a time.
        std::cout &lt;&lt; &quot;Hello, World!&quot; &lt;&lt; std::endl;
     }
   }
}
</pre>
<p>And now when we compile and run we will always get a similar result:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
$ ./a.out
Hello, World!
Hello, World!
$ ./a.out
Hello, World!
Hello, World!
$ ./a.out
Hello, World!
Hello, World!
$ ./a.out
Hello, World!
Hello, World!
</pre>
<p>In addition to compiler directives, the OpenMP API also provides a number of runtime library functions. omp_get_thread_num(), for example, returns an integer 0,..,N-1 identifying which of the N threads in the team is calling the function.<br />
Consider:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
#include&lt;iostream&gt;
#include &lt;omp.h&gt;
int main(int argc, char** argv)
{
  #pragma omp parallel
  {
     //All code in this block is in the parallel region
     #pragma omp critical
     {
        //only one thread is able to execute code
        //in a 'critical' section at a time.
        std::cout &lt;&lt; &quot;Hello, World: From Thread &quot;
                  &lt;&lt; omp_get_thread_num()
                  &lt;&lt; std::endl;
     }
   }
}
</pre>
<p>which, with two threads, prints something like:</p>
<pre class="brush: plain; light: false; title: ; toolbar: true; notranslate">
$ ./a.out
Hello, World: From Thread 0
Hello, World: From Thread 1
</pre>
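<p>If the program only ever prints from thread 0, it is worth checking that -fopenmp was actually passed. A compiler with OpenMP enabled defines the macro _OPENMP, whose value encodes the date (yyyymm) of the specification it supports. A minimal sketch, where openmp_version() is a made-up helper name:</p>
<pre class="brush: cpp; light: false; title: ; toolbar: true; notranslate">
//Hypothetical helper: returns the value of the _OPENMP macro,
//or 0 when the code was compiled without OpenMP enabled.
long openmp_version(){
#ifdef _OPENMP
	return _OPENMP;
#else
	return 0;
#endif
}
</pre>
<p>Printing openmp_version() from main() should show 0 when the flag is missing and a date-like number when it is present.</p>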
]]></content:encoded>
			<wfw:commentRss>http://ryanlewis.net/p/multi-threading-with-openmp/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

