I'm totally new to SSE programming, but have an Intel Core i7 processor.

Basically, I want to take 4 32-bit unsigned integers and cube them all (raise them to the power of 3) at once. It is my understanding that the SIMD functionality of SSE and its successors makes this possible, but how in the world do I go about doing it? Preferably in C, but I could manage assembly if necessary.

Edit to make clear my final goal:

Then, I want to add all the cubes together to come up with a single number.

Background: I'm trying to use SSE to optimize checking whether a number is an Armstrong number (a three-digit number that equals the sum of the cubes of its digits). An example is 153. There seems to be no way to do this other than brute force. These are a subset of the narcissistic numbers, which equal the sum of their digits each raised to the power of the number of decimal digits. I'd like to eventually expand this to be more flexible, but to start I'm just doing the Armstrong numbers. As you might imagine, this came up on another site and a few of us are trying to optimize the hell out of it. Using your ideas and my own research, I came up with this code:

```
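#include <stdio.h>
#include <smmintrin.h> // SSE 4.1

// Cube each of the four 32-bit lanes (the low 32 bits of each product are kept).
__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main(int argc, const char * argv[]) {
    for (unsigned int i = 1; i <= 500; i++) {
        // Split i into its three decimal digits.
        unsigned int firstDigit = i / 100;
        unsigned int secondDigit = (i - firstDigit * 100) / 10;
        unsigned int thirdDigit = (i - firstDigit * 100 - secondDigit * 10);
        __m128i v = _mm_setr_epi32(0, firstDigit, secondDigit, thirdDigit);
        // Reinterpret the integer lanes as floats so that _mm_hadd_ps accepts them.
        // Caution: _mm_hadd_ps is a floating-point add. It gives the integer sum here
        // only because these small bit patterns are denormals; _mm_hadd_epi32 is the
        // integer horizontal add.
        __m128 v3 = (__m128) vcube(v);
        v3 = _mm_hadd_ps(v3, v3);
        v3 = _mm_hadd_ps(v3, v3);
        // After two horizontal adds, lane 0 holds the sum of all four lanes.
        if ((unsigned int) _mm_extract_epi32((__m128i) v3, 0) == i)
            printf ("%03u is an Armstrong number\n", i);
    }
    return 0;
}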
```

Note: I had to do some type coercions to get it to compile in some systems (Solaris, at least some Linux).

So this works, but maybe it could be streamlined. Sorry I didn't post the whole task, but I was trying to break it down into steps and I wanted to make sure each digit was correctly cubed.
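One way the summing step might be streamlined is to stay in the integer domain and skip the float reinterpretation entirely. The following is only a sketch: `hsum_u32x4` is a hypothetical helper name, and it uses a shuffle-and-add pattern (SSE2 only) rather than a horizontal-add instruction.

```c
#include <emmintrin.h> // SSE2 integer intrinsics (also provides _MM_SHUFFLE)
#include <stdint.h>

// Hypothetical helper: add the four 32-bit lanes of v and return the total.
uint32_t hsum_u32x4(__m128i v)
{
    // Swap the two 64-bit halves and add: each lane now holds lane[i] + lane[i ^ 2].
    __m128i sum = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // Swap adjacent lanes and add: every lane now holds the full total.
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    // Read lane 0 back as a scalar.
    return (uint32_t)_mm_cvtsi128_si32(sum);
}
```

With this, the test in the loop would become something like `if (hsum_u32x4(vcube(v)) == i)`, with no casts between `__m128` and `__m128i`.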

(END EDIT)

Thank you!

Edit: I guess I should add I'm running Mac OS X Sierra.

EDIT AGAIN:

So, let's say I make all of these unsigned shorts instead of unsigned ints and add more digits. How do I add them together when a short may not be able to hold the sum of all the digits? Is there a way to add them and store the result in a vector of larger elements, if you know what I mean, or in a plain larger integer such as a UInt64?
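To make the widening idea concrete, here is a minimal sketch, assuming the eight values sit in one `__m128i` as unsigned 16-bit lanes. The name `hsum_u16x8` is hypothetical. Each lane is zero-extended to 32 bits by interleaving with zeros, so the additions cannot overflow a short.

```c
#include <emmintrin.h> // SSE2 integer intrinsics (also provides _MM_SHUFFLE)
#include <stdint.h>

// Hypothetical helper: sum eight unsigned 16-bit lanes into a 32-bit total.
uint32_t hsum_u16x8(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    // Interleaving with zero turns each 16-bit lane into a zero-extended 32-bit lane.
    __m128i lo = _mm_unpacklo_epi16(v, zero); // lanes 0..3
    __m128i hi = _mm_unpackhi_epi16(v, zero); // lanes 4..7
    __m128i sum = _mm_add_epi32(lo, hi);
    // Reduce the four 32-bit partial sums with two shuffle-and-add steps.
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(sum);
}
```

If many vectors need to be combined, the returned 32-bit totals could be accumulated in an ordinary `uint64_t` variable. SSE4.1 also has `_mm_cvtepu16_epi32`, which zero-extends the low four lanes in one step.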

Sorry for all the questions, but like I said I'm totally new to vector processing, even though I've had access to it since my first Mac G4.

If your input values are in the range 0..1625 (so that the cube fits in an unsigned 32-bit lane, since 1625^3 = 4291015625 is less than 2^32 = 4294967296), then you can use `_mm_mullo_epi32`:

```
__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
```

Demo:

```
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1

__m128i vcube(const __m128i v)
{
    return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}

int main()
{
    __m128i v = _mm_setr_epi32(0, 1, 1000, 1625);
    __m128i v3 = vcube(v);
    // %vlu is a non-standard vector format specifier, hence the -Wno-format-* flags.
    printf("%vlu => %vlu\n", v, v3);
    return 0;
}
```

Compile and test:

```
$ gcc -Wall -Wno-format-invalid-specifier -Wno-format-extra-args -msse4 vcube.c && ./a.out
0 1 1000 1625 => 0 1 1000000000 4291015625
```