Alexandre Savard | 1b09e31 | 2012-08-07 20:33:29 -0400 | 1 | #!/usr/bin/env perl |
| 2 | # |
| 3 | # ==================================================================== |
| 4 | # Written by Andy Polyakov <appro@openssl.org> for the OpenSSL |
| 5 | # project. The module is, however, dual licensed under OpenSSL and |
| 6 | # CRYPTOGAMS licenses depending on where you obtain it. For further |
| 7 | # details see http://www.openssl.org/~appro/cryptogams/. |
| 8 | # ==================================================================== |
| 9 | # |
| 10 | # March, May, June 2010 |
| 11 | # |
| 12 | # The module implements "4-bit" GCM GHASH function and underlying |
| 13 | # single multiplication operation in GF(2^128). "4-bit" means that it |
| 14 | # uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two |
| 15 | # code paths: vanilla x86 and vanilla MMX. Former will be executed on |
| 16 | # 486 and Pentium, latter on all others. MMX GHASH features so called |
| 17 | # "528B" variant of "4-bit" method utilizing additional 256+16 bytes |
| 18 | # of per-key storage [+512 bytes shared table]. Performance results |
| 19 | # are for streamed GHASH subroutine and are expressed in cycles per |
| 20 | # processed byte, less is better: |
| 21 | # |
| 22 | # gcc 2.95.3(*) MMX assembler x86 assembler |
| 23 | # |
| 24 | # Pentium 105/111(**) - 50 |
| 25 | # PIII 68 /75 12.2 24 |
| 26 | # P4 125/125 17.8 84(***) |
| 27 | # Opteron 66 /70 10.1 30 |
| 28 | # Core2 54 /67 8.4 18 |
| 29 | # |
| 30 | # (*) gcc 3.4.x was observed to generate few percent slower code, |
| 31 | # which is one of reasons why 2.95.3 results were chosen, |
| 32 | # another reason is lack of 3.4.x results for older CPUs; |
| 33 | # comparison with MMX results is not completely fair, because C |
| 34 | # results are for vanilla "256B" implementation, while |
| 35 | # assembler results are for "528B";-) |
| 36 | # (**) second number is result for code compiled with -fPIC flag, |
| 37 | # which is actually more relevant, because assembler code is |
| 38 | # position-independent; |
| 39 | # (***) see comment in non-MMX routine for further details; |
| 40 | # |
| 41 | # To summarize, it's >2-5 times faster than gcc-generated code. To |
| 42 | # anchor it to something else SHA1 assembler processes one byte in |
| 43 | # 11-13 cycles on contemporary x86 cores. As for choice of MMX in |
| 44 | # particular, see comment at the end of the file... |
| 45 | |
| 46 | # May 2010 |
| 47 | # |
| 48 | # Add PCLMULQDQ version performing at 2.10 cycles per processed byte. |
| 49 | # The question is how close is it to theoretical limit? The pclmulqdq |
| 50 | # instruction latency appears to be 14 cycles and there can't be more |
| 51 | # than 2 of them executing at any given time. This means that single |
| 52 | # Karatsuba multiplication would take 28 cycles *plus* few cycles for |
| 53 | # pre- and post-processing. Then multiplication has to be followed by |
| 54 | # modulo-reduction. Given that aggregated reduction method [see |
| 55 | # "Carry-less Multiplication and Its Usage for Computing the GCM Mode" |
| 56 | # white paper by Intel] allows you to perform reduction only once in |
| 57 | # a while we can assume that asymptotic performance can be estimated |
| 58 | # as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction |
| 59 | # and Naggr is the aggregation factor. |
| 60 | # |
| 61 | # Before we proceed to this implementation let's have closer look at |
| 62 | # the best-performing code suggested by Intel in their white paper. |
| 63 | # By tracing inter-register dependencies Tmod is estimated as ~19 |
| 64 | # cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per |
| 65 | # processed byte. As implied, this is quite optimistic estimate, |
| 66 | # because it does not account for Karatsuba pre- and post-processing, |
| 67 | # which for a single multiplication is ~5 cycles. Unfortunately Intel |
| 68 | # does not provide performance data for GHASH alone. But benchmarking |
| 69 | # AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt |
| 70 | # alone resulted in 2.46 cycles per byte out of 16KB buffer. Note that |
| 71 | # the result accounts even for pre-computing of degrees of the hash |
| 72 | # key H, but its portion is negligible at 16KB buffer size. |
| 73 | # |
| 74 | # Moving on to the implementation in question. Tmod is estimated as |
| 75 | # ~13 cycles and Naggr is 2, giving asymptotic performance of ... |
| 76 | # 2.16. How is it possible that measured performance is better than |
| 77 | # optimistic theoretical estimate? There is one thing Intel failed |
| 78 | # to recognize. By serializing GHASH with CTR in same subroutine |
| 79 | # former's performance is really limited to above (Tmul + Tmod/Naggr) |
| 80 | # equation. But if GHASH procedure is detached, the modulo-reduction |
| 81 | # can be interleaved with Naggr-1 multiplications at instruction level |
| 82 | # and under ideal conditions even disappear from the equation. So that |
| 83 | # optimistic theoretical estimate for this implementation is ... |
| 84 | # 28/16=1.75, and not 2.16. Well, it's probably way too optimistic, |
| 85 | # at least for such small Naggr. I'd argue that (28+Tproc/Naggr)/16, |
| 86 | # where Tproc is time required for Karatsuba pre- and post-processing, |
| 87 | # is more realistic estimate. In this case it gives ... 1.91 cycles. |
| 88 | # Or in other words, depending on how well we can interleave reduction |
| 89 | # and one of the two multiplications the performance should be between |
| 90 | # 1.91 and 2.16. As already mentioned, this implementation processes |
| 91 | # one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart |
| 92 | # - in 2.02. x86_64 performance is better, because larger register |
| 93 | # bank allows to interleave reduction and multiplication better. |
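| | # |
| | # Worked out with the formula above, and assuming Tproc is the ~5 |
| | # cycles quoted earlier for Karatsuba pre- and post-processing, |
| | # the estimates in this note are (rounded to two decimals): |
| | # |
| | # Intel's code, Tmod~19, Naggr=4: (28+19/4)/16 = 2.05 |
| | # this code, Tmod~13, Naggr=2: (28+13/2)/16 = 2.16 |
| | # reduction fully interleaved: 28/16 = 1.75 |
| | # only Tproc~5 left exposed: (28+5/2)/16 = 1.91 |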
| 94 | # |
| 95 | # Does it make sense to increase Naggr? To start with it's virtually |
| 96 | # impossible in 32-bit mode, because of limited register bank |
| 97 | # capacity. Otherwise improvement has to be weighed against slower |
| 98 | # setup, as well as code size and complexity increase. As even |
| 99 | # optimistic estimate doesn't promise 30% performance improvement, |
| 100 | # there are currently no plans to increase Naggr. |
| 101 | # |
| 102 | # Special thanks to David Woodhouse <dwmw2@infradead.org> for |
| 103 | # providing access to a Westmere-based system on behalf of Intel |
| 104 | # Open Source Technology Centre. |
| 105 | |
| 106 | # January 2010 |
| 107 | # |
| 108 | # Tweaked to optimize transitions between integer and FP operations |
| 109 | # on same XMM register, PCLMULQDQ subroutine was measured to process |
| 110 | # one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on Westmere. |
| 111 | # The minor regression on Westmere is outweighed by ~15% improvement |
| 112 | # on Sandy Bridge. Strangely enough attempt to modify 64-bit code in |
| 113 | # similar manner resulted in almost 20% degradation on Sandy Bridge, |
| 114 | # where original 64-bit code processes one byte in 1.95 cycles. |
| 115 | |
| 116 | $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; |
| 117 | push(@INC,"${dir}","${dir}../../perlasm"); |
| 118 | require "x86asm.pl"; |
| 119 | |
| 120 | &asm_init($ARGV[0],"ghash-x86.pl",$x86only = $ARGV[$#ARGV] eq "386"); |
| 121 | |
| 122 | $sse2=0; |
| 123 | for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); } |
| 124 | |
| 125 | ($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx"); |
| 126 | $inp = "edi"; |
| 127 | $Htbl = "esi"; |
| 128 | |
| 129 | $unroll = 0; # Affects x86 loop. Folded loop performs ~7% worse |
| 130 | # than unrolled, which has to be weighed against |
| 131 | # 2.5x x86-specific code size reduction. |
| 132 | |
| 133 | sub x86_loop { |
| 134 | my $off = shift; |
| 135 | my $rem = "eax"; |
| 136 | |
| 137 | &mov ($Zhh,&DWP(4,$Htbl,$Zll)); |
| 138 | &mov ($Zhl,&DWP(0,$Htbl,$Zll)); |
| 139 | &mov ($Zlh,&DWP(12,$Htbl,$Zll)); |
| 140 | &mov ($Zll,&DWP(8,$Htbl,$Zll)); |
| 141 | &xor ($rem,$rem); # avoid partial register stalls on PIII |
| 142 | |
| 143 | # shrd practically kills P4, 2.5x deterioration, but P4 has |
| 144 | # MMX code-path to execute. shrd runs tad faster [than twice |
| 145 | # the shifts, move's and or's] on pre-MMX Pentium (as well as |
| 146 | # PIII and Core2), *but* minimizes code size, spares register |
| 147 | # and thus allows to fold the loop... |
| 148 | if (!$unroll) { |
| 149 | my $cnt = $inp; |
| 150 | &mov ($cnt,15); |
| 151 | &jmp (&label("x86_loop")); |
| 152 | &set_label("x86_loop",16); |
| 153 | for($i=1;$i<=2;$i++) { |
| 154 | &mov (&LB($rem),&LB($Zll)); |
| 155 | &shrd ($Zll,$Zlh,4); |
| 156 | &and (&LB($rem),0xf); |
| 157 | &shrd ($Zlh,$Zhl,4); |
| 158 | &shrd ($Zhl,$Zhh,4); |
| 159 | &shr ($Zhh,4); |
| 160 | &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); |
| 161 | |
| 162 | &mov (&LB($rem),&BP($off,"esp",$cnt)); |
| 163 | if ($i&1) { |
| 164 | &and (&LB($rem),0xf0); |
| 165 | } else { |
| 166 | &shl (&LB($rem),4); |
| 167 | } |
| 168 | |
| 169 | &xor ($Zll,&DWP(8,$Htbl,$rem)); |
| 170 | &xor ($Zlh,&DWP(12,$Htbl,$rem)); |
| 171 | &xor ($Zhl,&DWP(0,$Htbl,$rem)); |
| 172 | &xor ($Zhh,&DWP(4,$Htbl,$rem)); |
| 173 | |
| 174 | if ($i&1) { |
| 175 | &dec ($cnt); |
| 176 | &js (&label("x86_break")); |
| 177 | } else { |
| 178 | &jmp (&label("x86_loop")); |
| 179 | } |
| 180 | } |
| 181 | &set_label("x86_break",16); |
| 182 | } else { |
| 183 | for($i=1;$i<32;$i++) { |
| 184 | &comment($i); |
| 185 | &mov (&LB($rem),&LB($Zll)); |
| 186 | &shrd ($Zll,$Zlh,4); |
| 187 | &and (&LB($rem),0xf); |
| 188 | &shrd ($Zlh,$Zhl,4); |
| 189 | &shrd ($Zhl,$Zhh,4); |
| 190 | &shr ($Zhh,4); |
| 191 | &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); |
| 192 | |
| 193 | if ($i&1) { |
| 194 | &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); |
| 195 | &and (&LB($rem),0xf0); |
| 196 | } else { |
| 197 | &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); |
| 198 | &shl (&LB($rem),4); |
| 199 | } |
| 200 | |
| 201 | &xor ($Zll,&DWP(8,$Htbl,$rem)); |
| 202 | &xor ($Zlh,&DWP(12,$Htbl,$rem)); |
| 203 | &xor ($Zhl,&DWP(0,$Htbl,$rem)); |
| 204 | &xor ($Zhh,&DWP(4,$Htbl,$rem)); |
| 205 | } |
| 206 | } |
| 207 | &bswap ($Zll); |
| 208 | &bswap ($Zlh); |
| 209 | &bswap ($Zhl); |
| 210 | if (!$x86only) { |
| 211 | &bswap ($Zhh); |
| 212 | } else { |
| 213 | &mov ("eax",$Zhh); |
| 214 | &bswap ("eax"); |
| 215 | &mov ($Zhh,"eax"); |
| 216 | } |
| 217 | } |
| 218 | |
| 219 | if ($unroll) { |
| 220 | &function_begin_B("_x86_gmult_4bit_inner"); |
| 221 | &x86_loop(4); |
| 222 | &ret (); |
| 223 | &function_end_B("_x86_gmult_4bit_inner"); |
| 224 | } |
| 225 | |
| 226 | sub deposit_rem_4bit { |
| 227 | my $bias = shift; |
| 228 | |
| 229 | &mov (&DWP($bias+0, "esp"),0x0000<<16); |
| 230 | &mov (&DWP($bias+4, "esp"),0x1C20<<16); |
| 231 | &mov (&DWP($bias+8, "esp"),0x3840<<16); |
| 232 | &mov (&DWP($bias+12,"esp"),0x2460<<16); |
| 233 | &mov (&DWP($bias+16,"esp"),0x7080<<16); |
| 234 | &mov (&DWP($bias+20,"esp"),0x6CA0<<16); |
| 235 | &mov (&DWP($bias+24,"esp"),0x48C0<<16); |
| 236 | &mov (&DWP($bias+28,"esp"),0x54E0<<16); |
| 237 | &mov (&DWP($bias+32,"esp"),0xE100<<16); |
| 238 | &mov (&DWP($bias+36,"esp"),0xFD20<<16); |
| 239 | &mov (&DWP($bias+40,"esp"),0xD940<<16); |
| 240 | &mov (&DWP($bias+44,"esp"),0xC560<<16); |
| 241 | &mov (&DWP($bias+48,"esp"),0x9180<<16); |
| 242 | &mov (&DWP($bias+52,"esp"),0x8DA0<<16); |
| 243 | &mov (&DWP($bias+56,"esp"),0xA9C0<<16); |
| 244 | &mov (&DWP($bias+60,"esp"),0xB5E0<<16); |
| 245 | } |
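| | |
| | # The constants deposited above appear to follow a single rule, |
| | # stated here as an observation rather than as the documented |
| | # derivation: entry i (stored at $bias+4*i) is the XOR of 0x1C20<<j |
| | # over every set bit j of i. A few entries worked by hand: |
| | # |
| | # i=1: 0x1C20 |
| | # i=2: 0x1C20<<1 = 0x3840 |
| | # i=3: 0x1C20^0x3840 = 0x2460 |
| | # i=4: 0x1C20<<2 = 0x7080 |
| | # i=8: 0x1C20<<3 = 0xE100 |
| | # i=15: 0x1C20^0x3840^0x7080^0xE100 = 0xB5E0 |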
| 246 | |
| 247 | $suffix = $x86only ? "" : "_x86"; |
| 248 | |
| 249 | &function_begin("gcm_gmult_4bit".$suffix); |
| 250 | &stack_push(16+4+1); # +1 for stack alignment |
| 251 | &mov ($inp,&wparam(0)); # load Xi |
| 252 | &mov ($Htbl,&wparam(1)); # load Htable |
| 253 | |
| 254 | &mov ($Zhh,&DWP(0,$inp)); # load Xi[16] |
| 255 | &mov ($Zhl,&DWP(4,$inp)); |
| 256 | &mov ($Zlh,&DWP(8,$inp)); |
| 257 | &mov ($Zll,&DWP(12,$inp)); |
| 258 | |
| 259 | &deposit_rem_4bit(16); |
| 260 | |
| 261 | &mov (&DWP(0,"esp"),$Zhh); # copy Xi[16] on stack |
| 262 | &mov (&DWP(4,"esp"),$Zhl); |
| 263 | &mov (&DWP(8,"esp"),$Zlh); |
| 264 | &mov (&DWP(12,"esp"),$Zll); |
| 265 | &shr ($Zll,20); |
| 266 | &and ($Zll,0xf0); |
| 267 | |
| 268 | if ($unroll) { |
| 269 | &call ("_x86_gmult_4bit_inner"); |
| 270 | } else { |
| 271 | &x86_loop(0); |
| 272 | &mov ($inp,&wparam(0)); |
| 273 | } |
| 274 | |
| 275 | &mov (&DWP(12,$inp),$Zll); |
| 276 | &mov (&DWP(8,$inp),$Zlh); |
| 277 | &mov (&DWP(4,$inp),$Zhl); |
| 278 | &mov (&DWP(0,$inp),$Zhh); |
| 279 | &stack_pop(16+4+1); |
| 280 | &function_end("gcm_gmult_4bit".$suffix); |
| 281 | |
| 282 | &function_begin("gcm_ghash_4bit".$suffix); |
| 283 | &stack_push(16+4+1); # +1 for 64-bit alignment |
| 284 | &mov ($Zll,&wparam(0)); # load Xi |
| 285 | &mov ($Htbl,&wparam(1)); # load Htable |
| 286 | &mov ($inp,&wparam(2)); # load in |
| 287 | &mov ("ecx",&wparam(3)); # load len |
| 288 | &add ("ecx",$inp); |
| 289 | &mov (&wparam(3),"ecx"); |
| 290 | |
| 291 | &mov ($Zhh,&DWP(0,$Zll)); # load Xi[16] |
| 292 | &mov ($Zhl,&DWP(4,$Zll)); |
| 293 | &mov ($Zlh,&DWP(8,$Zll)); |
| 294 | &mov ($Zll,&DWP(12,$Zll)); |
| 295 | |
| 296 | &deposit_rem_4bit(16); |
| 297 | |
| 298 | &set_label("x86_outer_loop",16); |
| 299 | &xor ($Zll,&DWP(12,$inp)); # xor with input |
| 300 | &xor ($Zlh,&DWP(8,$inp)); |
| 301 | &xor ($Zhl,&DWP(4,$inp)); |
| 302 | &xor ($Zhh,&DWP(0,$inp)); |
| 303 | &mov (&DWP(12,"esp"),$Zll); # dump it on stack |
| 304 | &mov (&DWP(8,"esp"),$Zlh); |
| 305 | &mov (&DWP(4,"esp"),$Zhl); |
| 306 | &mov (&DWP(0,"esp"),$Zhh); |
| 307 | |
| 308 | &shr ($Zll,20); |
| 309 | &and ($Zll,0xf0); |
| 310 | |
| 311 | if ($unroll) { |
| 312 | &call ("_x86_gmult_4bit_inner"); |
| 313 | } else { |
| 314 | &x86_loop(0); |
| 315 | &mov ($inp,&wparam(2)); |
| 316 | } |
| 317 | &lea ($inp,&DWP(16,$inp)); |
| 318 | &cmp ($inp,&wparam(3)); |
| 319 | &mov (&wparam(2),$inp) if (!$unroll); |
| 320 | &jb (&label("x86_outer_loop")); |
| 321 | |
| 322 | &mov ($inp,&wparam(0)); # load Xi |
| 323 | &mov (&DWP(12,$inp),$Zll); |
| 324 | &mov (&DWP(8,$inp),$Zlh); |
| 325 | &mov (&DWP(4,$inp),$Zhl); |
| 326 | &mov (&DWP(0,$inp),$Zhh); |
| 327 | &stack_pop(16+4+1); |
| 328 | &function_end("gcm_ghash_4bit".$suffix); |
| 329 | |
| 330 | if (!$x86only) {{{ |
| 331 | |
| 332 | &static_label("rem_4bit"); |
| 333 | |
| 334 | if (!$sse2) {{ # pure-MMX "May" version... |
| 335 | |
| 336 | $S=12; # shift factor for rem_4bit |
| 337 | |
| 338 | &function_begin_B("_mmx_gmult_4bit_inner"); |
| 339 | # MMX version performs 3.5 times better on P4 (see comment in non-MMX |
| 340 | # routine for further details), 100% better on Opteron, ~70% better |
| 341 | # on Core2 and PIII... In other words effort is considered to be well |
| 342 | # spent... Since initial release the loop was unrolled in order to |
| 343 | # "liberate" register previously used as loop counter. Instead it's |
| 344 | # used to optimize critical path in 'Z.hi ^= rem_4bit[Z.lo&0xf]'. |
| 345 | # The path involves move of Z.lo from MMX to integer register, |
| 346 | # effective address calculation and finally merge of value to Z.hi. |
| 347 | # Reference to rem_4bit is scheduled so late that I had to >>4 |
| 348 | # rem_4bit elements. This resulted in 20-45% improvement |
| 349 | # on contemporary µ-archs. |
| 350 | { |
| 351 | my $cnt; |
| 352 | my $rem_4bit = "eax"; |
| 353 | my @rem = ($Zhh,$Zll); |
| 354 | my $nhi = $Zhl; |
| 355 | my $nlo = $Zlh; |
| 356 | |
| 357 | my ($Zlo,$Zhi) = ("mm0","mm1"); |
| 358 | my $tmp = "mm2"; |
| 359 | |
| 360 | &xor ($nlo,$nlo); # avoid partial register stalls on PIII |
| 361 | &mov ($nhi,$Zll); |
| 362 | &mov (&LB($nlo),&LB($nhi)); |
| 363 | &shl (&LB($nlo),4); |
| 364 | &and ($nhi,0xf0); |
| 365 | &movq ($Zlo,&QWP(8,$Htbl,$nlo)); |
| 366 | &movq ($Zhi,&QWP(0,$Htbl,$nlo)); |
| 367 | &movd ($rem[0],$Zlo); |
| 368 | |
| 369 | for ($cnt=28;$cnt>=-2;$cnt--) { |
| 370 | my $odd = $cnt&1; |
| 371 | my $nix = $odd ? $nlo : $nhi; |
| 372 | |
| 373 | &shl (&LB($nlo),4) if ($odd); |
| 374 | &psrlq ($Zlo,4); |
| 375 | &movq ($tmp,$Zhi); |
| 376 | &psrlq ($Zhi,4); |
| 377 | &pxor ($Zlo,&QWP(8,$Htbl,$nix)); |
| 378 | &mov (&LB($nlo),&BP($cnt/2,$inp)) if (!$odd && $cnt>=0); |
| 379 | &psllq ($tmp,60); |
| 380 | &and ($nhi,0xf0) if ($odd); |
| 381 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem[1],8)) if ($cnt<28); |
| 382 | &and ($rem[0],0xf); |
| 383 | &pxor ($Zhi,&QWP(0,$Htbl,$nix)); |
| 384 | &mov ($nhi,$nlo) if (!$odd && $cnt>=0); |
| 385 | &movd ($rem[1],$Zlo); |
| 386 | &pxor ($Zlo,$tmp); |
| 387 | |
| 388 | push (@rem,shift(@rem)); # "rotate" registers |
| 389 | } |
| 390 | |
| 391 | &mov ($inp,&DWP(4,$rem_4bit,$rem[1],8)); # last rem_4bit[rem] |
| 392 | |
| 393 | &psrlq ($Zlo,32); # lower part of Zlo is already there |
| 394 | &movd ($Zhl,$Zhi); |
| 395 | &psrlq ($Zhi,32); |
| 396 | &movd ($Zlh,$Zlo); |
| 397 | &movd ($Zhh,$Zhi); |
| 398 | &shl ($inp,4); # compensate for rem_4bit[i] being >>4 |
| 399 | |
| 400 | &bswap ($Zll); |
| 401 | &bswap ($Zhl); |
| 402 | &bswap ($Zlh); |
| 403 | &xor ($Zhh,$inp); |
| 404 | &bswap ($Zhh); |
| 405 | |
| 406 | &ret (); |
| 407 | } |
| 408 | &function_end_B("_mmx_gmult_4bit_inner"); |
| 409 | |
| 410 | &function_begin("gcm_gmult_4bit_mmx"); |
| 411 | &mov ($inp,&wparam(0)); # load Xi |
| 412 | &mov ($Htbl,&wparam(1)); # load Htable |
| 413 | |
| 414 | &call (&label("pic_point")); |
| 415 | &set_label("pic_point"); |
| 416 | &blindpop("eax"); |
| 417 | &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); |
| 418 | |
| 419 | &movz ($Zll,&BP(15,$inp)); |
| 420 | |
| 421 | &call ("_mmx_gmult_4bit_inner"); |
| 422 | |
| 423 | &mov ($inp,&wparam(0)); # load Xi |
| 424 | &emms (); |
| 425 | &mov (&DWP(12,$inp),$Zll); |
| 426 | &mov (&DWP(4,$inp),$Zhl); |
| 427 | &mov (&DWP(8,$inp),$Zlh); |
| 428 | &mov (&DWP(0,$inp),$Zhh); |
| 429 | &function_end("gcm_gmult_4bit_mmx"); |
| 430 | |
| 431 | # Streamed version performs 20% better on P4, 7% on Opteron, |
| 432 | # 10% on Core2 and PIII... |
| 433 | &function_begin("gcm_ghash_4bit_mmx"); |
| 434 | &mov ($Zhh,&wparam(0)); # load Xi |
| 435 | &mov ($Htbl,&wparam(1)); # load Htable |
| 436 | &mov ($inp,&wparam(2)); # load in |
| 437 | &mov ($Zlh,&wparam(3)); # load len |
| 438 | |
| 439 | &call (&label("pic_point")); |
| 440 | &set_label("pic_point"); |
| 441 | &blindpop("eax"); |
| 442 | &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); |
| 443 | |
| 444 | &add ($Zlh,$inp); |
| 445 | &mov (&wparam(3),$Zlh); # len to point at the end of input |
| 446 | &stack_push(4+1); # +1 for stack alignment |
| 447 | |
| 448 | &mov ($Zll,&DWP(12,$Zhh)); # load Xi[16] |
| 449 | &mov ($Zhl,&DWP(4,$Zhh)); |
| 450 | &mov ($Zlh,&DWP(8,$Zhh)); |
| 451 | &mov ($Zhh,&DWP(0,$Zhh)); |
| 452 | &jmp (&label("mmx_outer_loop")); |
| 453 | |
| 454 | &set_label("mmx_outer_loop",16); |
| 455 | &xor ($Zll,&DWP(12,$inp)); |
| 456 | &xor ($Zhl,&DWP(4,$inp)); |
| 457 | &xor ($Zlh,&DWP(8,$inp)); |
| 458 | &xor ($Zhh,&DWP(0,$inp)); |
| 459 | &mov (&wparam(2),$inp); |
| 460 | &mov (&DWP(12,"esp"),$Zll); |
| 461 | &mov (&DWP(4,"esp"),$Zhl); |
| 462 | &mov (&DWP(8,"esp"),$Zlh); |
| 463 | &mov (&DWP(0,"esp"),$Zhh); |
| 464 | |
| 465 | &mov ($inp,"esp"); |
| 466 | &shr ($Zll,24); |
| 467 | |
| 468 | &call ("_mmx_gmult_4bit_inner"); |
| 469 | |
| 470 | &mov ($inp,&wparam(2)); |
| 471 | &lea ($inp,&DWP(16,$inp)); |
| 472 | &cmp ($inp,&wparam(3)); |
| 473 | &jb (&label("mmx_outer_loop")); |
| 474 | |
| 475 | &mov ($inp,&wparam(0)); # load Xi |
| 476 | &emms (); |
| 477 | &mov (&DWP(12,$inp),$Zll); |
| 478 | &mov (&DWP(4,$inp),$Zhl); |
| 479 | &mov (&DWP(8,$inp),$Zlh); |
| 480 | &mov (&DWP(0,$inp),$Zhh); |
| 481 | |
| 482 | &stack_pop(4+1); |
| 483 | &function_end("gcm_ghash_4bit_mmx"); |
| 484 | |
| 485 | }} else {{ # "June" MMX version... |
| 486 | # ... has slower "April" gcm_gmult_4bit_mmx with folded |
| 487 | # loop. This is done to conserve code size... |
| 488 | $S=16; # shift factor for rem_4bit |
| 489 | |
| 490 | sub mmx_loop() { |
| 491 | # MMX version performs 2.8 times better on P4 (see comment in non-MMX |
| 492 | # routine for further details), 40% better on Opteron and Core2, 50% |
| 493 | # better on PIII... In other words effort is considered to be well |
| 494 | # spent... |
| 495 | my $inp = shift; |
| 496 | my $rem_4bit = shift; |
| 497 | my $cnt = $Zhh; |
| 498 | my $nhi = $Zhl; |
| 499 | my $nlo = $Zlh; |
| 500 | my $rem = $Zll; |
| 501 | |
| 502 | my ($Zlo,$Zhi) = ("mm0","mm1"); |
| 503 | my $tmp = "mm2"; |
| 504 | |
| 505 | &xor ($nlo,$nlo); # avoid partial register stalls on PIII |
| 506 | &mov ($nhi,$Zll); |
| 507 | &mov (&LB($nlo),&LB($nhi)); |
| 508 | &mov ($cnt,14); |
| 509 | &shl (&LB($nlo),4); |
| 510 | &and ($nhi,0xf0); |
| 511 | &movq ($Zlo,&QWP(8,$Htbl,$nlo)); |
| 512 | &movq ($Zhi,&QWP(0,$Htbl,$nlo)); |
| 513 | &movd ($rem,$Zlo); |
| 514 | &jmp (&label("mmx_loop")); |
| 515 | |
| 516 | &set_label("mmx_loop",16); |
| 517 | &psrlq ($Zlo,4); |
| 518 | &and ($rem,0xf); |
| 519 | &movq ($tmp,$Zhi); |
| 520 | &psrlq ($Zhi,4); |
| 521 | &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); |
| 522 | &mov (&LB($nlo),&BP(0,$inp,$cnt)); |
| 523 | &psllq ($tmp,60); |
| 524 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); |
| 525 | &dec ($cnt); |
| 526 | &movd ($rem,$Zlo); |
| 527 | &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); |
| 528 | &mov ($nhi,$nlo); |
| 529 | &pxor ($Zlo,$tmp); |
| 530 | &js (&label("mmx_break")); |
| 531 | |
| 532 | &shl (&LB($nlo),4); |
| 533 | &and ($rem,0xf); |
| 534 | &psrlq ($Zlo,4); |
| 535 | &and ($nhi,0xf0); |
| 536 | &movq ($tmp,$Zhi); |
| 537 | &psrlq ($Zhi,4); |
| 538 | &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); |
| 539 | &psllq ($tmp,60); |
| 540 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); |
| 541 | &movd ($rem,$Zlo); |
| 542 | &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); |
| 543 | &pxor ($Zlo,$tmp); |
| 544 | &jmp (&label("mmx_loop")); |
| 545 | |
| 546 | &set_label("mmx_break",16); |
| 547 | &shl (&LB($nlo),4); |
| 548 | &and ($rem,0xf); |
| 549 | &psrlq ($Zlo,4); |
| 550 | &and ($nhi,0xf0); |
| 551 | &movq ($tmp,$Zhi); |
| 552 | &psrlq ($Zhi,4); |
| 553 | &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); |
| 554 | &psllq ($tmp,60); |
| 555 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); |
| 556 | &movd ($rem,$Zlo); |
| 557 | &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); |
| 558 | &pxor ($Zlo,$tmp); |
| 559 | |
| 560 | &psrlq ($Zlo,4); |
| 561 | &and ($rem,0xf); |
| 562 | &movq ($tmp,$Zhi); |
| 563 | &psrlq ($Zhi,4); |
| 564 | &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); |
| 565 | &psllq ($tmp,60); |
| 566 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); |
| 567 | &movd ($rem,$Zlo); |
| 568 | &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); |
| 569 | &pxor ($Zlo,$tmp); |
| 570 | |
| 571 | &psrlq ($Zlo,32); # lower part of Zlo is already there |
| 572 | &movd ($Zhl,$Zhi); |
| 573 | &psrlq ($Zhi,32); |
| 574 | &movd ($Zlh,$Zlo); |
| 575 | &movd ($Zhh,$Zhi); |
| 576 | |
| 577 | &bswap ($Zll); |
| 578 | &bswap ($Zhl); |
| 579 | &bswap ($Zlh); |
| 580 | &bswap ($Zhh); |
| 581 | } |
| 582 | |
| 583 | &function_begin("gcm_gmult_4bit_mmx"); |
| 584 | &mov ($inp,&wparam(0)); # load Xi |
| 585 | &mov ($Htbl,&wparam(1)); # load Htable |
| 586 | |
| 587 | &call (&label("pic_point")); |
| 588 | &set_label("pic_point"); |
| 589 | &blindpop("eax"); |
| 590 | &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); |
| 591 | |
| 592 | &movz ($Zll,&BP(15,$inp)); |
| 593 | |
| 594 | &mmx_loop($inp,"eax"); |
| 595 | |
| 596 | &emms (); |
| 597 | &mov (&DWP(12,$inp),$Zll); |
| 598 | &mov (&DWP(4,$inp),$Zhl); |
| 599 | &mov (&DWP(8,$inp),$Zlh); |
| 600 | &mov (&DWP(0,$inp),$Zhh); |
| 601 | &function_end("gcm_gmult_4bit_mmx"); |
| 602 | |
| 603 | ###################################################################### |
| 604 | # Below subroutine is "528B" variant of "4-bit" GCM GHASH function |
| 605 | # (see gcm128.c for details). It provides further 20-40% performance |
| 606 | # improvement over above mentioned "May" version. |
| 607 | |
| 608 | &static_label("rem_8bit"); |
| 609 | |
| 610 | &function_begin("gcm_ghash_4bit_mmx"); |
| 611 | { my ($Zlo,$Zhi) = ("mm7","mm6"); |
| 612 | my $rem_8bit = "esi"; |
| 613 | my $Htbl = "ebx"; |
| 614 | |
| 615 | # parameter block |
| 616 | &mov ("eax",&wparam(0)); # Xi |
| 617 | &mov ("ebx",&wparam(1)); # Htable |
| 618 | &mov ("ecx",&wparam(2)); # inp |
| 619 | &mov ("edx",&wparam(3)); # len |
| 620 | &mov ("ebp","esp"); # original %esp |
| 621 | &call (&label("pic_point")); |
| 622 | &set_label ("pic_point"); |
| 623 | &blindpop ($rem_8bit); |
| 624 | &lea ($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit)); |
| 625 | |
| 626 | &sub ("esp",512+16+16); # allocate stack frame... |
| 627 | &and ("esp",-64); # ...and align it |
| 628 | &sub ("esp",16); # place for (u8)(H[]<<4) |
| 629 | |
| 630 | &add ("edx","ecx"); # pointer to the end of input |
| 631 | &mov (&DWP(528+16+0,"esp"),"eax"); # save Xi |
| 632 | &mov (&DWP(528+16+8,"esp"),"edx"); # save inp+len |
| 633 | &mov (&DWP(528+16+12,"esp"),"ebp"); # save original %esp |
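| | |
| | # Resulting frame, as offsets from the final %esp. This layout is |
| | # inferred from the loads and stores in this routine, not quoted |
| | # from gcm128.c; sizes are in bytes: |
| | # |
| | # 0..15 (u8)(Htable[i]<<4), one byte per entry |
| | # 16..143 Htable[i] low halves, 16 x 8 |
| | # 144..271 Htable[i] high halves, 16 x 8 |
| | # 272..399 Htable[i]>>4 low halves, 16 x 8 |
| | # 400..527 Htable[i]>>4 high halves, 16 x 8 |
| | # 528..543 inp^Xi scratch for the current block |
| | # 544 saved Xi pointer |
| | # 548 saved inp |
| | # 552 saved inp+len |
| | # 556 saved original %esp |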
| 634 | |
| 635 | { my @lo = ("mm0","mm1","mm2"); |
| 636 | my @hi = ("mm3","mm4","mm5"); |
| 637 | my @tmp = ("mm6","mm7"); |
| 638 | my ($off1,$off2,$i)=(0,0); |
| 639 | |
| 640 | &add ($Htbl,128); # optimize for size |
| 641 | &lea ("edi",&DWP(16+128,"esp")); |
| 642 | &lea ("ebp",&DWP(16+256+128,"esp")); |
| 643 | |
| 644 | # decompose Htable (low and high parts are kept separately), |
| 645 | # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack... |
| 646 | for ($i=0;$i<18;$i++) { |
| 647 | |
| 648 | &mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16); |
| 649 | &movq ($lo[0],&QWP(16*$i+8-128,$Htbl)) if ($i<16); |
| 650 | &psllq ($tmp[1],60) if ($i>1); |
| 651 | &movq ($hi[0],&QWP(16*$i+0-128,$Htbl)) if ($i<16); |
| 652 | &por ($lo[2],$tmp[1]) if ($i>1); |
| 653 | &movq (&QWP($off1-128,"edi"),$lo[1]) if ($i>0 && $i<17); |
| 654 | &psrlq ($lo[1],4) if ($i>0 && $i<17); |
| 655 | &movq (&QWP($off1,"edi"),$hi[1]) if ($i>0 && $i<17); |
| 656 | &movq ($tmp[0],$hi[1]) if ($i>0 && $i<17); |
| 657 | &movq (&QWP($off2-128,"ebp"),$lo[2]) if ($i>1); |
| 658 | &psrlq ($hi[1],4) if ($i>0 && $i<17); |
| 659 | &movq (&QWP($off2,"ebp"),$hi[2]) if ($i>1); |
| 660 | &shl ("edx",4) if ($i<16); |
| 661 | &mov (&BP($i,"esp"),&LB("edx")) if ($i<16); |
| 662 | |
| 663 | unshift (@lo,pop(@lo)); # "rotate" registers |
| 664 | unshift (@hi,pop(@hi)); |
| 665 | unshift (@tmp,pop(@tmp)); |
| 666 | $off1 += 8 if ($i>0); |
| 667 | $off2 += 8 if ($i>1); |
| 668 | } |
| 669 | } |
| 670 | |
| 671 | &movq ($Zhi,&QWP(0,"eax")); |
| 672 | &mov ("ebx",&DWP(8,"eax")); |
| 673 | &mov ("edx",&DWP(12,"eax")); # load Xi |
| 674 | |
| 675 | &set_label("outer",16); |
| 676 | { my $nlo = "eax"; |
| 677 | my $dat = "edx"; |
| 678 | my @nhi = ("edi","ebp"); |
| 679 | my @rem = ("ebx","ecx"); |
| 680 | my @red = ("mm0","mm1","mm2"); |
| 681 | my $tmp = "mm3"; |
| 682 | |
| 683 | &xor ($dat,&DWP(12,"ecx")); # merge input data |
| 684 | &xor ("ebx",&DWP(8,"ecx")); |
| 685 | &pxor ($Zhi,&QWP(0,"ecx")); |
| 686 | &lea ("ecx",&DWP(16,"ecx")); # inp+=16 |
| 687 | #&mov (&DWP(528+12,"esp"),$dat); # save inp^Xi |
| 688 | &mov (&DWP(528+8,"esp"),"ebx"); |
| 689 | &movq (&QWP(528+0,"esp"),$Zhi); |
| 690 | &mov (&DWP(528+16+4,"esp"),"ecx"); # save inp |
| 691 | |
| 692 | &xor ($nlo,$nlo); |
| 693 | &rol ($dat,8); |
| 694 | &mov (&LB($nlo),&LB($dat)); |
| 695 | &mov ($nhi[1],$nlo); |
| 696 | &and (&LB($nlo),0x0f); |
| 697 | &shr ($nhi[1],4); |
| 698 | &pxor ($red[0],$red[0]); |
| 699 | &rol ($dat,8); # next byte |
| 700 | &pxor ($red[1],$red[1]); |
| 701 | &pxor ($red[2],$red[2]); |
| 702 | |
| 703 | # Just like in "May" version modulo-schedule for critical path in |
| 704 | # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor' |
| 705 | # is scheduled so late that rem_8bit[] has to be shifted *right* |
| 706 | # by 16, which is why last argument to pinsrw is 2, which |
| 707 | # corresponds to <<32=<<48>>16... |
| 708 | for ($j=11,$i=0;$i<15;$i++) { |
| 709 | |
| 710 | if ($i>0) { |
| 711 | &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] |
| 712 | &rol ($dat,8); # next byte |
| 713 | &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); |
| 714 | |
| 715 | &pxor ($Zlo,$tmp); |
| 716 | &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); |
| 717 | &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) |
| 718 | } else { |
| 719 | &movq ($Zlo,&QWP(16,"esp",$nlo,8)); |
| 720 | &movq ($Zhi,&QWP(16+128,"esp",$nlo,8)); |
| 721 | } |
| 722 | |
| 723 | &mov (&LB($nlo),&LB($dat)); |
| 724 | &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0); |
| 725 | |
| 726 | &movd ($rem[0],$Zlo); |
| 727 | &movz ($rem[1],&LB($rem[1])) if ($i>0); |
| 728 | &psrlq ($Zlo,8); # Z>>=8 |
| 729 | |
| 730 | &movq ($tmp,$Zhi); |
| 731 | &mov ($nhi[0],$nlo); |
| 732 | &psrlq ($Zhi,8); |
| 733 | |
| 734 | &pxor ($Zlo,&QWP(16+256+0,"esp",$nhi[1],8)); # Z^=H[nhi]>>4 |
| 735 | &and (&LB($nlo),0x0f); |
| 736 | &psllq ($tmp,56); |
| 737 | |
| 738 | &pxor ($Zhi,$red[1]) if ($i>1); |
| 739 | &shr ($nhi[0],4); |
| 740 | &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2) if ($i>0); |
| 741 | |
| 742 | unshift (@red,pop(@red)); # "rotate" registers |
| 743 | unshift (@rem,pop(@rem)); |
| 744 | unshift (@nhi,pop(@nhi)); |
| 745 | } |
| 746 | |
| 747 | &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] |
| 748 | &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); |
| 749 | &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) |
| 750 | |
| 751 | &pxor ($Zlo,$tmp); |
| 752 | &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); |
| 753 | &movz ($rem[1],&LB($rem[1])); |
| 754 | |
| 755 | &pxor ($red[2],$red[2]); # clear 2nd word |
| 756 | &psllq ($red[1],4); |
| 757 | |
| 758 | &movd ($rem[0],$Zlo); |
| 759 | &psrlq ($Zlo,4); # Z>>=4 |
| 760 | |
| 761 | &movq ($tmp,$Zhi); |
| 762 | &psrlq ($Zhi,4); |
| 763 | &shl ($rem[0],4); # rem<<4 |
| 764 | |
| 765 | &pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi] |
| 766 | &psllq ($tmp,60); |
| 767 | &movz ($rem[0],&LB($rem[0])); |
| 768 | |
| 769 | &pxor ($Zlo,$tmp); |
| 770 | &pxor ($Zhi,&QWP(16+128,"esp",$nhi[1],8)); |
| 771 | |
| 772 | &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2); |
| 773 | &pxor ($Zhi,$red[1]); |
| 774 | |
| 775 | &movd ($dat,$Zlo); |
| 776 | &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48 |
| 777 | |
| 778 | &psllq ($red[0],12); # correct by <<16>>4 |
| 779 | &pxor ($Zhi,$red[0]); |
| 780 | &psrlq ($Zlo,32); |
| 781 | &pxor ($Zhi,$red[2]); |
| 782 | |
| 783 | &mov ("ecx",&DWP(528+16+4,"esp")); # restore inp |
| 784 | &movd ("ebx",$Zlo); |
| 785 | &movq ($tmp,$Zhi); # 01234567 |
| 786 | &psllw ($Zhi,8); # 1.3.5.7. |
| 787 | &psrlw ($tmp,8); # .0.2.4.6 |
| 788 | &por ($Zhi,$tmp); # 10325476 |
| 789 | &bswap ($dat); |
| 790 | &pshufw ($Zhi,$Zhi,0b00011011); # 76543210 |
| 791 | &bswap ("ebx"); |
| 792 | |
| 793 | &cmp ("ecx",&DWP(528+16+8,"esp")); # are we done? |
| 794 | &jne (&label("outer")); |
| 795 | } |
| 796 | |
| 797 | &mov ("eax",&DWP(528+16+0,"esp")); # restore Xi |
| 798 | &mov (&DWP(12,"eax"),"edx"); |
| 799 | &mov (&DWP(8,"eax"),"ebx"); |
| 800 | &movq (&QWP(0,"eax"),$Zhi); |
| 801 | |
| 802 | &mov ("esp",&DWP(528+16+12,"esp")); # restore original %esp |
| 803 | &emms (); |
| 804 | } |
| 805 | &function_end("gcm_ghash_4bit_mmx"); |
| 806 | }} |
| 807 | |
| 808 | if ($sse2) {{ |
| 809 | ###################################################################### |
| 810 | # PCLMULQDQ version. |
| 811 | |
| 812 | $Xip="eax"; |
| 813 | $Htbl="edx"; |
| 814 | $const="ecx"; |
| 815 | $inp="esi"; |
| 816 | $len="ebx"; |
| 817 | |
| 818 | ($Xi,$Xhi)=("xmm0","xmm1"); $Hkey="xmm2"; |
| 819 | ($T1,$T2,$T3)=("xmm3","xmm4","xmm5"); |
| 820 | ($Xn,$Xhn)=("xmm6","xmm7"); |
| 821 | |
| 822 | &static_label("bswap"); |
| 823 | |
| 824 | sub clmul64x64_T2 { # minimal "register" pressure |
| 825 | my ($Xhi,$Xi,$Hkey)=@_; |
| 826 | |
| 827 | &movdqa ($Xhi,$Xi); # |
| 828 | &pshufd ($T1,$Xi,0b01001110); |
| 829 | &pshufd ($T2,$Hkey,0b01001110); |
| 830 | &pxor ($T1,$Xi); # |
| 831 | &pxor ($T2,$Hkey); |
| 832 | |
| 833 | &pclmulqdq ($Xi,$Hkey,0x00); ####### |
| 834 | &pclmulqdq ($Xhi,$Hkey,0x11); ####### |
| 835 | &pclmulqdq ($T1,$T2,0x00); ####### |
| 836 | &xorps ($T1,$Xi); # |
| 837 | &xorps ($T1,$Xhi); # |
| 838 | |
| 839 | &movdqa ($T2,$T1); # |
| 840 | &psrldq ($T1,8); |
| 841 | &pslldq ($T2,8); # |
| 842 | &pxor ($Xhi,$T1); |
| 843 | &pxor ($Xi,$T2); # |
| 844 | } |
| 845 | |
| 846 | sub clmul64x64_T3 { |
| 847 | # Even though this subroutine offers visually better ILP, it |
| 848 | # was empirically found to be a tad slower than above version. |
| 849 | # At least in gcm_ghash_clmul context. But it's just as well, |
| 850 | # because loop modulo-scheduling is possible only thanks to |
| 851 | # minimized "register" pressure... |
| 852 | my ($Xhi,$Xi,$Hkey)=@_; |
| 853 | |
| 854 | &movdqa ($T1,$Xi); # |
| 855 | &movdqa ($Xhi,$Xi); |
| 856 | &pclmulqdq ($Xi,$Hkey,0x00); ####### |
| 857 | &pclmulqdq ($Xhi,$Hkey,0x11); ####### |
| 858 | &pshufd ($T2,$T1,0b01001110); # |
| 859 | &pshufd ($T3,$Hkey,0b01001110); |
| 860 | &pxor ($T2,$T1); # |
| 861 | &pxor ($T3,$Hkey); |
| 862 | &pclmulqdq ($T2,$T3,0x00); ####### |
| 863 | &pxor ($T2,$Xi); # |
| 864 | &pxor ($T2,$Xhi); # |
| 865 | |
| 866 | &movdqa ($T3,$T2); # |
| 867 | &psrldq ($T2,8); |
| 868 | &pslldq ($T3,8); # |
| 869 | &pxor ($Xhi,$T2); |
| 870 | &pxor ($Xi,$T3); # |
| 871 | } |
| 872 | |
| 873 | if (1) { # Algorithm 9 with <<1 twist. |
| 874 | # Reduction is shorter and uses only two |
| 875 | # temporary registers, which makes it better |
| 876 | # candidate for interleaving with 64x64 |
| 877 | # multiplication. Pre-modulo-scheduled loop |
| 878 | # was found to be ~20% faster than Algorithm 5 |
| 879 | # below. Algorithm 9 was therefore chosen for |
| 880 | # further optimization... |
| 881 | |
| 882 | sub reduction_alg9 { # 17/13 times faster than Intel version |
| 883 | my ($Xhi,$Xi) = @_; |
| 884 | |
| 885 | # 1st phase |
| 886 | &movdqa ($T1,$Xi); # |
| 887 | &psllq ($Xi,1); |
| 888 | &pxor ($Xi,$T1); # |
| 889 | &psllq ($Xi,5); # |
| 890 | &pxor ($Xi,$T1); # |
| 891 | &psllq ($Xi,57); # |
| 892 | &movdqa ($T2,$Xi); # |
| 893 | &pslldq ($Xi,8); |
| 894 | &psrldq ($T2,8); # |
| 895 | &pxor ($Xi,$T1); |
| 896 | &pxor ($Xhi,$T2); # |
| 897 | |
| 898 | # 2nd phase |
| 899 | &movdqa ($T2,$Xi); |
| 900 | &psrlq ($Xi,5); |
| 901 | &pxor ($Xi,$T2); # |
| 902 | &psrlq ($Xi,1); # |
| 903 | &pxor ($Xi,$T2); # |
| 904 | &pxor ($T2,$Xhi); |
| 905 | &psrlq ($Xi,1); # |
| 906 | &pxor ($Xi,$T2); # |
| 907 | } |
| 908 | |
| 909 | &function_begin_B("gcm_init_clmul"); |
| 910 | &mov ($Htbl,&wparam(0)); |
| 911 | &mov ($Xip,&wparam(1)); |
| 912 | |
| 913 | &call (&label("pic")); |
| 914 | &set_label("pic"); |
| 915 | &blindpop ($const); |
| 916 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); |
| 917 | |
| 918 | &movdqu ($Hkey,&QWP(0,$Xip)); |
| 919 | &pshufd ($Hkey,$Hkey,0b01001110);# dword swap |
| 920 | |
| 921 | # <<1 twist |
| 922 | &pshufd ($T2,$Hkey,0b11111111); # broadcast uppermost dword |
| 923 | &movdqa ($T1,$Hkey); |
| 924 | &psllq ($Hkey,1); |
| 925 | &pxor ($T3,$T3); # |
| 926 | &psrlq ($T1,63); |
| 927 | &pcmpgtd ($T3,$T2); # broadcast carry bit |
| 928 | &pslldq ($T1,8); |
| 929 | &por ($Hkey,$T1); # H<<=1 |
| 930 | |
| 931 | # magic reduction |
| 932 | &pand ($T3,&QWP(16,$const)); # 0x1c2_polynomial |
| 933 | &pxor ($Hkey,$T3); # if(carry) H^=0x1c2_polynomial |
| 934 | |
| 935 | # calculate H^2 |
| 936 | &movdqa ($Xi,$Hkey); |
| 937 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); |
| 938 | &reduction_alg9 ($Xhi,$Xi); |
| 939 | |
| 940 | &movdqu (&QWP(0,$Htbl),$Hkey); # save H |
| 941 | &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 |
| 942 | |
| 943 | &ret (); |
| 944 | &function_end_B("gcm_init_clmul"); |
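# Illustrative aside, not part of the module: a minimal sketch of the
# "<<1 twist" that gcm_init_clmul above applies to H, assuming the 128-bit
# value (as it stands after the initial pshufd) is modelled as a
# Math::BigInt. The helper name twist_h is hypothetical. The constant is
# the second 16-byte row stored after the "bswap" label, read here as a
# little-endian 128-bit number.

```perl
use strict;
use warnings;
use Math::BigInt;

my $MASK128 = Math::BigInt->new("0xffffffffffffffffffffffffffffffff");
my $POLY    = Math::BigInt->new("0xc2000000000000000000000000000001");

# Hypothetical helper: returns (H<<1) truncated to 128 bits, XORed with
# the polynomial constant when bit 127 of H was set before the shift.
sub twist_h {
    my ($h) = @_;
    my $carry   = $h->copy->brsft(127);                # non-zero if bit 127 is set
    my $twisted = $h->copy->blsft(1)->band($MASK128);  # H<<=1
    $twisted->bxor($POLY) unless $carry->is_zero;      # if(carry) H^=0x1c2_polynomial
    return $twisted;
}

print twist_h(Math::BigInt->new(1))->as_hex, "\n";
```

# With this model, an input of 2^127 comes back as the polynomial constant
# itself: the shifted value is zero, so only the conditional XOR remains.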
| 945 | |
| 946 | &function_begin_B("gcm_gmult_clmul"); |
| 947 | &mov ($Xip,&wparam(0)); |
| 948 | &mov ($Htbl,&wparam(1)); |
| 949 | |
| 950 | &call (&label("pic")); |
| 951 | &set_label("pic"); |
| 952 | &blindpop ($const); |
| 953 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); |
| 954 | |
| 955 | &movdqu ($Xi,&QWP(0,$Xip)); |
| 956 | &movdqa ($T3,&QWP(0,$const)); |
| 957 | &movups ($Hkey,&QWP(0,$Htbl)); |
| 958 | &pshufb ($Xi,$T3); |
| 959 | |
| 960 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); |
| 961 | &reduction_alg9 ($Xhi,$Xi); |
| 962 | |
| 963 | &pshufb ($Xi,$T3); |
| 964 | &movdqu (&QWP(0,$Xip),$Xi); |
| 965 | |
| 966 | &ret (); |
| 967 | &function_end_B("gcm_gmult_clmul"); |
| 968 | |
| 969 | &function_begin("gcm_ghash_clmul"); |
| 970 | &mov ($Xip,&wparam(0)); |
| 971 | &mov ($Htbl,&wparam(1)); |
| 972 | &mov ($inp,&wparam(2)); |
| 973 | &mov ($len,&wparam(3)); |
| 974 | |
| 975 | &call (&label("pic")); |
| 976 | &set_label("pic"); |
| 977 | &blindpop ($const); |
| 978 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); |
| 979 | |
| 980 | &movdqu ($Xi,&QWP(0,$Xip)); |
| 981 | &movdqa ($T3,&QWP(0,$const)); |
| 982 | &movdqu ($Hkey,&QWP(0,$Htbl)); |
| 983 | &pshufb ($Xi,$T3); |
| 984 | |
| 985 | &sub ($len,0x10); |
| 986 | &jz (&label("odd_tail")); |
| 987 | |
| 988 | ####### |
| 989 | # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = |
| 990 | # [(H*Ii+1) + (H*Xi+1)] mod P = |
| 991 | # [(H*Ii+1) + H^2*(Ii+Xi)] mod P |
| 992 | # |
| 993 | &movdqu ($T1,&QWP(0,$inp)); # Ii |
| 994 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 |
| 995 | &pshufb ($T1,$T3); |
| 996 | &pshufb ($Xn,$T3); |
| 997 | &pxor ($Xi,$T1); # Ii+Xi |
| 998 | |
| 999 | &clmul64x64_T2 ($Xhn,$Xn,$Hkey); # H*Ii+1 |
| 1000 | &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 |
| 1001 | |
| 1002 | &lea ($inp,&DWP(32,$inp)); # i+=2 |
| 1003 | &sub ($len,0x20); |
| 1004 | &jbe (&label("even_tail")); |
| 1005 | |
| 1006 | &set_label("mod_loop"); |
| 1007 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) |
| 1008 | &movdqu ($T1,&QWP(0,$inp)); # Ii |
| 1009 | &movups ($Hkey,&QWP(0,$Htbl)); # load H |
| 1010 | |
| 1011 | &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) |
| 1012 | &pxor ($Xhi,$Xhn); |
| 1013 | |
| 1014 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 |
| 1015 | &pshufb ($T1,$T3); |
| 1016 | &pshufb ($Xn,$T3); |
| 1017 | |
| 1018 | &movdqa ($T3,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1 |
| 1019 | &movdqa ($Xhn,$Xn); |
| 1020 | &pxor ($Xhi,$T1); # "Ii+Xi", consume early |
| 1021 | |
| 1022 | &movdqa ($T1,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase |
| 1023 | &psllq ($Xi,1); |
| 1024 | &pxor ($Xi,$T1); # |
| 1025 | &psllq ($Xi,5); # |
| 1026 | &pxor ($Xi,$T1); # |
| 1027 | &pclmulqdq ($Xn,$Hkey,0x00); ####### |
| 1028 | &psllq ($Xi,57); # |
| 1029 | &movdqa ($T2,$Xi); # |
| 1030 | &pslldq ($Xi,8); |
| 1031 | &psrldq ($T2,8); # |
| 1032 | &pxor ($Xi,$T1); |
| 1033 | &pshufd ($T1,$T3,0b01001110); |
| 1034 | &pxor ($Xhi,$T2); # |
| 1035 | &pxor ($T1,$T3); |
| 1036 | &pshufd ($T3,$Hkey,0b01001110); |
| 1037 | &pxor ($T3,$Hkey); # |
| 1038 | |
| 1039 | &pclmulqdq ($Xhn,$Hkey,0x11); ####### |
| 1040 | &movdqa ($T2,$Xi); # 2nd phase |
| 1041 | &psrlq ($Xi,5); |
| 1042 | &pxor ($Xi,$T2); # |
| 1043 | &psrlq ($Xi,1); # |
| 1044 | &pxor ($Xi,$T2); # |
| 1045 | &pxor ($T2,$Xhi); |
| 1046 | &psrlq ($Xi,1); # |
| 1047 | &pxor ($Xi,$T2); # |
| 1048 | |
| 1049 | &pclmulqdq ($T1,$T3,0x00); ####### |
| 1050 | &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 |
| 1051 | &xorps ($T1,$Xn); # |
| 1052 | &xorps ($T1,$Xhn); # |
| 1053 | |
| 1054 | &movdqa ($T3,$T1); # |
| 1055 | &psrldq ($T1,8); |
| 1056 | &pslldq ($T3,8); # |
| 1057 | &pxor ($Xhn,$T1); |
| 1058 | &pxor ($Xn,$T3); # |
| 1059 | &movdqa ($T3,&QWP(0,$const)); |
| 1060 | |
| 1061 | &lea ($inp,&DWP(32,$inp)); |
| 1062 | &sub ($len,0x20); |
| 1063 | &ja (&label("mod_loop")); |
| 1064 | |
| 1065 | &set_label("even_tail"); |
| 1066 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) |
| 1067 | |
| 1068 | &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) |
| 1069 | &pxor ($Xhi,$Xhn); |
| 1070 | |
| 1071 | &reduction_alg9 ($Xhi,$Xi); |
| 1072 | |
| 1073 | &test ($len,$len); |
| 1074 | &jnz (&label("done")); |
| 1075 | |
| 1076 | &movups ($Hkey,&QWP(0,$Htbl)); # load H |
| 1077 | &set_label("odd_tail"); |
| 1078 | &movdqu ($T1,&QWP(0,$inp)); # Ii |
| 1079 | &pshufb ($T1,$T3); |
| 1080 | &pxor ($Xi,$T1); # Ii+Xi |
| 1081 | |
| 1082 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) |
| 1083 | &reduction_alg9 ($Xhi,$Xi); |
| 1084 | |
| 1085 | &set_label("done"); |
| 1086 | &pshufb ($Xi,$T3); |
| 1087 | &movdqu (&QWP(0,$Xip),$Xi); |
| 1088 | &function_end("gcm_ghash_clmul"); |
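# Illustrative aside, not part of the module: the two-block identity quoted
# in the gcm_ghash_clmul comment, Xi+2 = (H*Ii+1) + H^2*(Ii+Xi), holds in
# any field of characteristic 2. The sketch below checks it in a toy field,
# GF(2^8) modulo 0x11B, which is an assumption made purely for brevity; the
# module itself works in GF(2^128) with a bit-reflected representation.
# All names (gf_mul, ghash_serial, ghash_two_block) are hypothetical.

```perl
use strict;
use warnings;

# Multiply two elements of the toy field GF(2^8), reducing by 0x11B.
sub gf_mul {
    my ($p, $q) = @_;
    my $r = 0;
    while ($q) {
        $r ^= $p if $q & 1;
        $p <<= 1;
        $p ^= 0x11B if $p & 0x100;   # reduce as soon as degree 8 appears
        $q >>= 1;
    }
    return $r;
}

# One multiplication per block: X = H*(I + X).
sub ghash_serial {
    my ($x, $h, @blocks) = @_;
    $x = gf_mul($h, $x ^ $_) for @blocks;
    return $x;
}

# Two blocks per step: X = (H*I1) + H^2*(I0 + X).
sub ghash_two_block {
    my ($x, $h, $i0, $i1) = @_;
    my $h2 = gf_mul($h, $h);
    return gf_mul($h, $i1) ^ gf_mul($h2, $i0 ^ $x);
}

printf "%02x %02x\n",
    ghash_serial(0x3c, 0x57, 0x83, 0xa5),
    ghash_two_block(0x3c, 0x57, 0x83, 0xa5);
```

# The two products in ghash_two_block do not depend on each other, which is
# what lets the loop above interleave one multiplication with the reduction
# of the previous result.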
| 1089 | |
| 1090 | } else { # Algorithm 5. Kept for reference purposes. |
| 1091 | |
| 1092 | sub reduction_alg5 { # 19/16 times faster than Intel version |
| 1093 | my ($Xhi,$Xi)=@_; |
| 1094 | |
| 1095 | # <<1 |
| 1096 | &movdqa ($T1,$Xi); # |
| 1097 | &movdqa ($T2,$Xhi); |
| 1098 | &pslld ($Xi,1); |
| 1099 | &pslld ($Xhi,1); # |
| 1100 | &psrld ($T1,31); |
| 1101 | &psrld ($T2,31); # |
| 1102 | &movdqa ($T3,$T1); |
| 1103 | &pslldq ($T1,4); |
| 1104 | &psrldq ($T3,12); # |
| 1105 | &pslldq ($T2,4); |
| 1106 | &por ($Xhi,$T3); # |
| 1107 | &por ($Xi,$T1); |
| 1108 | &por ($Xhi,$T2); # |
| 1109 | |
| 1110 | # 1st phase |
| 1111 | &movdqa ($T1,$Xi); |
| 1112 | &movdqa ($T2,$Xi); |
| 1113 | &movdqa ($T3,$Xi); # |
| 1114 | &pslld ($T1,31); |
| 1115 | &pslld ($T2,30); |
| 1116 | &pslld ($Xi,25); # |
| 1117 | &pxor ($T1,$T2); |
| 1118 | &pxor ($T1,$Xi); # |
| 1119 | &movdqa ($T2,$T1); # |
| 1120 | &pslldq ($T1,12); |
| 1121 | &psrldq ($T2,4); # |
| 1122 | &pxor ($T3,$T1); |
| 1123 | |
| 1124 | # 2nd phase |
| 1125 | &pxor ($Xhi,$T3); # |
| 1126 | &movdqa ($Xi,$T3); |
| 1127 | &movdqa ($T1,$T3); |
| 1128 | &psrld ($Xi,1); # |
| 1129 | &psrld ($T1,2); |
| 1130 | &psrld ($T3,7); # |
| 1131 | &pxor ($Xi,$T1); |
| 1132 | &pxor ($Xhi,$T2); |
| 1133 | &pxor ($Xi,$T3); # |
| 1134 | &pxor ($Xi,$Xhi); # |
| 1135 | } |
| 1136 | |
| 1137 | &function_begin_B("gcm_init_clmul"); |
| 1138 | &mov ($Htbl,&wparam(0)); |
| 1139 | &mov ($Xip,&wparam(1)); |
| 1140 | |
| 1141 | &call (&label("pic")); |
| 1142 | &set_label("pic"); |
| 1143 | &blindpop ($const); |
| 1144 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); |
| 1145 | |
| 1146 | &movdqu ($Hkey,&QWP(0,$Xip)); |
| 1147 | &pshufd ($Hkey,$Hkey,0b01001110);# dword swap |
| 1148 | |
| 1149 | # calculate H^2 |
| 1150 | &movdqa ($Xi,$Hkey); |
| 1151 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); |
| 1152 | &reduction_alg5 ($Xhi,$Xi); |
| 1153 | |
| 1154 | &movdqu (&QWP(0,$Htbl),$Hkey); # save H |
| 1155 | &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 |
| 1156 | |
| 1157 | &ret (); |
| 1158 | &function_end_B("gcm_init_clmul"); |
| 1159 | |
| 1160 | &function_begin_B("gcm_gmult_clmul"); |
| 1161 | &mov ($Xip,&wparam(0)); |
| 1162 | &mov ($Htbl,&wparam(1)); |
| 1163 | |
| 1164 | &call (&label("pic")); |
| 1165 | &set_label("pic"); |
| 1166 | &blindpop ($const); |
| 1167 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); |
| 1168 | |
| 1169 | &movdqu ($Xi,&QWP(0,$Xip)); |
| 1170 | &movdqa ($Xn,&QWP(0,$const)); |
| 1171 | &movdqu ($Hkey,&QWP(0,$Htbl)); |
| 1172 | &pshufb ($Xi,$Xn); |
| 1173 | |
| 1174 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); |
| 1175 | &reduction_alg5 ($Xhi,$Xi); |
| 1176 | |
| 1177 | &pshufb ($Xi,$Xn); |
| 1178 | &movdqu (&QWP(0,$Xip),$Xi); |
| 1179 | |
| 1180 | &ret (); |
| 1181 | &function_end_B("gcm_gmult_clmul"); |
| 1182 | |
| 1183 | &function_begin("gcm_ghash_clmul"); |
| 1184 | &mov ($Xip,&wparam(0)); |
| 1185 | &mov ($Htbl,&wparam(1)); |
| 1186 | &mov ($inp,&wparam(2)); |
| 1187 | &mov ($len,&wparam(3)); |
| 1188 | |
| 1189 | &call (&label("pic")); |
| 1190 | &set_label("pic"); |
| 1191 | &blindpop ($const); |
| 1192 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); |
| 1193 | |
| 1194 | &movdqu ($Xi,&QWP(0,$Xip)); |
| 1195 | &movdqa ($T3,&QWP(0,$const)); |
| 1196 | &movdqu ($Hkey,&QWP(0,$Htbl)); |
| 1197 | &pshufb ($Xi,$T3); |
| 1198 | |
| 1199 | &sub ($len,0x10); |
| 1200 | &jz (&label("odd_tail")); |
| 1201 | |
| 1202 | ####### |
| 1203 | # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = |
| 1204 | # [(H*Ii+1) + (H*Xi+1)] mod P = |
| 1205 | # [(H*Ii+1) + H^2*(Ii+Xi)] mod P |
| 1206 | # |
| 1207 | &movdqu ($T1,&QWP(0,$inp)); # Ii |
| 1208 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 |
| 1209 | &pshufb ($T1,$T3); |
| 1210 | &pshufb ($Xn,$T3); |
| 1211 | &pxor ($Xi,$T1); # Ii+Xi |
| 1212 | |
| 1213 | &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 |
| 1214 | &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 |
| 1215 | |
| 1216 | &sub ($len,0x20); |
| 1217 | &lea ($inp,&DWP(32,$inp)); # i+=2 |
| 1218 | &jbe (&label("even_tail")); |
| 1219 | |
| 1220 | &set_label("mod_loop"); |
| 1221 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) |
| 1222 | &movdqu ($Hkey,&QWP(0,$Htbl)); # load H |
| 1223 | |
| 1224 | &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) |
| 1225 | &pxor ($Xhi,$Xhn); |
| 1226 | |
| 1227 | &reduction_alg5 ($Xhi,$Xi); |
| 1228 | |
| 1229 | ####### |
| 1230 | &movdqa ($T3,&QWP(0,$const)); |
| 1231 | &movdqu ($T1,&QWP(0,$inp)); # Ii |
| 1232 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 |
| 1233 | &pshufb ($T1,$T3); |
| 1234 | &pshufb ($Xn,$T3); |
| 1235 | &pxor ($Xi,$T1); # Ii+Xi |
| 1236 | |
| 1237 | &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 |
| 1238 | &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 |
| 1239 | |
| 1240 | &sub ($len,0x20); |
| 1241 | &lea ($inp,&DWP(32,$inp)); |
| 1242 | &ja (&label("mod_loop")); |
| 1243 | |
| 1244 | &set_label("even_tail"); |
| 1245 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) |
| 1246 | |
| 1247 | &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) |
| 1248 | &pxor ($Xhi,$Xhn); |
| 1249 | |
| 1250 | &reduction_alg5 ($Xhi,$Xi); |
| 1251 | |
| 1252 | &movdqa ($T3,&QWP(0,$const)); |
| 1253 | &test ($len,$len); |
| 1254 | &jnz (&label("done")); |
| 1255 | |
| 1256 | &movdqu ($Hkey,&QWP(0,$Htbl)); # load H |
| 1257 | &set_label("odd_tail"); |
| 1258 | &movdqu ($T1,&QWP(0,$inp)); # Ii |
| 1259 | &pshufb ($T1,$T3); |
| 1260 | &pxor ($Xi,$T1); # Ii+Xi |
| 1261 | |
| 1262 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) |
| 1263 | &reduction_alg5 ($Xhi,$Xi); |
| 1264 | |
| 1265 | &movdqa ($T3,&QWP(0,$const)); |
| 1266 | &set_label("done"); |
| 1267 | &pshufb ($Xi,$T3); |
| 1268 | &movdqu (&QWP(0,$Xip),$Xi); |
| 1269 | &function_end("gcm_ghash_clmul"); |
| 1270 | |
| 1271 | } |
| 1272 | |
| 1273 | &set_label("bswap",64); |
| 1274 | &data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0); |
| 1275 | &data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial |
| 1276 | }} # $sse2 |
| 1277 | |
| 1278 | &set_label("rem_4bit",64); |
| 1279 | &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S); |
| 1280 | &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S); |
| 1281 | &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S); |
| 1282 | &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S); |
| 1283 | &set_label("rem_8bit",64); |
| 1284 | &data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E); |
| 1285 | &data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E); |
| 1286 | &data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E); |
| 1287 | &data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E); |
| 1288 | &data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E); |
| 1289 | &data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E); |
| 1290 | &data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E); |
| 1291 | &data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E); |
| 1292 | &data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE); |
| 1293 | &data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE); |
| 1294 | &data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE); |
| 1295 | &data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE); |
| 1296 | &data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E); |
| 1297 | &data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E); |
| 1298 | &data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE); |
| 1299 | &data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE); |
| 1300 | &data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E); |
| 1301 | &data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E); |
| 1302 | &data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E); |
| 1303 | &data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E); |
| 1304 | &data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E); |
| 1305 | &data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E); |
| 1306 | &data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E); |
| 1307 | &data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E); |
| 1308 | &data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE); |
| 1309 | &data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE); |
| 1310 | &data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE); |
| 1311 | &data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE); |
| 1312 | &data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E); |
| 1313 | &data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E); |
| 1314 | &data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE); |
| 1315 | &data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE); |
| 1316 | }}} # !$x86only |
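# Illustrative aside, not part of the module: the 16-bit constants in the
# rem_4bit and rem_8bit tables above follow a simple pattern. Entry i is
# the carry-less (XOR) product of i with 0x1C20 and 0x01C2 respectively.
# The sketch below regenerates them under that observation; the helper
# name clmul_small is hypothetical, and the <<$S shift and interleaved
# zero words of the rem_4bit rows are deliberately not modelled.

```perl
use strict;
use warnings;

# Carry-less multiply of a small constant by an index: XOR together
# the constant shifted left by each set bit position of the index.
sub clmul_small {
    my ($const, $i) = @_;
    my $r = 0;
    for (my $bit = 0; $i >> $bit; $bit++) {
        $r ^= $const << $bit if ($i >> $bit) & 1;
    }
    return $r;
}

my @rem_4bit = map { clmul_small(0x1C20, $_) } 0 .. 15;
my @rem_8bit = map { clmul_small(0x01C2, $_) } 0 .. 255;

printf "0x%04X 0x%04X\n", $rem_4bit[15], $rem_8bit[255];
```

# Comparing a handful of generated entries against the rows above is a
# quick way to catch a transcription error in the tables.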
| 1317 | |
| 1318 | &asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>"); |
| 1319 | &asm_finish(); |
| 1320 | |
| 1321 | # A question was raised about the choice of vanilla MMX, or rather why |
| 1322 | # SSE2 wasn't chosen instead. In addition to the fact that MMX runs on legacy |
| 1323 | # CPUs such as PIII, "4-bit" MMX version was observed to provide better |
| 1324 | # performance than *corresponding* SSE2 one even on contemporary CPUs. |
| 1325 | # SSE2 results were provided by Peter-Michael Hager. He maintains SSE2 |
| 1326 | # implementation featuring full range of lookup-table sizes, but with |
| 1327 | # per-invocation lookup table setup. Latter means that table size is |
| 1328 | # chosen depending on how much data is to be hashed in every given call, |
| 1329 | # more data - larger table. Best reported result for Core2 is ~4 cycles |
| 1330 | # per processed byte out of 64KB block. This number accounts even for |
| 1331 | # 64KB table setup overhead. As discussed in gcm128.c we choose to be |
| 1332 | # more conservative in respect to lookup table sizes, but how do the |
| 1333 | # results compare? Minimalistic "256B" MMX version delivers ~11 cycles |
| 1334 | # on same platform. As also discussed in gcm128.c, next in line "8-bit |
| 1335 | # Shoup's" or "4KB" method should deliver twice the performance of |
| 1336 | # "256B" one, in other words not worse than ~6 cycles per byte. It |
| 1337 | # should also be noted that in SSE2 case improvement can be |
| 1338 | # "super-linear," i.e. more than twice, mostly because >>8 maps to single |
| 1339 | # instruction on SSE2 register. This is unlike "4-bit" case when >>4 |
| 1340 | # maps to same amount of instructions in both MMX and SSE2 cases. |
| 1341 | # Bottom line is that switch to SSE2 is considered to be justifiable |
| 1342 | # only in case we choose to implement "8-bit" method... |