Blame - jni/openssl/crypto/modes/asm/ghash-x86.pl - jami-client-android

Alexandre Savard	1b09e31	2012-08-07 20:33:29 -0400	1	#!/usr/bin/env perl
				2	#
				3	# ====================================================================
				4	# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
				5	# project. The module is, however, dual licensed under OpenSSL and
				6	# CRYPTOGAMS licenses depending on where you obtain it. For further
				7	# details see http://www.openssl.org/~appro/cryptogams/.
				8	# ====================================================================
				9	#
				10	# March, May, June 2010
				11	#
				12	# The module implements "4-bit" GCM GHASH function and underlying
				13	# single multiplication operation in GF(2^128). "4-bit" means that it
				14	# uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two
				15	# code paths: vanilla x86 and vanilla MMX. Former will be executed on
				16	# 486 and Pentium, latter on all others. MMX GHASH features so called
				17	# "528B" variant of "4-bit" method utilizing additional 256+16 bytes
				18	# of per-key storage [+512 bytes shared table]. Performance results
				19	# are for streamed GHASH subroutine and are expressed in cycles per
				20	# processed byte, less is better:
				21	#
				22	#                gcc 2.95.3(*)   MMX assembler   x86 assembler
				23	#
				24	# Pentium        105/111(**)     -               50
				25	# PIII           68 /75          12.2            24
				26	# P4             125/125         17.8            84(***)
				27	# Opteron        66 /70          10.1            30
				28	# Core2          54 /67          8.4             18
				29	#
				30	# (*) gcc 3.4.x was observed to generate few percent slower code,
				31	# which is one of reasons why 2.95.3 results were chosen,
				32	# another reason is lack of 3.4.x results for older CPUs;
				33	# comparison with MMX results is not completely fair, because C
				34	# results are for vanilla "256B" implementation, while
				35	# assembler results are for "528B";-)
				36	# (**) second number is result for code compiled with -fPIC flag,
				37	# which is actually more relevant, because assembler code is
				38	# position-independent;
				39	# (***) see comment in non-MMX routine for further details;
				40	#
				41	# To summarize, it's >2-5 times faster than gcc-generated code. To
				42	# anchor it to something else SHA1 assembler processes one byte in
				43	# 11-13 cycles on contemporary x86 cores. As for choice of MMX in
				44	# particular, see comment at the end of the file...
				45
				46	# May 2010
				47	#
				48	# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
				49	# The question is how close is it to theoretical limit? The pclmulqdq
				50	# instruction latency appears to be 14 cycles and there can't be more
				51	# than 2 of them executing at any given time. This means that single
				52	# Karatsuba multiplication would take 28 cycles plus few cycles for
				53	# pre- and post-processing. Then multiplication has to be followed by
				54	# modulo-reduction. Given that aggregated reduction method [see
				55	# "Carry-less Multiplication and Its Usage for Computing the GCM Mode"
				56	# white paper by Intel] allows you to perform reduction only once in
				57	# a while we can assume that asymptotic performance can be estimated
				58	# as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction
				59	# and Naggr is the aggregation factor.
				60	#
				61	# Before we proceed to this implementation let's have closer look at
				62	# the best-performing code suggested by Intel in their white paper.
				63	# By tracing inter-register dependencies Tmod is estimated as ~19
				64	# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
				65	# processed byte. As implied, this is quite optimistic estimate,
				66	# because it does not account for Karatsuba pre- and post-processing,
				67	# which for a single multiplication is ~5 cycles. Unfortunately Intel
				68	# does not provide performance data for GHASH alone. But benchmarking
				69	# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
				70	# alone resulted in 2.46 cycles per byte out of 16KB buffer. Note that
				71	# the result accounts even for pre-computing of degrees of the hash
				72	# key H, but its portion is negligible at 16KB buffer size.
				73	#
				74	# Moving on to the implementation in question. Tmod is estimated as
				75	# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
				76	# 2.16. How is it possible that measured performance is better than
				77	# optimistic theoretical estimate? There is one thing Intel failed
				78	# to recognize. By serializing GHASH with CTR in same subroutine
				79	# former's performance is really limited to above (Tmul + Tmod/Naggr)
				80	# equation. But if GHASH procedure is detached, the modulo-reduction
				81	# can be interleaved with Naggr-1 multiplications at instruction level
				82	# and under ideal conditions even disappear from the equation. So that
				83	# optimistic theoretical estimate for this implementation is ...
				84	# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
				85	# at least for such small Naggr. I'd argue that (28+Tproc/Naggr)/16,
				86	# where Tproc is time required for Karatsuba pre- and post-processing,
				87	# is more realistic estimate. In this case it gives ... 1.91 cycles.
				88	# Or in other words, depending on how well we can interleave reduction
				89	# and one of the two multiplications the performance should be between
				90	# 1.91 and 2.16. As already mentioned, this implementation processes
				91	# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
				92	# - in 2.02. x86_64 performance is better, because larger register
				93	# bank allows to interleave reduction and multiplication better.
				94	#
				95	# Does it make sense to increase Naggr? To start with it's virtually
				96	# impossible in 32-bit mode, because of limited register bank
				97	# capacity. Otherwise improvement has to be weighed against slower
				98	# setup, as well as code size and complexity increase. As even
				99	# optimistic estimate doesn't promise 30% performance improvement,
				100	# there are currently no plans to increase Naggr.
				101	#
				102	# Special thanks to David Woodhouse <dwmw2@infradead.org> for
				103	# providing access to a Westmere-based system on behalf of Intel
				104	# Open Source Technology Centre.
				105
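# Illustrative sketch, not part of the original module: the asymptotic
# estimates quoted above follow from (Tmul+Textra/Naggr)/16 and can be
# reproduced with the helper below. The name ghash_cpb_estimate and its
# parameter names are hypothetical; nothing in the generator calls it,
# so it has no effect on the emitted assembler.
sub ghash_cpb_estimate {
    my ($tmul,$textra,$naggr) = @_;     # cycles, cycles, aggregation factor
    return ($tmul + $textra/$naggr)/16; # cycles per processed byte
}
# ghash_cpb_estimate(28,19,4) is ~2.05 (Intel code, Tmod=19, Naggr=4),
# ghash_cpb_estimate(28,13,2) is ~2.16 (this module, Tmod=13, Naggr=2),
# ghash_cpb_estimate(28,5,2)  is ~1.91 (Tproc=5 used in place of Tmod).
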
				106	# January 2010
				107	#
				108	# Tweaked to optimize transitions between integer and FP operations
				109	# on same XMM register, PCLMULQDQ subroutine was measured to process
				110	# one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on Westmere.
				111	# The minor regression on Westmere is outweighed by ~15% improvement
				112	# on Sandy Bridge. Strangely enough attempt to modify 64-bit code in
				113	# similar manner resulted in almost 20% degradation on Sandy Bridge,
				114	# where original 64-bit code processes one byte in 1.95 cycles.
				115
				116	$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
				117	push(@INC,"${dir}","${dir}../../perlasm");
				118	require "x86asm.pl";
				119
				120	&asm_init($ARGV[0],"ghash-x86.pl",$x86only = $ARGV[$#ARGV] eq "386");
				121
				122	$sse2=0;
				123	for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }
				124
				125	($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
				126	$inp = "edi";
				127	$Htbl = "esi";
				128
				129	$unroll = 0; # Affects x86 loop. Folded loop performs ~7% worse
				130	# than unrolled, which has to be weighed against
				131	# 2.5x x86-specific code size reduction.
				132
				133	sub x86_loop {
				134	my $off = shift;
				135	my $rem = "eax";
				136
				137	&mov ($Zhh,&DWP(4,$Htbl,$Zll));
				138	&mov ($Zhl,&DWP(0,$Htbl,$Zll));
				139	&mov ($Zlh,&DWP(12,$Htbl,$Zll));
				140	&mov ($Zll,&DWP(8,$Htbl,$Zll));
				141	&xor ($rem,$rem); # avoid partial register stalls on PIII
				142
				143	# shrd practically kills P4, 2.5x deterioration, but P4 has
				144	# MMX code-path to execute. shrd runs tad faster [than twice
				145	# the shifts, move's and or's] on pre-MMX Pentium (as well as
				146	# PIII and Core2), but minimizes code size, spares register
				147	# and thus allows to fold the loop...
				148	if (!$unroll) {
				149	my $cnt = $inp;
				150	&mov ($cnt,15);
				151	&jmp (&label("x86_loop"));
				152	&set_label("x86_loop",16);
				153	for($i=1;$i<=2;$i++) {
				154	&mov (&LB($rem),&LB($Zll));
				155	&shrd ($Zll,$Zlh,4);
				156	&and (&LB($rem),0xf);
				157	&shrd ($Zlh,$Zhl,4);
				158	&shrd ($Zhl,$Zhh,4);
				159	&shr ($Zhh,4);
				160	&xor ($Zhh,&DWP($off+16,"esp",$rem,4));
				161
				162	&mov (&LB($rem),&BP($off,"esp",$cnt));
				163	if ($i&1) {
				164	&and (&LB($rem),0xf0);
				165	} else {
				166	&shl (&LB($rem),4);
				167	}
				168
				169	&xor ($Zll,&DWP(8,$Htbl,$rem));
				170	&xor ($Zlh,&DWP(12,$Htbl,$rem));
				171	&xor ($Zhl,&DWP(0,$Htbl,$rem));
				172	&xor ($Zhh,&DWP(4,$Htbl,$rem));
				173
				174	if ($i&1) {
				175	&dec ($cnt);
				176	&js (&label("x86_break"));
				177	} else {
				178	&jmp (&label("x86_loop"));
				179	}
				180	}
				181	&set_label("x86_break",16);
				182	} else {
				183	for($i=1;$i<32;$i++) {
				184	&comment($i);
				185	&mov (&LB($rem),&LB($Zll));
				186	&shrd ($Zll,$Zlh,4);
				187	&and (&LB($rem),0xf);
				188	&shrd ($Zlh,$Zhl,4);
				189	&shrd ($Zhl,$Zhh,4);
				190	&shr ($Zhh,4);
				191	&xor ($Zhh,&DWP($off+16,"esp",$rem,4));
				192
				193	if ($i&1) {
				194	&mov (&LB($rem),&BP($off+15-($i>>1),"esp"));
				195	&and (&LB($rem),0xf0);
				196	} else {
				197	&mov (&LB($rem),&BP($off+15-($i>>1),"esp"));
				198	&shl (&LB($rem),4);
				199	}
				200
				201	&xor ($Zll,&DWP(8,$Htbl,$rem));
				202	&xor ($Zlh,&DWP(12,$Htbl,$rem));
				203	&xor ($Zhl,&DWP(0,$Htbl,$rem));
				204	&xor ($Zhh,&DWP(4,$Htbl,$rem));
				205	}
				206	}
				207	&bswap ($Zll);
				208	&bswap ($Zlh);
				209	&bswap ($Zhl);
				210	if (!$x86only) {
				211	&bswap ($Zhh);
				212	} else {
				213	&mov ("eax",$Zhh);
				214	&bswap ("eax");
				215	&mov ($Zhh,"eax");
				216	}
				217	}
				218
				219	if ($unroll) {
				220	&function_begin_B("_x86_gmult_4bit_inner");
				221	&x86_loop(4);
				222	&ret ();
				223	&function_end_B("_x86_gmult_4bit_inner");
				224	}
				225
				226	sub deposit_rem_4bit {
				227	my $bias = shift;
				228
				229	&mov (&DWP($bias+0, "esp"),0x0000<<16);
				230	&mov (&DWP($bias+4, "esp"),0x1C20<<16);
				231	&mov (&DWP($bias+8, "esp"),0x3840<<16);
				232	&mov (&DWP($bias+12,"esp"),0x2460<<16);
				233	&mov (&DWP($bias+16,"esp"),0x7080<<16);
				234	&mov (&DWP($bias+20,"esp"),0x6CA0<<16);
				235	&mov (&DWP($bias+24,"esp"),0x48C0<<16);
				236	&mov (&DWP($bias+28,"esp"),0x54E0<<16);
				237	&mov (&DWP($bias+32,"esp"),0xE100<<16);
				238	&mov (&DWP($bias+36,"esp"),0xFD20<<16);
				239	&mov (&DWP($bias+40,"esp"),0xD940<<16);
				240	&mov (&DWP($bias+44,"esp"),0xC560<<16);
				241	&mov (&DWP($bias+48,"esp"),0x9180<<16);
				242	&mov (&DWP($bias+52,"esp"),0x8DA0<<16);
				243	&mov (&DWP($bias+56,"esp"),0xA9C0<<16);
				244	&mov (&DWP($bias+60,"esp"),0xB5E0<<16);
				245	}
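
# Illustrative sketch, not part of the original module. The sixteen
# constants stored by deposit_rem_4bit appear to be XOR combinations of
# 0x1C20 shifted left by the position of each set bit of the index, so
# the table could in principle be generated rather than spelled out.
# The helper name rem_4bit_entry is hypothetical and the generator
# never calls it.
sub rem_4bit_entry {
    my $i = shift;                      # table index, 0..15
    my $r = 0;
    for my $bit (0..3) {
        $r ^= 0x1C20<<$bit if (($i>>$bit)&1);
    }
    return $r;                          # caller would still apply <<16
}
# For instance rem_4bit_entry(3) is 0x1C20^0x3840, i.e. 0x2460, which
# matches the value deposited at $bias+12 above.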
				246
				247	$suffix = $x86only ? "" : "_x86";
				248
				249	&function_begin("gcm_gmult_4bit".$suffix);
				250	&stack_push(16+4+1); # +1 for stack alignment
				251	&mov ($inp,&wparam(0)); # load Xi
				252	&mov ($Htbl,&wparam(1)); # load Htable
				253
				254	&mov ($Zhh,&DWP(0,$inp)); # load Xi[16]
				255	&mov ($Zhl,&DWP(4,$inp));
				256	&mov ($Zlh,&DWP(8,$inp));
				257	&mov ($Zll,&DWP(12,$inp));
				258
				259	&deposit_rem_4bit(16);
				260
				261	&mov (&DWP(0,"esp"),$Zhh); # copy Xi[16] on stack
				262	&mov (&DWP(4,"esp"),$Zhl);
				263	&mov (&DWP(8,"esp"),$Zlh);
				264	&mov (&DWP(12,"esp"),$Zll);
				265	&shr ($Zll,20);
				266	&and ($Zll,0xf0);
				267
				268	if ($unroll) {
				269	&call ("_x86_gmult_4bit_inner");
				270	} else {
				271	&x86_loop(0);
				272	&mov ($inp,&wparam(0));
				273	}
				274
				275	&mov (&DWP(12,$inp),$Zll);
				276	&mov (&DWP(8,$inp),$Zlh);
				277	&mov (&DWP(4,$inp),$Zhl);
				278	&mov (&DWP(0,$inp),$Zhh);
				279	&stack_pop(16+4+1);
				280	&function_end("gcm_gmult_4bit".$suffix);
				281
				282	&function_begin("gcm_ghash_4bit".$suffix);
				283	&stack_push(16+4+1); # +1 for 64-bit alignment
				284	&mov ($Zll,&wparam(0)); # load Xi
				285	&mov ($Htbl,&wparam(1)); # load Htable
				286	&mov ($inp,&wparam(2)); # load in
				287	&mov ("ecx",&wparam(3)); # load len
				288	&add ("ecx",$inp);
				289	&mov (&wparam(3),"ecx");
				290
				291	&mov ($Zhh,&DWP(0,$Zll)); # load Xi[16]
				292	&mov ($Zhl,&DWP(4,$Zll));
				293	&mov ($Zlh,&DWP(8,$Zll));
				294	&mov ($Zll,&DWP(12,$Zll));
				295
				296	&deposit_rem_4bit(16);
				297
				298	&set_label("x86_outer_loop",16);
				299	&xor ($Zll,&DWP(12,$inp)); # xor with input
				300	&xor ($Zlh,&DWP(8,$inp));
				301	&xor ($Zhl,&DWP(4,$inp));
				302	&xor ($Zhh,&DWP(0,$inp));
				303	&mov (&DWP(12,"esp"),$Zll); # dump it on stack
				304	&mov (&DWP(8,"esp"),$Zlh);
				305	&mov (&DWP(4,"esp"),$Zhl);
				306	&mov (&DWP(0,"esp"),$Zhh);
				307
				308	&shr ($Zll,20);
				309	&and ($Zll,0xf0);
				310
				311	if ($unroll) {
				312	&call ("_x86_gmult_4bit_inner");
				313	} else {
				314	&x86_loop(0);
				315	&mov ($inp,&wparam(2));
				316	}
				317	&lea ($inp,&DWP(16,$inp));
				318	&cmp ($inp,&wparam(3));
				319	&mov (&wparam(2),$inp) if (!$unroll);
				320	&jb (&label("x86_outer_loop"));
				321
				322	&mov ($inp,&wparam(0)); # load Xi
				323	&mov (&DWP(12,$inp),$Zll);
				324	&mov (&DWP(8,$inp),$Zlh);
				325	&mov (&DWP(4,$inp),$Zhl);
				326	&mov (&DWP(0,$inp),$Zhh);
				327	&stack_pop(16+4+1);
				328	&function_end("gcm_ghash_4bit".$suffix);
				329
				330	if (!$x86only) {{{
				331
				332	&static_label("rem_4bit");
				333
				334	if (!$sse2) {{ # pure-MMX "May" version...
				335
				336	$S=12; # shift factor for rem_4bit
				337
				338	&function_begin_B("_mmx_gmult_4bit_inner");
				339	# MMX version performs 3.5 times better on P4 (see comment in non-MMX
				340	# routine for further details), 100% better on Opteron, ~70% better
				341	# on Core2 and PIII... In other words effort is considered to be well
				342	# spent... Since initial release the loop was unrolled in order to
				343	# "liberate" register previously used as loop counter. Instead it's
				344	# used to optimize critical path in 'Z.hi ^= rem_4bit[Z.lo&0xf]'.
				345	# The path involves move of Z.lo from MMX to integer register,
				346	# effective address calculation and finally merge of value to Z.hi.
				347	# Reference to rem_4bit is scheduled so late that I had to >>4
				348	# rem_4bit elements. This resulted in 20-45% improvement
				349	# on contemporary µ-archs.
				350	{
				351	my $cnt;
				352	my $rem_4bit = "eax";
				353	my @rem = ($Zhh,$Zll);
				354	my $nhi = $Zhl;
				355	my $nlo = $Zlh;
				356
				357	my ($Zlo,$Zhi) = ("mm0","mm1");
				358	my $tmp = "mm2";
				359
				360	&xor ($nlo,$nlo); # avoid partial register stalls on PIII
				361	&mov ($nhi,$Zll);
				362	&mov (&LB($nlo),&LB($nhi));
				363	&shl (&LB($nlo),4);
				364	&and ($nhi,0xf0);
				365	&movq ($Zlo,&QWP(8,$Htbl,$nlo));
				366	&movq ($Zhi,&QWP(0,$Htbl,$nlo));
				367	&movd ($rem[0],$Zlo);
				368
				369	for ($cnt=28;$cnt>=-2;$cnt--) {
				370	my $odd = $cnt&1;
				371	my $nix = $odd ? $nlo : $nhi;
				372
				373	&shl (&LB($nlo),4) if ($odd);
				374	&psrlq ($Zlo,4);
				375	&movq ($tmp,$Zhi);
				376	&psrlq ($Zhi,4);
				377	&pxor ($Zlo,&QWP(8,$Htbl,$nix));
				378	&mov (&LB($nlo),&BP($cnt/2,$inp)) if (!$odd && $cnt>=0);
				379	&psllq ($tmp,60);
				380	&and ($nhi,0xf0) if ($odd);
				381	&pxor ($Zhi,&QWP(0,$rem_4bit,$rem[1],8)) if ($cnt<28);
				382	&and ($rem[0],0xf);
				383	&pxor ($Zhi,&QWP(0,$Htbl,$nix));
				384	&mov ($nhi,$nlo) if (!$odd && $cnt>=0);
				385	&movd ($rem[1],$Zlo);
				386	&pxor ($Zlo,$tmp);
				387
				388	push (@rem,shift(@rem)); # "rotate" registers
				389	}
				390
				391	&mov ($inp,&DWP(4,$rem_4bit,$rem[1],8)); # last rem_4bit[rem]
				392
				393	&psrlq ($Zlo,32); # lower part of Zlo is already there
				394	&movd ($Zhl,$Zhi);
				395	&psrlq ($Zhi,32);
				396	&movd ($Zlh,$Zlo);
				397	&movd ($Zhh,$Zhi);
				398	&shl ($inp,4); # compensate for rem_4bit[i] being >>4
				399
				400	&bswap ($Zll);
				401	&bswap ($Zhl);
				402	&bswap ($Zlh);
				403	&xor ($Zhh,$inp);
				404	&bswap ($Zhh);
				405
				406	&ret ();
				407	}
				408	&function_end_B("_mmx_gmult_4bit_inner");
				409
				410	&function_begin("gcm_gmult_4bit_mmx");
				411	&mov ($inp,&wparam(0)); # load Xi
				412	&mov ($Htbl,&wparam(1)); # load Htable
				413
				414	&call (&label("pic_point"));
				415	&set_label("pic_point");
				416	&blindpop("eax");
				417	&lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));
				418
				419	&movz ($Zll,&BP(15,$inp));
				420
				421	&call ("_mmx_gmult_4bit_inner");
				422
				423	&mov ($inp,&wparam(0)); # load Xi
				424	&emms ();
				425	&mov (&DWP(12,$inp),$Zll);
				426	&mov (&DWP(4,$inp),$Zhl);
				427	&mov (&DWP(8,$inp),$Zlh);
				428	&mov (&DWP(0,$inp),$Zhh);
				429	&function_end("gcm_gmult_4bit_mmx");
				430
				431	# Streamed version performs 20% better on P4, 7% on Opteron,
				432	# 10% on Core2 and PIII...
				433	&function_begin("gcm_ghash_4bit_mmx");
				434	&mov ($Zhh,&wparam(0)); # load Xi
				435	&mov ($Htbl,&wparam(1)); # load Htable
				436	&mov ($inp,&wparam(2)); # load in
				437	&mov ($Zlh,&wparam(3)); # load len
				438
				439	&call (&label("pic_point"));
				440	&set_label("pic_point");
				441	&blindpop("eax");
				442	&lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));
				443
				444	&add ($Zlh,$inp);
				445	&mov (&wparam(3),$Zlh); # len to point at the end of input
				446	&stack_push(4+1); # +1 for stack alignment
				447
				448	&mov ($Zll,&DWP(12,$Zhh)); # load Xi[16]
				449	&mov ($Zhl,&DWP(4,$Zhh));
				450	&mov ($Zlh,&DWP(8,$Zhh));
				451	&mov ($Zhh,&DWP(0,$Zhh));
				452	&jmp (&label("mmx_outer_loop"));
				453
				454	&set_label("mmx_outer_loop",16);
				455	&xor ($Zll,&DWP(12,$inp));
				456	&xor ($Zhl,&DWP(4,$inp));
				457	&xor ($Zlh,&DWP(8,$inp));
				458	&xor ($Zhh,&DWP(0,$inp));
				459	&mov (&wparam(2),$inp);
				460	&mov (&DWP(12,"esp"),$Zll);
				461	&mov (&DWP(4,"esp"),$Zhl);
				462	&mov (&DWP(8,"esp"),$Zlh);
				463	&mov (&DWP(0,"esp"),$Zhh);
				464
				465	&mov ($inp,"esp");
				466	&shr ($Zll,24);
				467
				468	&call ("_mmx_gmult_4bit_inner");
				469
				470	&mov ($inp,&wparam(2));
				471	&lea ($inp,&DWP(16,$inp));
				472	&cmp ($inp,&wparam(3));
				473	&jb (&label("mmx_outer_loop"));
				474
				475	&mov ($inp,&wparam(0)); # load Xi
				476	&emms ();
				477	&mov (&DWP(12,$inp),$Zll);
				478	&mov (&DWP(4,$inp),$Zhl);
				479	&mov (&DWP(8,$inp),$Zlh);
				480	&mov (&DWP(0,$inp),$Zhh);
				481
				482	&stack_pop(4+1);
				483	&function_end("gcm_ghash_4bit_mmx");
				484
				485	}} else {{ # "June" MMX version...
				486	# ... has slower "April" gcm_gmult_4bit_mmx with folded
				487	# loop. This is done to conserve code size...
				488	$S=16; # shift factor for rem_4bit
				489
				490	sub mmx_loop() {
				491	# MMX version performs 2.8 times better on P4 (see comment in non-MMX
				492	# routine for further details), 40% better on Opteron and Core2, 50%
				493	# better on PIII... In other words effort is considered to be well
				494	# spent...
				495	my $inp = shift;
				496	my $rem_4bit = shift;
				497	my $cnt = $Zhh;
				498	my $nhi = $Zhl;
				499	my $nlo = $Zlh;
				500	my $rem = $Zll;
				501
				502	my ($Zlo,$Zhi) = ("mm0","mm1");
				503	my $tmp = "mm2";
				504
				505	&xor ($nlo,$nlo); # avoid partial register stalls on PIII
				506	&mov ($nhi,$Zll);
				507	&mov (&LB($nlo),&LB($nhi));
				508	&mov ($cnt,14);
				509	&shl (&LB($nlo),4);
				510	&and ($nhi,0xf0);
				511	&movq ($Zlo,&QWP(8,$Htbl,$nlo));
				512	&movq ($Zhi,&QWP(0,$Htbl,$nlo));
				513	&movd ($rem,$Zlo);
				514	&jmp (&label("mmx_loop"));
				515
				516	&set_label("mmx_loop",16);
				517	&psrlq ($Zlo,4);
				518	&and ($rem,0xf);
				519	&movq ($tmp,$Zhi);
				520	&psrlq ($Zhi,4);
				521	&pxor ($Zlo,&QWP(8,$Htbl,$nhi));
				522	&mov (&LB($nlo),&BP(0,$inp,$cnt));
				523	&psllq ($tmp,60);
				524	&pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
				525	&dec ($cnt);
				526	&movd ($rem,$Zlo);
				527	&pxor ($Zhi,&QWP(0,$Htbl,$nhi));
				528	&mov ($nhi,$nlo);
				529	&pxor ($Zlo,$tmp);
				530	&js (&label("mmx_break"));
				531
				532	&shl (&LB($nlo),4);
				533	&and ($rem,0xf);
				534	&psrlq ($Zlo,4);
				535	&and ($nhi,0xf0);
				536	&movq ($tmp,$Zhi);
				537	&psrlq ($Zhi,4);
				538	&pxor ($Zlo,&QWP(8,$Htbl,$nlo));
				539	&psllq ($tmp,60);
				540	&pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
				541	&movd ($rem,$Zlo);
				542	&pxor ($Zhi,&QWP(0,$Htbl,$nlo));
				543	&pxor ($Zlo,$tmp);
				544	&jmp (&label("mmx_loop"));
				545
				546	&set_label("mmx_break",16);
				547	&shl (&LB($nlo),4);
				548	&and ($rem,0xf);
				549	&psrlq ($Zlo,4);
				550	&and ($nhi,0xf0);
				551	&movq ($tmp,$Zhi);
				552	&psrlq ($Zhi,4);
				553	&pxor ($Zlo,&QWP(8,$Htbl,$nlo));
				554	&psllq ($tmp,60);
				555	&pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
				556	&movd ($rem,$Zlo);
				557	&pxor ($Zhi,&QWP(0,$Htbl,$nlo));
				558	&pxor ($Zlo,$tmp);
				559
				560	&psrlq ($Zlo,4);
				561	&and ($rem,0xf);
				562	&movq ($tmp,$Zhi);
				563	&psrlq ($Zhi,4);
				564	&pxor ($Zlo,&QWP(8,$Htbl,$nhi));
				565	&psllq ($tmp,60);
				566	&pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8));
				567	&movd ($rem,$Zlo);
				568	&pxor ($Zhi,&QWP(0,$Htbl,$nhi));
				569	&pxor ($Zlo,$tmp);
				570
				571	&psrlq ($Zlo,32); # lower part of Zlo is already there
				572	&movd ($Zhl,$Zhi);
				573	&psrlq ($Zhi,32);
				574	&movd ($Zlh,$Zlo);
				575	&movd ($Zhh,$Zhi);
				576
				577	&bswap ($Zll);
				578	&bswap ($Zhl);
				579	&bswap ($Zlh);
				580	&bswap ($Zhh);
				581	}
				582
				583	&function_begin("gcm_gmult_4bit_mmx");
				584	&mov ($inp,&wparam(0)); # load Xi
				585	&mov ($Htbl,&wparam(1)); # load Htable
				586
				587	&call (&label("pic_point"));
				588	&set_label("pic_point");
				589	&blindpop("eax");
				590	&lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));
				591
				592	&movz ($Zll,&BP(15,$inp));
				593
				594	&mmx_loop($inp,"eax");
				595
				596	&emms ();
				597	&mov (&DWP(12,$inp),$Zll);
				598	&mov (&DWP(4,$inp),$Zhl);
				599	&mov (&DWP(8,$inp),$Zlh);
				600	&mov (&DWP(0,$inp),$Zhh);
				601	&function_end("gcm_gmult_4bit_mmx");
				602
				603	######################################################################
				604	# Below subroutine is "528B" variant of "4-bit" GCM GHASH function
				605	# (see gcm128.c for details). It provides further 20-40% performance
				606	# improvement over above mentioned "May" version.
				607
				608	&static_label("rem_8bit");
				609
				610	&function_begin("gcm_ghash_4bit_mmx");
				611	{ my ($Zlo,$Zhi) = ("mm7","mm6");
				612	my $rem_8bit = "esi";
				613	my $Htbl = "ebx";
				614
				615	# parameter block
				616	&mov ("eax",&wparam(0)); # Xi
				617	&mov ("ebx",&wparam(1)); # Htable
				618	&mov ("ecx",&wparam(2)); # inp
				619	&mov ("edx",&wparam(3)); # len
				620	&mov ("ebp","esp"); # original %esp
				621	&call (&label("pic_point"));
				622	&set_label ("pic_point");
				623	&blindpop ($rem_8bit);
				624	&lea ($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit));
				625
				626	&sub ("esp",512+16+16); # allocate stack frame...
				627	&and ("esp",-64); # ...and align it
				628	&sub ("esp",16); # place for (u8)(H[]<<4)
				629
				630	&add ("edx","ecx"); # pointer to the end of input
				631	&mov (&DWP(528+16+0,"esp"),"eax"); # save Xi
				632	&mov (&DWP(528+16+8,"esp"),"edx"); # save inp+len
				633	&mov (&DWP(528+16+12,"esp"),"ebp"); # save original %esp
				634
				635	{ my @lo = ("mm0","mm1","mm2");
				636	my @hi = ("mm3","mm4","mm5");
				637	my @tmp = ("mm6","mm7");
				638	my ($off1,$off2,$i)=(0,0);
				639
				640	&add ($Htbl,128); # optimize for size
				641	&lea ("edi",&DWP(16+128,"esp"));
				642	&lea ("ebp",&DWP(16+256+128,"esp"));
				643
				644	# decompose Htable (low and high parts are kept separately),
				645	# generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
				646	for ($i=0;$i<18;$i++) {
				647
				648	&mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16);
				649	&movq ($lo[0],&QWP(16*$i+8-128,$Htbl)) if ($i<16);
				650	&psllq ($tmp[1],60) if ($i>1);
				651	&movq ($hi[0],&QWP(16*$i+0-128,$Htbl)) if ($i<16);
				652	&por ($lo[2],$tmp[1]) if ($i>1);
				653	&movq (&QWP($off1-128,"edi"),$lo[1]) if ($i>0 && $i<17);
				654	&psrlq ($lo[1],4) if ($i>0 && $i<17);
				655	&movq (&QWP($off1,"edi"),$hi[1]) if ($i>0 && $i<17);
				656	&movq ($tmp[0],$hi[1]) if ($i>0 && $i<17);
				657	&movq (&QWP($off2-128,"ebp"),$lo[2]) if ($i>1);
				658	&psrlq ($hi[1],4) if ($i>0 && $i<17);
				659	&movq (&QWP($off2,"ebp"),$hi[2]) if ($i>1);
				660	&shl ("edx",4) if ($i<16);
				661	&mov (&BP($i,"esp"),&LB("edx")) if ($i<16);
				662
				663	unshift (@lo,pop(@lo)); # "rotate" registers
				664	unshift (@hi,pop(@hi));
				665	unshift (@tmp,pop(@tmp));
				666	$off1 += 8 if ($i>0);
				667	$off2 += 8 if ($i>1);
				668	}
				669	}
				670
				671	&movq ($Zhi,&QWP(0,"eax"));
				672	&mov ("ebx",&DWP(8,"eax"));
				673	&mov ("edx",&DWP(12,"eax")); # load Xi
				674
				675	&set_label("outer",16);
				676	{ my $nlo = "eax";
				677	my $dat = "edx";
				678	my @nhi = ("edi","ebp");
				679	my @rem = ("ebx","ecx");
				680	my @red = ("mm0","mm1","mm2");
				681	my $tmp = "mm3";
				682
				683	&xor ($dat,&DWP(12,"ecx")); # merge input data
				684	&xor ("ebx",&DWP(8,"ecx"));
				685	&pxor ($Zhi,&QWP(0,"ecx"));
				686	&lea ("ecx",&DWP(16,"ecx")); # inp+=16
				687	#&mov (&DWP(528+12,"esp"),$dat); # save inp^Xi
				688	&mov (&DWP(528+8,"esp"),"ebx");
				689	&movq (&QWP(528+0,"esp"),$Zhi);
				690	&mov (&DWP(528+16+4,"esp"),"ecx"); # save inp
				691
				692	&xor ($nlo,$nlo);
				693	&rol ($dat,8);
				694	&mov (&LB($nlo),&LB($dat));
				695	&mov ($nhi[1],$nlo);
				696	&and (&LB($nlo),0x0f);
				697	&shr ($nhi[1],4);
				698	&pxor ($red[0],$red[0]);
				699	&rol ($dat,8); # next byte
				700	&pxor ($red[1],$red[1]);
				701	&pxor ($red[2],$red[2]);
				702
				703	# Just like in "May" version modulo-schedule for critical path in
				704	# 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor'
				705	# is scheduled so late that rem_8bit[] has to be shifted right
				706	# by 16, which is why last argument to pinsrw is 2, which
				707	# corresponds to <<32=<<48>>16...
				708	for ($j=11,$i=0;$i<15;$i++) {
				709
				710	if ($i>0) {
				711	&pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo]
				712	&rol ($dat,8); # next byte
				713	&pxor ($Zhi,&QWP(16+128,"esp",$nlo,8));
				714
				715	&pxor ($Zlo,$tmp);
				716	&pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
				717	&xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4)
				718	} else {
				719	&movq ($Zlo,&QWP(16,"esp",$nlo,8));
				720	&movq ($Zhi,&QWP(16+128,"esp",$nlo,8));
				721	}
				722
				723	&mov (&LB($nlo),&LB($dat));
				724	&mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0);
				725
				726	&movd ($rem[0],$Zlo);
				727	&movz ($rem[1],&LB($rem[1])) if ($i>0);
				728	&psrlq ($Zlo,8); # Z>>=8
				729
				730	&movq ($tmp,$Zhi);
				731	&mov ($nhi[0],$nlo);
				732	&psrlq ($Zhi,8);
				733
				734	&pxor ($Zlo,&QWP(16+256+0,"esp",$nhi[1],8)); # Z^=H[nhi]>>4
				735	&and (&LB($nlo),0x0f);
				736	&psllq ($tmp,56);
				737
				738	&pxor ($Zhi,$red[1]) if ($i>1);
				739	&shr ($nhi[0],4);
				740	&pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2) if ($i>0);
				741
				742	unshift (@red,pop(@red)); # "rotate" registers
				743	unshift (@rem,pop(@rem));
				744	unshift (@nhi,pop(@nhi));
				745	}
				746
				747	&pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo]
				748	&pxor ($Zhi,&QWP(16+128,"esp",$nlo,8));
				749	&xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4)
				750
				751	&pxor ($Zlo,$tmp);
				752	&pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
				753	&movz ($rem[1],&LB($rem[1]));
				754
				755	&pxor ($red[2],$red[2]); # clear 2nd word
				756	&psllq ($red[1],4);
				757
				758	&movd ($rem[0],$Zlo);
				759	&psrlq ($Zlo,4); # Z>>=4
				760
				761	&movq ($tmp,$Zhi);
				762	&psrlq ($Zhi,4);
				763	&shl ($rem[0],4); # rem<<4
				764
				765	&pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi]
				766	&psllq ($tmp,60);
				767	&movz ($rem[0],&LB($rem[0]));
				768
				769	&pxor ($Zlo,$tmp);
				770	&pxor ($Zhi,&QWP(16+128,"esp",$nhi[1],8));
				771
				772	&pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2);
				773	&pxor ($Zhi,$red[1]);
				774
				775	&movd ($dat,$Zlo);
				776	&pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48
				777
				778	&psllq ($red[0],12); # correct by <<16>>4
				779	&pxor ($Zhi,$red[0]);
				780	&psrlq ($Zlo,32);
				781	&pxor ($Zhi,$red[2]);
				782
				783	&mov ("ecx",&DWP(528+16+4,"esp")); # restore inp
				784	&movd ("ebx",$Zlo);
				785	&movq ($tmp,$Zhi); # 01234567
				786	&psllw ($Zhi,8); # 1.3.5.7.
				787	&psrlw ($tmp,8); # .0.2.4.6
				788	&por ($Zhi,$tmp); # 10325476
				789	&bswap ($dat);
				790	&pshufw ($Zhi,$Zhi,0b00011011); # 76543210
				791	&bswap ("ebx");
				792
				793	&cmp ("ecx",&DWP(528+16+8,"esp")); # are we done?
				794	&jne (&label("outer"));
				795	}
				796
				797	&mov ("eax",&DWP(528+16+0,"esp")); # restore Xi
				798	&mov (&DWP(12,"eax"),"edx");
				799	&mov (&DWP(8,"eax"),"ebx");
				800	&movq (&QWP(0,"eax"),$Zhi);
				801
				802	&mov ("esp",&DWP(528+16+12,"esp")); # restore original %esp
				803	&emms ();
				804	}
				805	&function_end("gcm_ghash_4bit_mmx");
				806	}}
				807
				808	if ($sse2) {{
				809	######################################################################
				810	# PCLMULQDQ version.
				811
				812	$Xip="eax";
				813	$Htbl="edx";
				814	$const="ecx";
				815	$inp="esi";
				816	$len="ebx";
				817
				818	($Xi,$Xhi)=("xmm0","xmm1"); $Hkey="xmm2";
				819	($T1,$T2,$T3)=("xmm3","xmm4","xmm5");
				820	($Xn,$Xhn)=("xmm6","xmm7");
				821
				822	&static_label("bswap");
				823
				824	sub clmul64x64_T2 { # minimal "register" pressure
				825	my ($Xhi,$Xi,$Hkey)=@_;
				826
				827	&movdqa ($Xhi,$Xi); #
				828	&pshufd ($T1,$Xi,0b01001110);
				829	&pshufd ($T2,$Hkey,0b01001110);
				830	&pxor ($T1,$Xi); #
				831	&pxor ($T2,$Hkey);
				832
				833	&pclmulqdq ($Xi,$Hkey,0x00); #######
				834	&pclmulqdq ($Xhi,$Hkey,0x11); #######
				835	&pclmulqdq ($T1,$T2,0x00); #######
				836	&xorps ($T1,$Xi); #
				837	&xorps ($T1,$Xhi); #
				838
				839	&movdqa ($T2,$T1); #
				840	&psrldq ($T1,8);
				841	&pslldq ($T2,8); #
				842	&pxor ($Xhi,$T1);
				843	&pxor ($Xi,$T2); #
				844	}
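
# Illustrative sketch, not part of the original module: the Karatsuba
# arrangement used by clmul64x64_T2 (three carry-less multiplications,
# with the middle term recovered by XOR-ing in the low and high
# products), modelled on 8-bit halves so that it runs on plain Perl
# integers. The names clmul_bits and karatsuba_clmul16 are hypothetical
# and the generator never calls them.
sub clmul_bits {                        # schoolbook carry-less multiply
    my ($u,$v) = @_;
    my $r = 0;
    for (my $i=0; $v>>$i; $i++) {
        $r ^= $u<<$i if (($v>>$i)&1);
    }
    return $r;
}
sub karatsuba_clmul16 {                 # 16x16 bits via three 8x8 products
    my ($x,$h) = @_;
    my ($xl,$xh) = ($x&0xff,$x>>8);
    my ($hl,$hh) = ($h&0xff,$h>>8);
    my $lo  = clmul_bits($xl,$hl);      # cf. pclmulqdq with 0x00
    my $hi  = clmul_bits($xh,$hh);      # cf. pclmulqdq with 0x11
    my $mid = clmul_bits($xl^$xh,$hl^$hh)^$lo^$hi;
    return ($hi<<16)^($mid<<8)^$lo;     # cf. the psrldq/pslldq merge
}
# karatsuba_clmul16($x,$h) should equal clmul_bits($x,$h) for any
# 16-bit $x and $h, using three half-width products instead of four.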
				845
				846	sub clmul64x64_T3 {
				847	# Even though this subroutine offers visually better ILP, it
				848	# was empirically found to be a tad slower than above version.
				849	# At least in gcm_ghash_clmul context. But it's just as well,
				850	# because loop modulo-scheduling is possible only thanks to
				851	# minimized "register" pressure...
				852	my ($Xhi,$Xi,$Hkey)=@_;
				853
				854	&movdqa ($T1,$Xi); #
				855	&movdqa ($Xhi,$Xi);
				856	&pclmulqdq ($Xi,$Hkey,0x00); #######
				857	&pclmulqdq ($Xhi,$Hkey,0x11); #######
				858	&pshufd ($T2,$T1,0b01001110); #
				859	&pshufd ($T3,$Hkey,0b01001110);
				860	&pxor ($T2,$T1); #
				861	&pxor ($T3,$Hkey);
				862	&pclmulqdq ($T2,$T3,0x00); #######
				863	&pxor ($T2,$Xi); #
				864	&pxor ($T2,$Xhi); #
				865
				866	&movdqa ($T3,$T2); #
				867	&psrldq ($T2,8);
				868	&pslldq ($T3,8); #
				869	&pxor ($Xhi,$T2);
				870	&pxor ($Xi,$T3); #
				871	}
				872
				873	if (1) { # Algorithm 9 with <<1 twist.
				874	# Reduction is shorter and uses only two
				875	# temporary registers, which makes it better
				876	# candidate for interleaving with 64x64
				877	# multiplication. Pre-modulo-scheduled loop
				878	# was found to be ~20% faster than Algorithm 5
				879	# below. Algorithm 9 was therefore chosen for
				880	# further optimization...
				881
				882	sub reduction_alg9 { # 17/13 times faster than Intel version
				883	my ($Xhi,$Xi) = @_;
				884
				885	# 1st phase
				886	&movdqa ($T1,$Xi); #
				887	&psllq ($Xi,1);
				888	&pxor ($Xi,$T1); #
				889	&psllq ($Xi,5); #
				890	&pxor ($Xi,$T1); #
				891	&psllq ($Xi,57); #
				892	&movdqa ($T2,$Xi); #
				893	&pslldq ($Xi,8);
				894	&psrldq ($T2,8); #
				895	&pxor ($Xi,$T1);
				896	&pxor ($Xhi,$T2); #
				897
				898	# 2nd phase
				899	&movdqa ($T2,$Xi);
				900	&psrlq ($Xi,5);
				901	&pxor ($Xi,$T2); #
				902	&psrlq ($Xi,1); #
				903	&pxor ($Xi,$T2); #
				904	&pxor ($T2,$Xhi);
				905	&psrlq ($Xi,1); #
				906	&pxor ($Xi,$T2); #
				907	}
				908
				909	&function_begin_B("gcm_init_clmul");
				910	&mov ($Htbl,&wparam(0));
				911	&mov ($Xip,&wparam(1));
				912
				913	&call (&label("pic"));
				914	&set_label("pic");
				915	&blindpop ($const);
				916	&lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));
				917
				918	&movdqu ($Hkey,&QWP(0,$Xip));
				919	&pshufd ($Hkey,$Hkey,0b01001110);# dword swap
				920
				921	# <<1 twist
				922	&pshufd ($T2,$Hkey,0b11111111); # broadcast uppermost dword
				923	&movdqa ($T1,$Hkey);
				924	&psllq ($Hkey,1);
				925	&pxor ($T3,$T3); #
				926	&psrlq ($T1,63);
				927	&pcmpgtd ($T3,$T2); # broadcast carry bit
				928	&pslldq ($T1,8);
				929	&por ($Hkey,$T1); # H<<=1
				930
				931	# magic reduction
				932	&pand ($T3,&QWP(16,$const)); # 0x1c2_polynomial
				933	&pxor ($Hkey,$T3); # if(carry) H^=0x1c2_polynomial
				934
				935	# calculate H^2
				936	&movdqa ($Xi,$Hkey);
				937	&clmul64x64_T2 ($Xhi,$Xi,$Hkey);
				938	&reduction_alg9 ($Xhi,$Xi);
				939
				940	&movdqu (&QWP(0,$Htbl),$Hkey); # save H
				941	&movdqu (&QWP(16,$Htbl),$Xi); # save H^2
				942
				943	&ret ();
				944	&function_end_B("gcm_init_clmul");
				945
				946	&function_begin_B("gcm_gmult_clmul");
				947	&mov ($Xip,&wparam(0));
				948	&mov ($Htbl,&wparam(1));
				949
				950	&call (&label("pic"));
				951	&set_label("pic");
				952	&blindpop ($const);
				953	&lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));
				954
				955	&movdqu ($Xi,&QWP(0,$Xip));
				956	&movdqa ($T3,&QWP(0,$const));
				957	&movups ($Hkey,&QWP(0,$Htbl));
				958	&pshufb ($Xi,$T3);
				959
				960	&clmul64x64_T2 ($Xhi,$Xi,$Hkey);
				961	&reduction_alg9 ($Xhi,$Xi);
				962
				963	&pshufb ($Xi,$T3);
				964	&movdqu (&QWP(0,$Xip),$Xi);
				965
				966	&ret ();
				967	&function_end_B("gcm_gmult_clmul");
				968
				969	&function_begin("gcm_ghash_clmul");
				970	&mov ($Xip,&wparam(0));
				971	&mov ($Htbl,&wparam(1));
				972	&mov ($inp,&wparam(2));
				973	&mov ($len,&wparam(3));
				974
				975	&call (&label("pic"));
				976	&set_label("pic");
				977	&blindpop ($const);
				978	&lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));
				979
				980	&movdqu ($Xi,&QWP(0,$Xip));
				981	&movdqa ($T3,&QWP(0,$const));
				982	&movdqu ($Hkey,&QWP(0,$Htbl));
				983	&pshufb ($Xi,$T3);
				984
				985	&sub ($len,0x10);
				986	&jz (&label("odd_tail"));
				987
				988	#######
				989	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
				990	# [(HIi+1) + (HXi+1)] mod P =
				991	# [(HIi+1) + H^2(Ii+Xi)] mod P
				992	#
				993	&movdqu ($T1,&QWP(0,$inp)); # Ii
				994	&movdqu ($Xn,&QWP(16,$inp)); # Ii+1
				995	&pshufb ($T1,$T3);
				996	&pshufb ($Xn,$T3);
				997	&pxor ($Xi,$T1); # Ii+Xi
				998
				999	&clmul64x64_T2 ($Xhn,$Xn,$Hkey); # H*Ii+1
				1000	&movups ($Hkey,&QWP(16,$Htbl)); # load H^2
				1001
				1002	&lea ($inp,&DWP(32,$inp)); # i+=2
				1003	&sub ($len,0x20);
				1004	&jbe (&label("even_tail"));
				1005
				1006	&set_label("mod_loop");
				1007	&clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
				1008	&movdqu ($T1,&QWP(0,$inp)); # Ii
				1009	&movups ($Hkey,&QWP(0,$Htbl)); # load H
				1010
				1011	&pxor ($Xi,$Xn); # (HIi+1) + H^2(Ii+Xi)
				1012	&pxor ($Xhi,$Xhn);
				1013
				1014	&movdqu ($Xn,&QWP(16,$inp)); # Ii+1
				1015	&pshufb ($T1,$T3);
				1016	&pshufb ($Xn,$T3);
				1017
				1018	&movdqa ($T3,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1
				1019	&movdqa ($Xhn,$Xn);
				1020	&pxor ($Xhi,$T1); # "Ii+Xi", consume early
				1021
				1022	&movdqa ($T1,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase
				1023	&psllq ($Xi,1);
				1024	&pxor ($Xi,$T1); #
				1025	&psllq ($Xi,5); #
				1026	&pxor ($Xi,$T1); #
				1027	&pclmulqdq ($Xn,$Hkey,0x00); #######
				1028	&psllq ($Xi,57); #
				1029	&movdqa ($T2,$Xi); #
				1030	&pslldq ($Xi,8);
				1031	&psrldq ($T2,8); #
				1032	&pxor ($Xi,$T1);
				1033	&pshufd ($T1,$T3,0b01001110);
				1034	&pxor ($Xhi,$T2); #
				1035	&pxor ($T1,$T3);
				1036	&pshufd ($T3,$Hkey,0b01001110);
				1037	&pxor ($T3,$Hkey); #
				1038
				1039	&pclmulqdq ($Xhn,$Hkey,0x11); #######
				1040	&movdqa ($T2,$Xi); # 2nd phase
				1041	&psrlq ($Xi,5);
				1042	&pxor ($Xi,$T2); #
				1043	&psrlq ($Xi,1); #
				1044	&pxor ($Xi,$T2); #
				1045	&pxor ($T2,$Xhi);
				1046	&psrlq ($Xi,1); #
				1047	&pxor ($Xi,$T2); #
				1048
				1049	&pclmulqdq ($T1,$T3,0x00); #######
				1050	&movups ($Hkey,&QWP(16,$Htbl)); # load H^2
				1051	&xorps ($T1,$Xn); #
				1052	&xorps ($T1,$Xhn); #
				1053
				1054	&movdqa ($T3,$T1); #
				1055	&psrldq ($T1,8);
				1056	&pslldq ($T3,8); #
				1057	&pxor ($Xhn,$T1);
				1058	&pxor ($Xn,$T3); #
				1059	&movdqa ($T3,&QWP(0,$const));
				1060
				1061	&lea ($inp,&DWP(32,$inp));
				1062	&sub ($len,0x20);
				1063	&ja (&label("mod_loop"));
				1064
				1065	&set_label("even_tail");
				1066	&clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
				1067
				1068	&pxor ($Xi,$Xn); # (HIi+1) + H^2(Ii+Xi)
				1069	&pxor ($Xhi,$Xhn);
				1070
				1071	&reduction_alg9 ($Xhi,$Xi);
				1072
				1073	&test ($len,$len);
				1074	&jnz (&label("done"));
				1075
				1076	&movups ($Hkey,&QWP(0,$Htbl)); # load H
				1077	&set_label("odd_tail");
				1078	&movdqu ($T1,&QWP(0,$inp)); # Ii
				1079	&pshufb ($T1,$T3);
				1080	&pxor ($Xi,$T1); # Ii+Xi
				1081
				1082	&clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi)
				1083	&reduction_alg9 ($Xhi,$Xi);
				1084
				1085	&set_label("done");
				1086	&pshufb ($Xi,$T3);
				1087	&movdqu (&QWP(0,$Xip),$Xi);
				1088	&function_end("gcm_ghash_clmul");
				1089
				1090	} else { # Algorithm 5. Kept for reference purposes.
				1091
				1092	sub reduction_alg5 { # 19/16 times faster than Intel version
				1093	my ($Xhi,$Xi)=@_;
				1094
				1095	# <<1
				1096	&movdqa ($T1,$Xi); #
				1097	&movdqa ($T2,$Xhi);
				1098	&pslld ($Xi,1);
				1099	&pslld ($Xhi,1); #
				1100	&psrld ($T1,31);
				1101	&psrld ($T2,31); #
				1102	&movdqa ($T3,$T1);
				1103	&pslldq ($T1,4);
				1104	&psrldq ($T3,12); #
				1105	&pslldq ($T2,4);
				1106	&por ($Xhi,$T3); #
				1107	&por ($Xi,$T1);
				1108	&por ($Xhi,$T2); #
				1109
				1110	# 1st phase
				1111	&movdqa ($T1,$Xi);
				1112	&movdqa ($T2,$Xi);
				1113	&movdqa ($T3,$Xi); #
				1114	&pslld ($T1,31);
				1115	&pslld ($T2,30);
				1116	&pslld ($Xi,25); #
				1117	&pxor ($T1,$T2);
				1118	&pxor ($T1,$Xi); #
				1119	&movdqa ($T2,$T1); #
				1120	&pslldq ($T1,12);
				1121	&psrldq ($T2,4); #
				1122	&pxor ($T3,$T1);
				1123
				1124	# 2nd phase
				1125	&pxor ($Xhi,$T3); #
				1126	&movdqa ($Xi,$T3);
				1127	&movdqa ($T1,$T3);
				1128	&psrld ($Xi,1); #
				1129	&psrld ($T1,2);
				1130	&psrld ($T3,7); #
				1131	&pxor ($Xi,$T1);
				1132	&pxor ($Xhi,$T2);
				1133	&pxor ($Xi,$T3); #
				1134	&pxor ($Xi,$Xhi); #
				1135	}
				1136
				1137	&function_begin_B("gcm_init_clmul");
				1138	&mov ($Htbl,&wparam(0));
				1139	&mov ($Xip,&wparam(1));
				1140
				1141	&call (&label("pic"));
				1142	&set_label("pic");
				1143	&blindpop ($const);
				1144	&lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));
				1145
				1146	&movdqu ($Hkey,&QWP(0,$Xip));
				1147	&pshufd ($Hkey,$Hkey,0b01001110);# dword swap
				1148
				1149	# calculate H^2
				1150	&movdqa ($Xi,$Hkey);
				1151	&clmul64x64_T3 ($Xhi,$Xi,$Hkey);
				1152	&reduction_alg5 ($Xhi,$Xi);
				1153
				1154	&movdqu (&QWP(0,$Htbl),$Hkey); # save H
				1155	&movdqu (&QWP(16,$Htbl),$Xi); # save H^2
				1156
				1157	&ret ();
				1158	&function_end_B("gcm_init_clmul");
				1159
				1160	&function_begin_B("gcm_gmult_clmul");
				1161	&mov ($Xip,&wparam(0));
				1162	&mov ($Htbl,&wparam(1));
				1163
				1164	&call (&label("pic"));
				1165	&set_label("pic");
				1166	&blindpop ($const);
				1167	&lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));
				1168
				1169	&movdqu ($Xi,&QWP(0,$Xip));
				1170	&movdqa ($Xn,&QWP(0,$const));
				1171	&movdqu ($Hkey,&QWP(0,$Htbl));
				1172	&pshufb ($Xi,$Xn);
				1173
				1174	&clmul64x64_T3 ($Xhi,$Xi,$Hkey);
				1175	&reduction_alg5 ($Xhi,$Xi);
				1176
				1177	&pshufb ($Xi,$Xn);
				1178	&movdqu (&QWP(0,$Xip),$Xi);
				1179
				1180	&ret ();
				1181	&function_end_B("gcm_gmult_clmul");
				1182
				1183	&function_begin("gcm_ghash_clmul");
				1184	&mov ($Xip,&wparam(0));
				1185	&mov ($Htbl,&wparam(1));
				1186	&mov ($inp,&wparam(2));
				1187	&mov ($len,&wparam(3));
				1188
				1189	&call (&label("pic"));
				1190	&set_label("pic");
				1191	&blindpop ($const);
				1192	&lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const));
				1193
				1194	&movdqu ($Xi,&QWP(0,$Xip));
				1195	&movdqa ($T3,&QWP(0,$const));
				1196	&movdqu ($Hkey,&QWP(0,$Htbl));
				1197	&pshufb ($Xi,$T3);
				1198
				1199	&sub ($len,0x10);
				1200	&jz (&label("odd_tail"));
				1201
				1202	#######
				1203	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
				1204	# [(HIi+1) + (HXi+1)] mod P =
				1205	# [(HIi+1) + H^2(Ii+Xi)] mod P
				1206	#
				1207	&movdqu ($T1,&QWP(0,$inp)); # Ii
				1208	&movdqu ($Xn,&QWP(16,$inp)); # Ii+1
				1209	&pshufb ($T1,$T3);
				1210	&pshufb ($Xn,$T3);
				1211	&pxor ($Xi,$T1); # Ii+Xi
				1212
				1213	&clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1
				1214	&movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2
				1215
				1216	&sub ($len,0x20);
				1217	&lea ($inp,&DWP(32,$inp)); # i+=2
				1218	&jbe (&label("even_tail"));
				1219
				1220	&set_label("mod_loop");
				1221	&clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
				1222	&movdqu ($Hkey,&QWP(0,$Htbl)); # load H
				1223
				1224	&pxor ($Xi,$Xn); # (HIi+1) + H^2(Ii+Xi)
				1225	&pxor ($Xhi,$Xhn);
				1226
				1227	&reduction_alg5 ($Xhi,$Xi);
				1228
				1229	#######
				1230	&movdqa ($T3,&QWP(0,$const));
				1231	&movdqu ($T1,&QWP(0,$inp)); # Ii
				1232	&movdqu ($Xn,&QWP(16,$inp)); # Ii+1
				1233	&pshufb ($T1,$T3);
				1234	&pshufb ($Xn,$T3);
				1235	&pxor ($Xi,$T1); # Ii+Xi
				1236
				1237	&clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1
				1238	&movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2
				1239
				1240	&sub ($len,0x20);
				1241	&lea ($inp,&DWP(32,$inp));
				1242	&ja (&label("mod_loop"));
				1243
				1244	&set_label("even_tail");
				1245	&clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi)
				1246
				1247	&pxor ($Xi,$Xn); # (HIi+1) + H^2(Ii+Xi)
				1248	&pxor ($Xhi,$Xhn);
				1249
				1250	&reduction_alg5 ($Xhi,$Xi);
				1251
				1252	&movdqa ($T3,&QWP(0,$const));
				1253	&test ($len,$len);
				1254	&jnz (&label("done"));
				1255
				1256	&movdqu ($Hkey,&QWP(0,$Htbl)); # load H
				1257	&set_label("odd_tail");
				1258	&movdqu ($T1,&QWP(0,$inp)); # Ii
				1259	&pshufb ($T1,$T3);
				1260	&pxor ($Xi,$T1); # Ii+Xi
				1261
				1262	&clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi)
				1263	&reduction_alg5 ($Xhi,$Xi);
				1264
				1265	&movdqa ($T3,&QWP(0,$const));
				1266	&set_label("done");
				1267	&pshufb ($Xi,$T3);
				1268	&movdqu (&QWP(0,$Xip),$Xi);
				1269	&function_end("gcm_ghash_clmul");
				1270
				1271	}
				1272
				1273	&set_label("bswap",64);
				1274	&data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
				1275	&data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial
				1276	}} # $sse2
				1277
				1278	&set_label("rem_4bit",64);
				1279	&data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
				1280	&data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
				1281	&data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
				1282	&data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
				1283	&set_label("rem_8bit",64);
				1284	&data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
				1285	&data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
				1286	&data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E);
				1287	&data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E);
				1288	&data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E);
				1289	&data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E);
				1290	&data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E);
				1291	&data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E);
				1292	&data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE);
				1293	&data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE);
				1294	&data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE);
				1295	&data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE);
				1296	&data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E);
				1297	&data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E);
				1298	&data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE);
				1299	&data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE);
				1300	&data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E);
				1301	&data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E);
				1302	&data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E);
				1303	&data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E);
				1304	&data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E);
				1305	&data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E);
				1306	&data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E);
				1307	&data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E);
				1308	&data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE);
				1309	&data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE);
				1310	&data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE);
				1311	&data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE);
				1312	&data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E);
				1313	&data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
				1314	&data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
				1315	&data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
				1316	}}} # !$x86only
				1317
				1318	&asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
				1319	&asm_finish();
				1320
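# The rem_4bit and rem_8bit tables above follow a regular pattern: each
# entry is the carry-less (GF(2)) product of its index with the constant
# 0x1C2, and the rem_4bit entries are additionally shifted left by 4
# before the <<$S applied in the data_word lines. The following is a
# minimal sketch that regenerates the values for cross-checking. It is
# not part of the module, and clmul_0x1c2 is a hypothetical helper name
# introduced only for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Carry-less product of a small index with 0x1C2: XOR together one
# shifted copy of the constant per set bit of the index.
# (Hypothetical helper, not a routine from ghash-x86.pl.)
sub clmul_0x1c2 {
    my ($idx) = @_;
    my $r = 0;
    for my $bit (0 .. 7) {
        $r ^= 0x1C2 << $bit if ($idx >> $bit) & 1;
    }
    return $r;
}

# 256 entries, one per byte value, as in the rem_8bit data_short lines.
my @rem_8bit = map { clmul_0x1c2($_) } 0 .. 255;

# 16 entries, one per nibble value; same products shifted left by 4,
# i.e. the values written before "<<$S" in the rem_4bit data_word lines.
my @rem_4bit = map { clmul_0x1c2($_) << 4 } 0 .. 15;

printf "rem_4bit: %s\n",
    join(",", map { sprintf "0x%04X", $_ } @rem_4bit);
printf "rem_8bit[0..7]: %s\n",
    join(",", map { sprintf "0x%04X", $_ } @rem_8bit[0 .. 7]);
```

# Comparing the printed values with the data_word/data_short lines is a
# quick way to catch a transcription error in either table.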
				1321	# A question was raised about choice of vanilla MMX. Or rather why wasn't
				1322	# SSE2 chosen instead? In addition to the fact that MMX runs on legacy
				1323	# CPUs such as PIII, "4-bit" MMX version was observed to provide better
				1324	# performance than corresponding SSE2 one even on contemporary CPUs.
				1325	# SSE2 results were provided by Peter-Michael Hager. He maintains SSE2
				1326	# implementation featuring full range of lookup-table sizes, but with
				1327	# per-invocation lookup table setup. Latter means that table size is
				1328	# chosen depending on how much data is to be hashed in every given call,
				1329	# more data - larger table. Best reported result for Core2 is ~4 cycles
				1330	# per processed byte out of 64KB block. This number accounts even for
				1331	# 64KB table setup overhead. As discussed in gcm128.c we choose to be
				1332	# more conservative in respect to lookup table sizes, but how do the
				1333	# results compare? Minimalistic "256B" MMX version delivers ~11 cycles
				1334	# on same platform. As also discussed in gcm128.c, next in line "8-bit
				1335	# Shoup's" or "4KB" method should deliver twice the performance of
				1336	# "256B" one, in other words not worse than ~6 cycles per byte. It
				1337	# should also be noted that in SSE2 case improvement can be
				1338	# "super-linear," i.e. more than twice, mostly because >>8 maps to single
				1339	# instruction on SSE2 register. This is unlike "4-bit" case when >>4
				1340	# maps to same amount of instructions in both MMX and SSE2 cases.
				1341	# Bottom line is that switch to SSE2 is considered to be justifiable
				1342	# only in case we choose to implement "8-bit" method...
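
# The two-block aggregation used in gcm_ghash_clmul,
#   Xi+2 = [(H*Ii+1) + H^2*(Ii+Xi)] mod P,
# relies only on field arithmetic, so it can be checked with a slow
# bit-by-bit multiplication. The sketch below is an illustration under
# stated assumptions, not the module's method: it uses Math::BigInt and
# the plain (non-reflected) bit order with P = x^128+x^7+x^2+x+1, so its
# numeric values do not match GCM's bit-reflected representation. The
# helper name gf128_mul and all input values are arbitrary choices made
# for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Math::BigInt;

# Field polynomial x^128 + x^7 + x^2 + x + 1, plain bit order (assumption:
# GCM itself uses a bit-reflected layout, which this sketch ignores).
my $P = Math::BigInt->new("0x100000000000000000000000000000087");

# Schoolbook multiplication in GF(2^128): carry-less multiply, then reduce.
sub gf128_mul {
    my ($x, $y) = @_;
    my $r = Math::BigInt->bzero();
    for my $i (0 .. 127) {
        $r = $r ^ ($x << $i) if (($y >> $i) & 1)->is_one();
    }
    # Clear bits 254..128 by XORing in shifted copies of P.
    for my $i (reverse 128 .. 254) {
        $r = $r ^ ($P << ($i - 128)) if (($r >> $i) & 1)->is_one();
    }
    return $r;
}

# Arbitrary example values: hash key H, running value X0, input blocks I0, I1.
my $H  = Math::BigInt->new("0x66e94bd4ef8a2c3b884cfa59ca342b2e");
my $X0 = Math::BigInt->new("0x0123456789abcdeffedcba9876543210");
my $I0 = Math::BigInt->new("0x00112233445566778899aabbccddeeff");
my $I1 = Math::BigInt->new("0xffeeddccbbaa99887766554433221100");

# Serial form: one multiplication by H per block.
my $X1     = gf128_mul($H, $I0 ^ $X0);
my $serial = gf128_mul($H, $I1 ^ $X1);

# Aggregated form: H*I1 + H^2*(I0+X0), the shape used in the mod_loop.
my $H2         = gf128_mul($H, $H);
my $aggregated = gf128_mul($H, $I1) ^ gf128_mul($H2, $I0 ^ $X0);

print $serial == $aggregated ? "match\n" : "MISMATCH\n";
```

# Both forms need two multiplications per pair of blocks. In the assembler
# above the aggregated form reduces once per pair, because the unreduced
# products are XORed together before reduction_alg9 is applied.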