From: Taylor Simpson
To: qemu-devel@nongnu.org
Cc: tsimpson@quicinc.com, richard.henderson@linaro.org, philmd@linaro.org,
    ale@rev.ng, anjo@rev.ng, bcain@quicinc.com, quic_mathbern@quicinc.com
Subject: [PATCH v4 08/13] Hexagon (tests/tcg/hexagon) Remove __builtin from scatter_gather
Date: Tue, 24 Jan 2023 18:42:10 -0800
Message-Id: <20230125024215.10430-9-tsimpson@quicinc.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230125024215.10430-1-tsimpson@quicinc.com>
References: <20230125024215.10430-1-tsimpson@quicinc.com>

Replace __builtin_* with inline assembly
    The __builtin's are subject to change with different compiler
    releases, so they might break.
Mark arrays as aligned when they are accessed as HVX vectors.
Clean up comments.

Signed-off-by: Taylor Simpson
---
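Note for reviewers (illustration only, not part of the diff to be applied):
every conversion below follows the same shape.  Instead of copying the
operands into HVX_Vector temporaries and calling a __builtin_HEXAGON_V6_*
intrinsic, the aligned array is loaded with vmem inside a single asm
statement and the scatter/gather instruction is issued directly.  A minimal
sketch of that pattern, modeled on vector_scatter_16() in the diff; the
names offs, vals, and scatter_16_sketch are hypothetical:

    #include <stddef.h>

    unsigned short offs[64] __attribute__((aligned(128)));  /* half-word offsets */
    unsigned short vals[64] __attribute__((aligned(128)));  /* half-word values */

    /* scatter 64 half-words into the VTCM region at base/len */
    static void scatter_16_sketch(void *base, size_t len)
    {
        asm ("m0 = %1\n\t"                        /* region length */
             "v0 = vmem(%2 + #0)\n\t"             /* load the aligned offsets */
             "v1 = vmem(%3 + #0)\n\t"             /* load the aligned values */
             "vscatter(%0, m0, v0.h).h = v1\n\t"  /* scatter into the region */
             : : "r"(base), "r"(len), "r"(offs), "r"(vals)
             : "m0", "v0", "v1", "memory");
    }
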
 tests/tcg/hexagon/scatter_gather.c | 513 +++++++++++++++--------------
 1 file changed, 271 insertions(+), 242 deletions(-)

diff --git a/tests/tcg/hexagon/scatter_gather.c b/tests/tcg/hexagon/scatter_gather.c
index b93eb18133..bf8b5e0317 100644
--- a/tests/tcg/hexagon/scatter_gather.c
+++ b/tests/tcg/hexagon/scatter_gather.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights Reserved.
+ * Copyright(c) 2019-2023 Qualcomm Innovation Center, Inc. All Rights Reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -40,47 +40,6 @@ typedef long HVX_VectorPair __attribute__((__vector_size__(256)))
 typedef long HVX_VectorPred __attribute__((__vector_size__(128)))
                             __attribute__((aligned(128)));
 
-#define VSCATTER_16(BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermh_128B((int)BASE, RGN, OFF, VALS)
-#define VSCATTER_16_MASKED(MASK, BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermhq_128B(MASK, (int)BASE, RGN, OFF, VALS)
-#define VSCATTER_32(BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermw_128B((int)BASE, RGN, OFF, VALS)
-#define VSCATTER_32_MASKED(MASK, BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermwq_128B(MASK, (int)BASE, RGN, OFF, VALS)
-#define VSCATTER_16_32(BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermhw_128B((int)BASE, RGN, OFF, VALS)
-#define VSCATTER_16_32_MASKED(MASK, BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermhwq_128B(MASK, (int)BASE, RGN, OFF, VALS)
-#define VSCATTER_16_ACC(BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermh_add_128B((int)BASE, RGN, OFF, VALS)
-#define VSCATTER_32_ACC(BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermw_add_128B((int)BASE, RGN, OFF, VALS)
-#define VSCATTER_16_32_ACC(BASE, RGN, OFF, VALS) \
-    __builtin_HEXAGON_V6_vscattermhw_add_128B((int)BASE, RGN, OFF, VALS)
-
-#define VGATHER_16(DSTADDR, BASE, RGN, OFF) \
-    __builtin_HEXAGON_V6_vgathermh_128B(DSTADDR, (int)BASE, RGN, OFF)
-#define VGATHER_16_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
-    __builtin_HEXAGON_V6_vgathermhq_128B(DSTADDR, MASK, (int)BASE, RGN, OFF)
-#define VGATHER_32(DSTADDR, BASE, RGN, OFF) \
-    __builtin_HEXAGON_V6_vgathermw_128B(DSTADDR, (int)BASE, RGN, OFF)
-#define VGATHER_32_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
-    __builtin_HEXAGON_V6_vgathermwq_128B(DSTADDR, MASK, (int)BASE, RGN, OFF)
-#define VGATHER_16_32(DSTADDR, BASE, RGN, OFF) \
-    __builtin_HEXAGON_V6_vgathermhw_128B(DSTADDR, (int)BASE, RGN, OFF)
-#define VGATHER_16_32_MASKED(DSTADDR, MASK, BASE, RGN, OFF) \
-    __builtin_HEXAGON_V6_vgathermhwq_128B(DSTADDR, MASK, (int)BASE, RGN, OFF)
-
-#define VSHUFF_H(V) \
-    __builtin_HEXAGON_V6_vshuffh_128B(V)
-#define VSPLAT_H(X) \
-    __builtin_HEXAGON_V6_lvsplath_128B(X)
-#define VAND_VAL(PRED, VAL) \
-    __builtin_HEXAGON_V6_vandvrt_128B(PRED, VAL)
-#define VDEAL_H(V) \
-    __builtin_HEXAGON_V6_vdealh_128B(V)
-
 int err;
 
 /* define the number of rows/cols in a square matrix */
@@ -108,22 +67,22 @@ unsigned short vscatter16_32_ref[SCATTER_BUFFER_SIZE];
 unsigned short vgather16_32_ref[MATRIX_SIZE];
 
 /* declare the arrays of offsets */
-unsigned short half_offsets[MATRIX_SIZE];
-unsigned int word_offsets[MATRIX_SIZE];
+unsigned short half_offsets[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned int word_offsets[MATRIX_SIZE] __attribute__((aligned(128)));
 
 /* declare the arrays of values */
-unsigned short half_values[MATRIX_SIZE];
-unsigned short half_values_acc[MATRIX_SIZE];
-unsigned short half_values_masked[MATRIX_SIZE];
-unsigned int word_values[MATRIX_SIZE];
-unsigned int word_values_acc[MATRIX_SIZE];
-unsigned int word_values_masked[MATRIX_SIZE];
+unsigned short half_values[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned short half_values_acc[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned short half_values_masked[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned int word_values[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned int word_values_acc[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned int word_values_masked[MATRIX_SIZE] __attribute__((aligned(128)));
 
 /* declare the arrays of predicates */
-unsigned short half_predicates[MATRIX_SIZE];
-unsigned int word_predicates[MATRIX_SIZE];
+unsigned short half_predicates[MATRIX_SIZE] __attribute__((aligned(128)));
+unsigned int word_predicates[MATRIX_SIZE] __attribute__((aligned(128)));
 
-/* make this big enough for all the intrinsics */
+/* make this big enough for all the operations */
 const size_t region_len = sizeof(vtcm);
 
 /* optionally add sync instructions */
@@ -261,164 +220,201 @@ void create_offsets_values_preds_16_32(void)
     }
 }
 
-/* scatter the 16 bit elements using intrinsics */
+/* scatter the 16 bit elements using HVX */
 void vector_scatter_16(void)
 {
-    /* copy the offsets and values to vectors */
-    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
-    HVX_Vector values = *(HVX_Vector *)half_values;
-
-    VSCATTER_16(&vtcm.vscatter16, region_len, offsets, values);
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%3 + #0)\n\t"
+         "vscatter(%0, m0, v0.h).h = v1\n\t"
+         : : "r"(vtcm.vscatter16), "r"(region_len),
+             "r"(half_offsets), "r"(half_values)
+         : "m0", "v0", "v1", "memory");
 
     sync_scatter(vtcm.vscatter16);
 }
 
-/* scatter-accumulate the 16 bit elements using intrinsics */
+/* scatter-accumulate the 16 bit elements using HVX */
 void vector_scatter_16_acc(void)
 {
-    /* copy the offsets and values to vectors */
-    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
-    HVX_Vector values = *(HVX_Vector *)half_values_acc;
-
-    VSCATTER_16_ACC(&vtcm.vscatter16, region_len, offsets, values);
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%3 + #0)\n\t"
+         "vscatter(%0, m0, v0.h).h += v1\n\t"
+         : : "r"(vtcm.vscatter16), "r"(region_len),
+             "r"(half_offsets), "r"(half_values_acc)
+         : "m0", "v0", "v1", "memory");
 
     sync_scatter(vtcm.vscatter16);
 }
 
-/* scatter the 16 bit elements using intrinsics */
+/* masked scatter the 16 bit elements using HVX */
 void vector_scatter_16_masked(void)
 {
-    /* copy the offsets and values to vectors */
-    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
-    HVX_Vector values = *(HVX_Vector *)half_values_masked;
-    HVX_Vector pred_reg = *(HVX_Vector *)half_predicates;
-    HVX_VectorPred preds = VAND_VAL(pred_reg, ~0);
-
-    VSCATTER_16_MASKED(preds, &vtcm.vscatter16, region_len, offsets, values);
+    asm ("r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "v1 = vmem(%4 + #0)\n\t"
+         "if (q0) vscatter(%1, m0, v0.h).h = v1\n\t"
+         : : "r"(half_predicates), "r"(vtcm.vscatter16), "r"(region_len),
+             "r"(half_offsets), "r"(half_values_masked)
+         : "r1", "q0", "m0", "q0", "v0", "v1", "memory");
 
     sync_scatter(vtcm.vscatter16);
 }
 
-/* scatter the 32 bit elements using intrinsics */
+/* scatter the 32 bit elements using HVX */
 void vector_scatter_32(void)
 {
-    /* copy the offsets and values to vectors */
-    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
-    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
-    HVX_Vector valueslo = *(HVX_Vector *)word_values;
-    HVX_Vector valueshi = *(HVX_Vector *)&word_values[MATRIX_SIZE / 2];
-
-    VSCATTER_32(&vtcm.vscatter32, region_len, offsetslo, valueslo);
-    VSCATTER_32(&vtcm.vscatter32, region_len, offsetshi, valueshi);
+    HVX_Vector *offsetslo = (HVX_Vector *)word_offsets;
+    HVX_Vector *offsetshi = (HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector *valueslo = (HVX_Vector *)word_values;
+    HVX_Vector *valueshi = (HVX_Vector *)&word_values[MATRIX_SIZE / 2];
+
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%3 + #0)\n\t"
+         "vscatter(%0, m0, v0.w).w = v1\n\t"
+         : : "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetslo), "r"(valueslo)
+         : "m0", "v0", "v1", "memory");
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%3 + #0)\n\t"
+         "vscatter(%0, m0, v0.w).w = v1\n\t"
+         : : "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetshi), "r"(valueshi)
+         : "m0", "v0", "v1", "memory");
 
     sync_scatter(vtcm.vscatter32);
 }
 
-/* scatter-acc the 32 bit elements using intrinsics */
+/* scatter-accumulate the 32 bit elements using HVX */
 void vector_scatter_32_acc(void)
 {
-    /* copy the offsets and values to vectors */
-    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
-    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
-    HVX_Vector valueslo = *(HVX_Vector *)word_values_acc;
-    HVX_Vector valueshi = *(HVX_Vector *)&word_values_acc[MATRIX_SIZE / 2];
-
-    VSCATTER_32_ACC(&vtcm.vscatter32, region_len, offsetslo, valueslo);
-    VSCATTER_32_ACC(&vtcm.vscatter32, region_len, offsetshi, valueshi);
+    HVX_Vector *offsetslo = (HVX_Vector *)word_offsets;
+    HVX_Vector *offsetshi = (HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector *valueslo = (HVX_Vector *)word_values_acc;
+    HVX_Vector *valueshi = (HVX_Vector *)&word_values_acc[MATRIX_SIZE / 2];
+
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%3 + #0)\n\t"
+         "vscatter(%0, m0, v0.w).w += v1\n\t"
+         : : "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetslo), "r"(valueslo)
+         : "m0", "v0", "v1", "memory");
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%3 + #0)\n\t"
+         "vscatter(%0, m0, v0.w).w += v1\n\t"
+         : : "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetshi), "r"(valueshi)
+         : "m0", "v0", "v1", "memory");
 
     sync_scatter(vtcm.vscatter32);
 }
 
-/* scatter the 32 bit elements using intrinsics */
+/* masked scatter the 32 bit elements using HVX */
 void vector_scatter_32_masked(void)
 {
-    /* copy the offsets and values to vectors */
-    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
-    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
-    HVX_Vector valueslo = *(HVX_Vector *)word_values_masked;
-    HVX_Vector valueshi = *(HVX_Vector *)&word_values_masked[MATRIX_SIZE / 2];
-    HVX_Vector pred_reglo = *(HVX_Vector *)word_predicates;
-    HVX_Vector pred_reghi = *(HVX_Vector *)&word_predicates[MATRIX_SIZE / 2];
-    HVX_VectorPred predslo = VAND_VAL(pred_reglo, ~0);
-    HVX_VectorPred predshi = VAND_VAL(pred_reghi, ~0);
-
-    VSCATTER_32_MASKED(predslo, &vtcm.vscatter32, region_len, offsetslo,
-                       valueslo);
-    VSCATTER_32_MASKED(predshi, &vtcm.vscatter32, region_len, offsetshi,
-                       valueshi);
+    HVX_Vector *offsetslo = (HVX_Vector *)word_offsets;
+    HVX_Vector *offsetshi = (HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector *valueslo = (HVX_Vector *)word_values_masked;
+    HVX_Vector *valueshi = (HVX_Vector *)&word_values_masked[MATRIX_SIZE / 2];
+    HVX_Vector *predslo = (HVX_Vector *)word_predicates;
+    HVX_Vector *predshi = (HVX_Vector *)&word_predicates[MATRIX_SIZE / 2];
+
+    asm ("r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "v1 = vmem(%4 + #0)\n\t"
+         "if (q0) vscatter(%1, m0, v0.w).w = v1\n\t"
+         : : "r"(predslo), "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetslo), "r"(valueslo)
+         : "r1", "q0", "m0", "q0", "v0", "v1", "memory");
+    asm ("r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "v1 = vmem(%4 + #0)\n\t"
+         "if (q0) vscatter(%1, m0, v0.w).w = v1\n\t"
+         : : "r"(predshi), "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetshi), "r"(valueshi)
+         : "r1", "q0", "m0", "q0", "v0", "v1", "memory");
 
-    sync_scatter(vtcm.vscatter16);
+    sync_scatter(vtcm.vscatter32);
 }
 
-/* scatter the 16 bit elements with 32 bit offsets using intrinsics */
+/* scatter the 16 bit elements with 32 bit offsets using HVX */
 void vector_scatter_16_32(void)
 {
-    HVX_VectorPair offsets;
-    HVX_Vector values;
-
-    /* get the word offsets in a vector pair */
-    offsets = *(HVX_VectorPair *)word_offsets;
-
-    /* these values need to be shuffled for the scatter */
-    values = *(HVX_Vector *)half_values;
-    values = VSHUFF_H(values);
-
-    VSCATTER_16_32(&vtcm.vscatter16_32, region_len, offsets, values);
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%2 + #1)\n\t"
+         "v2 = vmem(%3 + #0)\n\t"
+         "v2.h = vshuff(v2.h)\n\t"  /* shuffle the values for the scatter */
+         "vscatter(%0, m0, v1:0.w).h = v2\n\t"
+         : : "r"(vtcm.vscatter16_32), "r"(region_len),
+             "r"(word_offsets), "r"(half_values)
+         : "m0", "v0", "v1", "v2", "memory");
 
     sync_scatter(vtcm.vscatter16_32);
 }
 
-/* scatter-acc the 16 bit elements with 32 bit offsets using intrinsics */
+/* scatter-accumulate the 16 bit elements with 32 bit offsets using HVX */
 void vector_scatter_16_32_acc(void)
 {
-    HVX_VectorPair offsets;
-    HVX_Vector values;
-
-    /* get the word offsets in a vector pair */
-    offsets = *(HVX_VectorPair *)word_offsets;
-
-    /* these values need to be shuffled for the scatter */
-    values = *(HVX_Vector *)half_values_acc;
-    values = VSHUFF_H(values);
-
-    VSCATTER_16_32_ACC(&vtcm.vscatter16_32, region_len, offsets, values);
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%2 + #1)\n\t"
+         "v2 = vmem(%3 + #0)\n\t" \
+         "v2.h = vshuff(v2.h)\n\t"  /* shuffle the values for the scatter */
+         "vscatter(%0, m0, v1:0.w).h += v2\n\t"
+         : : "r"(vtcm.vscatter16_32), "r"(region_len),
+             "r"(word_offsets), "r"(half_values_acc)
+         : "m0", "v0", "v1", "v2", "memory");
 
     sync_scatter(vtcm.vscatter16_32);
 }
 
-/* masked scatter the 16 bit elements with 32 bit offsets using intrinsics */
+/* masked scatter the 16 bit elements with 32 bit offsets using HVX */
 void vector_scatter_16_32_masked(void)
 {
-    HVX_VectorPair offsets;
-    HVX_Vector values;
-    HVX_Vector pred_reg;
-
-    /* get the word offsets in a vector pair */
-    offsets = *(HVX_VectorPair *)word_offsets;
-
-    /* these values need to be shuffled for the scatter */
-    values = *(HVX_Vector *)half_values_masked;
-    values = VSHUFF_H(values);
-
-    pred_reg = *(HVX_Vector *)half_predicates;
-    pred_reg = VSHUFF_H(pred_reg);
-    HVX_VectorPred preds = VAND_VAL(pred_reg, ~0);
-
-    VSCATTER_16_32_MASKED(preds, &vtcm.vscatter16_32, region_len, offsets,
-                          values);
+    asm ("r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "v0.h = vshuff(v0.h)\n\t"  /* shuffle the predicates */
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "v1 = vmem(%3 + #1)\n\t"
+         "v2 = vmem(%4 + #0)\n\t" \
+         "v2.h = vshuff(v2.h)\n\t"  /* shuffle the values for the scatter */
+         "if (q0) vscatter(%1, m0, v1:0.w).h = v2\n\t"
+         : : "r"(half_predicates), "r"(vtcm.vscatter16_32), "r"(region_len),
+             "r"(word_offsets), "r"(half_values_masked)
+         : "r1", "q0", "m0", "v0", "v1", "v2", "memory");
 
     sync_scatter(vtcm.vscatter16_32);
 }
 
-/* gather the elements from the scatter16 buffer */
+/* gather the elements from the scatter16 buffer using HVX */
 void vector_gather_16(void)
 {
-    HVX_Vector *vgather = (HVX_Vector *)&vtcm.vgather16;
-    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
-
-    VGATHER_16(vgather, &vtcm.vscatter16, region_len, offsets);
-
-    sync_gather(vgather);
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "{ vtmp.h = vgather(%0, m0, v0.h).h\n\t"
+         "  vmem(%3 + #0) = vtmp.new }\n\t"
+         : : "r"(vtcm.vscatter16), "r"(region_len),
+             "r"(half_offsets), "r"(vtcm.vgather16)
+         : "m0", "v0", "memory");
+
+    sync_gather(vtcm.vgather16);
 }
 
 static unsigned short gather_16_masked_init(void)
@@ -427,31 +423,51 @@ static unsigned short gather_16_masked_init(void)
     return letter | (letter << 8);
 }
 
+/* masked gather the elements from the scatter16 buffer using HVX */
 void vector_gather_16_masked(void)
 {
-    HVX_Vector *vgather = (HVX_Vector *)&vtcm.vgather16;
-    HVX_Vector offsets = *(HVX_Vector *)half_offsets;
-    HVX_Vector pred_reg = *(HVX_Vector *)half_predicates;
-    HVX_VectorPred preds = VAND_VAL(pred_reg, ~0);
-
-    *vgather = VSPLAT_H(gather_16_masked_init());
-    VGATHER_16_MASKED(vgather, preds, &vtcm.vscatter16, region_len, offsets);
-
-    sync_gather(vgather);
+    unsigned short init = gather_16_masked_init();
+
+    asm ("v0.h = vsplat(%5)\n\t"
+         "vmem(%4 + #0) = v0\n\t"  /* initialize the write area */
+         "r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "{ if (q0) vtmp.h = vgather(%1, m0, v0.h).h\n\t"
+         "  vmem(%4 + #0) = vtmp.new }\n\t"
+         : : "r"(half_predicates), "r"(vtcm.vscatter16), "r"(region_len),
+             "r"(half_offsets), "r"(vtcm.vgather16), "r"(init)
+         : "r1", "q0", "m0", "v0", "memory");
+
+    sync_gather(vtcm.vgather16);
 }
 
-/* gather the elements from the scatter32 buffer */
+/* gather the elements from the scatter32 buffer using HVX */
 void vector_gather_32(void)
 {
-    HVX_Vector *vgatherlo = (HVX_Vector *)&vtcm.vgather32;
-    HVX_Vector *vgatherhi =
-        (HVX_Vector *)((int)&vtcm.vgather32 + (MATRIX_SIZE * 2));
-    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
-    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
-
-    VGATHER_32(vgatherlo, &vtcm.vscatter32, region_len, offsetslo);
-    VGATHER_32(vgatherhi, &vtcm.vscatter32, region_len, offsetshi);
+    HVX_Vector *vgatherlo = (HVX_Vector *)vtcm.vgather32;
+    HVX_Vector *vgatherhi = (HVX_Vector *)&vtcm.vgather32[MATRIX_SIZE / 2];
+    HVX_Vector *offsetslo = (HVX_Vector *)word_offsets;
+    HVX_Vector *offsetshi = (HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "{ vtmp.w = vgather(%0, m0, v0.w).w\n\t"
+         "  vmem(%3 + #0) = vtmp.new }\n\t"
+         : : "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetslo), "r"(vgatherlo)
+         : "m0", "v0", "memory");
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "{ vtmp.w = vgather(%0, m0, v0.w).w\n\t"
+         "  vmem(%3 + #0) = vtmp.new }\n\t"
+         : : "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetshi), "r"(vgatherhi)
+         : "m0", "v0", "memory");
 
+    sync_gather(vgatherlo);
     sync_gather(vgatherhi);
 }
 
@@ -461,79 +477,88 @@ static unsigned int gather_32_masked_init(void)
     return letter | (letter << 8) | (letter << 16) | (letter << 24);
 }
 
+/* masked gather the elements from the scatter32 buffer using HVX */
 void vector_gather_32_masked(void)
 {
-    HVX_Vector *vgatherlo = (HVX_Vector *)&vtcm.vgather32;
-    HVX_Vector *vgatherhi =
-        (HVX_Vector *)((int)&vtcm.vgather32 + (MATRIX_SIZE * 2));
-    HVX_Vector offsetslo = *(HVX_Vector *)word_offsets;
-    HVX_Vector offsetshi = *(HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
-    HVX_Vector pred_reglo = *(HVX_Vector *)word_predicates;
-    HVX_VectorPred predslo = VAND_VAL(pred_reglo, ~0);
-    HVX_Vector pred_reghi = *(HVX_Vector *)&word_predicates[MATRIX_SIZE / 2];
-    HVX_VectorPred predshi = VAND_VAL(pred_reghi, ~0);
-
-    *vgatherlo = VSPLAT_H(gather_32_masked_init());
-    *vgatherhi = VSPLAT_H(gather_32_masked_init());
-    VGATHER_32_MASKED(vgatherlo, predslo, &vtcm.vscatter32, region_len,
-                      offsetslo);
-    VGATHER_32_MASKED(vgatherhi, predshi, &vtcm.vscatter32, region_len,
-                      offsetshi);
+    unsigned int init = gather_32_masked_init();
+    HVX_Vector *vgatherlo = (HVX_Vector *)vtcm.vgather32;
+    HVX_Vector *vgatherhi = (HVX_Vector *)&vtcm.vgather32[MATRIX_SIZE / 2];
+    HVX_Vector *offsetslo = (HVX_Vector *)word_offsets;
+    HVX_Vector *offsetshi = (HVX_Vector *)&word_offsets[MATRIX_SIZE / 2];
+    HVX_Vector *predslo = (HVX_Vector *)word_predicates;
+    HVX_Vector *predshi = (HVX_Vector *)&word_predicates[MATRIX_SIZE / 2];
+
+    asm ("v0.h = vsplat(%5)\n\t"
+         "vmem(%4 + #0) = v0\n\t"  /* initialize the write area */
+         "r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "{ if (q0) vtmp.w = vgather(%1, m0, v0.w).w\n\t"
+         "  vmem(%4 + #0) = vtmp.new }\n\t"
+         : : "r"(predslo), "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetslo), "r"(vgatherlo), "r"(init)
+         : "r1", "q0", "m0", "v0", "memory");
+    asm ("v0.h = vsplat(%5)\n\t"
+         "vmem(%4 + #0) = v0\n\t"  /* initialize the write area */
+         "r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "{ if (q0) vtmp.w = vgather(%1, m0, v0.w).w\n\t"
+         "  vmem(%4 + #0) = vtmp.new }\n\t"
+         : : "r"(predshi), "r"(vtcm.vscatter32), "r"(region_len),
+             "r"(offsetshi), "r"(vgatherhi), "r"(init)
+         : "r1", "q0", "m0", "v0", "memory");
 
     sync_gather(vgatherlo);
     sync_gather(vgatherhi);
 }
 
-/* gather the elements from the scatter16_32 buffer */
+/* gather the elements from the scatter16_32 buffer using HVX */
 void vector_gather_16_32(void)
 {
-    HVX_Vector *vgather;
-    HVX_VectorPair offsets;
-    HVX_Vector values;
-
-    /* get the vtcm address to gather from */
-    vgather = (HVX_Vector *)&vtcm.vgather16_32;
-
-    /* get the word offsets in a vector pair */
-    offsets = *(HVX_VectorPair *)word_offsets;
-
-    VGATHER_16_32(vgather, &vtcm.vscatter16_32, region_len, offsets);
-
-    /* deal the elements to get the order back */
-    values = *(HVX_Vector *)vgather;
-    values = VDEAL_H(values);
-
-    /* write it back to vtcm address */
-    *(HVX_Vector *)vgather = values;
+    asm ("m0 = %1\n\t"
+         "v0 = vmem(%2 + #0)\n\t"
+         "v1 = vmem(%2 + #1)\n\t"
+         "{ vtmp.h = vgather(%0, m0, v1:0.w).h\n\t"
+         "  vmem(%3 + #0) = vtmp.new }\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "v0.h = vdeal(v0.h)\n\t"  /* deal the elements to get the order back */
+         "vmem(%3 + #0) = v0\n\t"
+         : : "r"(vtcm.vscatter16_32), "r"(region_len),
+             "r"(word_offsets), "r"(vtcm.vgather16_32)
+         : "m0", "v0", "v1", "memory");
+
+    sync_gather(vtcm.vgather16_32);
 }
 
+/* masked gather the elements from the scatter16_32 buffer using HVX */
 void vector_gather_16_32_masked(void)
 {
-    HVX_Vector *vgather;
-    HVX_VectorPair offsets;
-    HVX_Vector pred_reg;
-    HVX_VectorPred preds;
-    HVX_Vector values;
-
-    /* get the vtcm address to gather from */
-    vgather = (HVX_Vector *)&vtcm.vgather16_32;
-
-    /* get the word offsets in a vector pair */
-    offsets = *(HVX_VectorPair *)word_offsets;
-    pred_reg = *(HVX_Vector *)half_predicates;
-    pred_reg = VSHUFF_H(pred_reg);
-    preds = VAND_VAL(pred_reg, ~0);
-
-    *vgather = VSPLAT_H(gather_16_masked_init());
-    VGATHER_16_32_MASKED(vgather, preds, &vtcm.vscatter16_32, region_len,
-                         offsets);
-
-    /* deal the elements to get the order back */
-    values = *(HVX_Vector *)vgather;
-    values = VDEAL_H(values);
-
-    /* write it back to vtcm address */
-    *(HVX_Vector *)vgather = values;
+    unsigned short init = gather_16_masked_init();
+
+    asm ("v0.h = vsplat(%5)\n\t"
+         "vmem(%4 + #0) = v0\n\t"  /* initialize the write area */
+         "r1 = #-1\n\t"
+         "v0 = vmem(%0 + #0)\n\t"
+         "v0.h = vshuff(v0.h)\n\t"  /* shuffle the predicates */
+         "q0 = vand(v0, r1)\n\t"
+         "m0 = %2\n\t"
+         "v0 = vmem(%3 + #0)\n\t"
+         "v1 = vmem(%3 + #1)\n\t"
+         "{ if (q0) vtmp.h = vgather(%1, m0, v1:0.w).h\n\t"
+         "  vmem(%4 + #0) = vtmp.new }\n\t"
+         "v0 = vmem(%4 + #0)\n\t"
+         "v0.h = vdeal(v0.h)\n\t"  /* deal the elements to get the order back */
+         "vmem(%4 + #0) = v0\n\t"
+         : : "r"(half_predicates), "r"(vtcm.vscatter16_32), "r"(region_len),
+             "r"(word_offsets), "r"(vtcm.vgather16_32), "r"(init)
+         : "r1", "q0", "m0", "v0", "v1", "memory");
+
+    sync_gather(vtcm.vgather16_32);
 }
 
 static void check_buffer(const char *name, void *c, void *r, size_t size)
@@ -579,6 +604,7 @@ void scalar_scatter_16_acc(unsigned short *vscatter16)
     }
 }
 
+/* scatter-accumulate the 16 bit elements using C */
 void check_scatter_16_acc()
 {
     memset(vscatter16_ref, FILL_CHAR,
@@ -589,7 +615,7 @@ void check_scatter_16_acc()
                  SCATTER_BUFFER_SIZE * sizeof(unsigned short));
 }
 
-/* scatter the 16 bit elements using C */
+/* masked scatter the 16 bit elements using C */
 void scalar_scatter_16_masked(unsigned short *vscatter16)
 {
     for (int i = 0; i < MATRIX_SIZE; i++) {
@@ -628,7 +654,7 @@ void check_scatter_32()
                  SCATTER_BUFFER_SIZE * sizeof(unsigned int));
 }
 
-/* scatter the 32 bit elements using C */
+/* scatter-accumulate the 32 bit elements using C */
 void scalar_scatter_32_acc(unsigned int *vscatter32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -646,7 +672,7 @@ void check_scatter_32_acc()
                  SCATTER_BUFFER_SIZE * sizeof(unsigned int));
 }
 
-/* scatter the 32 bit elements using C */
+/* masked scatter the 32 bit elements using C */
 void scalar_scatter_32_masked(unsigned int *vscatter32)
 {
     for (int i = 0; i < MATRIX_SIZE; i++) {
@@ -667,7 +693,7 @@ void check_scatter_32_masked()
                  SCATTER_BUFFER_SIZE * sizeof(unsigned int));
 }
 
-/* scatter the 32 bit elements using C */
+/* scatter the 16 bit elements with 32 bit offsets using C */
 void scalar_scatter_16_32(unsigned short *vscatter16_32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -684,7 +710,7 @@ void check_scatter_16_32()
                  SCATTER_BUFFER_SIZE * sizeof(unsigned short));
 }
 
-/* scatter the 32 bit elements using C */
+/* scatter-accumulate the 16 bit elements with 32 bit offsets using C */
 void scalar_scatter_16_32_acc(unsigned short *vscatter16_32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -702,6 +728,7 @@ void check_scatter_16_32_acc()
                  SCATTER_BUFFER_SIZE * sizeof(unsigned short));
 }
 
+/* masked scatter the 16 bit elements with 32 bit offsets using C */
 void scalar_scatter_16_32_masked(unsigned short *vscatter16_32)
 {
     for (int i = 0; i < MATRIX_SIZE; i++) {
@@ -738,6 +765,7 @@ void check_gather_16()
                  MATRIX_SIZE * sizeof(unsigned short));
 }
 
+/* masked gather the elements from the scatter buffer using C */
 void scalar_gather_16_masked(unsigned short *vgather16)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -756,7 +784,7 @@ void check_gather_16_masked()
                  MATRIX_SIZE * sizeof(unsigned short));
 }
 
-/* gather the elements from the scatter buffer using C */
+/* gather the elements from the scatter32 buffer using C */
 void scalar_gather_32(unsigned int *vgather32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -772,6 +800,7 @@ void check_gather_32(void)
                  MATRIX_SIZE * sizeof(unsigned int));
 }
 
+/* masked gather the elements from the scatter32 buffer using C */
 void scalar_gather_32_masked(unsigned int *vgather32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -781,7 +810,6 @@ void scalar_gather_32_masked(unsigned int *vgather32)
     }
 }
 
-
 void check_gather_32_masked(void)
 {
     memset(vgather32_ref, gather_32_masked_init(),
@@ -791,7 +819,7 @@ void check_gather_32_masked(void)
                  vgather32_ref, MATRIX_SIZE * sizeof(unsigned int));
 }
 
-/* gather the elements from the scatter buffer using C */
+/* gather the elements from the scatter16_32 buffer using C */
 void scalar_gather_16_32(unsigned short *vgather16_32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
@@ -807,6 +835,7 @@ void check_gather_16_32(void)
                  MATRIX_SIZE * sizeof(unsigned short));
 }
 
+/* masked gather the elements from the scatter16_32 buffer using C */
void scalar_gather_16_32_masked(unsigned short *vgather16_32)
 {
     for (int i = 0; i < MATRIX_SIZE; ++i) {
-- 
2.17.1