File rasdaemon-0.8.0.49.git+f9cb13b.obscpio of Package rasdaemon
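Each `070701…` run in the listing is a cpio "newc" (SVR4) header. It consists of:

- the six-character magic `070701`;
- thirteen 8-digit hexadecimal fields: inode, mode, uid, gid, nlink, mtime, filesize, devmajor, devminor, rdevmajor, rdevminor, namesize and check;
- the member's path.

In this text rendering, the NUL terminator and the 4-byte alignment padding after the path appear to have been dropped, so the path runs straight into the file contents. The Python sketch below is only a reading aid, not part of rasdaemon or OBS. `parse_newc_header` is a hypothetical helper, and it assumes the dropped-padding layout just described.

```python
import stat
from typing import Dict, Tuple

MAGIC = "070701"
# Header fields in the order the newc format stores them after the magic.
FIELDS = ("ino", "mode", "uid", "gid", "nlink", "mtime", "filesize",
          "devmajor", "devminor", "rdevmajor", "rdevminor", "namesize", "check")
HEADER_LEN = len(MAGIC) + 8 * len(FIELDS)  # 6 + 13 * 8 = 110 characters


def parse_newc_header(text: str) -> Tuple[Dict[str, int], str, str]:
    """Split one newc header off the front of `text` (hypothetical helper).

    Returns (fields, path, rest). Assumes the NUL after the path and the
    4-byte alignment padding were stripped, as in the rendered listing.
    """
    if not text.startswith(MAGIC):
        raise ValueError("expected newc magic 070701")
    fields = {}
    for i, name in enumerate(FIELDS):
        start = len(MAGIC) + 8 * i
        fields[name] = int(text[start:start + 8], 16)  # 8 hex digits per field
    path_len = fields["namesize"] - 1  # namesize counts the trailing NUL
    path = text[HEADER_LEN:HEADER_LEN + path_len]
    rest = text[HEADER_LEN + path_len:]
    return fields, path, rest


# The ci.yml entry from the listing, split into its fixed-width fields.
entry = ("070701" "00000002" "000081A4" "00000000" "00000000" "00000001"
         "65C04BE4" "00000222" "00000000" "00000000" "00000000" "00000000"
         "00000038" "00000000"
         "rasdaemon-0.8.0.49.git+f9cb13b/.github/workflows/ci.ymlname: CI")
fields, path, rest = parse_newc_header(entry)
print(path, oct(fields["mode"]), stat.S_ISREG(fields["mode"]), fields["filesize"])
# rasdaemon-0.8.0.49.git+f9cb13b/.github/workflows/ci.yml 0o100644 True 546
print(repr(rest))  # 'name: CI'
```

Because `namesize` includes the trailing NUL, slicing `namesize - 1` characters separates the path from the glued-on contents. Mode `0o100644` marks a regular file. The `.github` entries carry `0o40755`, which `stat.S_ISDIR` reports as a directory.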
07070100000000000041ED00000000000000000000000365C04BE400000000000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/.github
07070100000001000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/.github/workflows

07070100000002000081A400000000000000000000000165C04BE400000222000000000000000000000000000000000000003800000000rasdaemon-0.8.0.49.git+f9cb13b/.github/workflows/ci.yml
name: CI

on:
  workflow_dispatch:
  push:
  pull_request:

jobs:
  Ubuntu:
    name: Ubuntu
    runs-on: ubuntu-latest
    strategy:
      matrix:
        arch: [x86_64, aarch64, ppc64le]
    steps:
      - uses: actions/checkout@v2
      - name: prepare
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libsqlite3-dev sqlite3 libtraceevent-dev libtraceevent1
      - name: build
        run: |
          autoreconf -vfi
          ./configure --enable-all
          make
          sudo make install

07070100000003000081ED00000000000000000000000165C04BE40000030C000000000000000000000000000000000000004000000000rasdaemon-0.8.0.49.git+f9cb13b/.github/workflows/gen_release.pl
#!/usr/bin/perl

my $body_path = shift or die "Need a file name to store the release body";
my $ver;

open IN, "configure.ac" or die;
while (<IN>) {
	if (m/^[^\#]*AC_INIT\s*\(\s*\[\s*RASdaemon\s*\]\s*,\s*\[?(\d+[\.\d]+)/) {
		$ver=$1;
		last;
	}
}
close IN or die "can't close configure.ac";

die "Can't get version from configure.ac" if (!$ver);

sub gen_version() {
	print "$ver\n";

	open IN, "ChangeLog" or return "error opening ChangeLog";
	open OUT, ">$body_path" or return "error creating $body_path";

	while (<IN>) {
		last if (m/$ver/);
	}
	while (<IN>) {
		next if (m/^$/);
		last if (m/^\S/);
		my $ln = $_;
		$ln =~ s/^\s+\*/-/;
		print OUT $ln;
	}
	close OUT or return "error closing $body_path";
	return "";
}

my $ret = gen_version();
die($ret) if ($ret ne "");

07070100000004000081A400000000000000000000000165C04BE40000061C000000000000000000000000000000000000003C00000000rasdaemon-0.8.0.49.git+f9cb13b/.github/workflows/on_tag.yml
name: Create release on tag

on:
  workflow_dispatch:
  push:
    # Sequence of patterns matched against refs/tags
    tags:
      - 'v[0-9]+*'

jobs:
  release:
    name: Create Release
    runs-on: ubuntu-latest
    outputs:
      upload_url: ${{ steps.create_release.outputs.upload_url }}
    steps:
      - uses: actions/checkout@v2
      - name: Release changelog
        run: |
          .github/workflows/gen_release.pl body_file.tmp > version
          echo "version=$(cat version)" >> $GITHUB_ENV
      - name: Create Release
        id: create_release
        uses: actions/create-release@latest
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ github.ref }}
          release_name: Release ${{ github.ref }}
          body_path: body_file.tmp
          draft: false
          prerelease: true
      - name: prepare
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libsqlite3-dev sqlite3 libtraceevent-dev libtraceevent1
      - name: Create Source Package for version ${{ env.version }}
        run: |
          autoreconf -vfi
          ./configure --enable-all
          make dist-bzip2
      - name: upload
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        uses: mchehab/upload-release-asset@v1.0.3
        with:
          # A tag push carries no github.event.release, so take the URL from the create_release step
          upload_url: ${{ steps.create_release.outputs.upload_url }}
          asset_path: rasdaemon-${{ env.version }}.tar.bz2
          asset_name: rasdaemon-${{ env.version }}.tar.bz2
          asset_content_type: application/bzip2

07070100000005000081A400000000000000000000000165C04BE40000015C000000000000000000000000000000000000002A00000000rasdaemon-0.8.0.49.git+f9cb13b/.gitignore
.deps/
autom4te.cache/
SRPMS/
misc/rasdaemon.spec
misc/ras-mc-ctl.service
misc/rasdaemon.service
Makefile
Makefile.in
compile
config.h
config.h.in
config.h.in~
stamp-h1
aclocal.m4
config.guess
config.log
config.status
config.sub
configure
depcomp
install-sh
libtool
ltmain.sh
missing
rasdaemon
*.o
*.c~
*.h~
rasdaemon-*.tar.bz2
rasdaemon-*.src.rpm

07070100000006000081A400000000000000000000000165C04BE400000254000000000000000000000000000000000000002B00000000rasdaemon-0.8.0.49.git+f9cb13b/.travis.yml
language: cpp
compiler: gcc
dist: bionic

notifications:
  email:
    recipients:
      - mchehab@kernel.org
    on_success: change
    on_failure: always

cache:
  directories:
    - $HOME/.ccache
    - $HOME/pbuilder-bases

matrix:
  include:
    - env: TARGET_OS=bionic
    - compiler: clang
      env: TARGET_OS=bionic
    #powerjobs
    - env: TARGET_OS=bionic
      arch: ppc64le
    - compiler: clang
      arch: ppc64le
      env: TARGET_OS=bionic

before_install:
  - sudo apt-get install -y sqlite3

install:
  - autoreconf -vfi
  - ./configure --enable-all

script:
  - make && sudo make install

after_script:
  - ccache -s

07070100000007000081A400000000000000000000000165C04BE40000002C000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/AUTHORS
Mauro Carvalho Chehab <mchehab@kernel.org>

07070100000008000081A400000000000000000000000165C04BE4000040ED000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/COPYING
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Lesser General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. 
The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. 
b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. 
You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. 
If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. 
If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. 
The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. <one line to give the program's name and a brief idea of what it does.> Copyright (C) <year> <name of author> This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
07070100000009000081A400000000000000000000000165C04BE400001FDF000000000000000000000000000000000000002900000000rasdaemon-0.8.0.49.git+f9cb13b/ChangeLog
2023-02-18 Mauro Carvalho Chehab <mchehab@kernel.org>
	- Version 0.8.0
	* This version now uses libtraceevent. Since its beginning, rasdaemon came with an early version of this library. Now, instead of keeping it embedded, use it from the system's package.
	* Fix mock build target and rasdaemon.spec.in
	* Update README with instructions about contributing and convert to markdown
	* Fix a regression with Kernel 6.1-rc6
	* make distcheck now works
	* Add labels for ASRockRack model X399D8A-2T

2023-01-21 Mauro Carvalho Chehab <mchehab@kernel.org>
	- Version 0.7.0
	* Add labels for ASUS TUF GAMING B450-PLUS II
	* Add four modules supported by HiSilicon common section
	* Updated HiSilicon platform name
	* Relocate reading and display Kunpeng920 errors to under Kunpeng9xx
	* Add support to display the HiSilicon vendor errors for a specified module
	* Add printing usage if necessary parameters are not passed for the vendor-error options
	* Reformat error info of the HiSilicon Kunpeng920
	* Modify error statistics for HiSilicon KunPeng9xx common errors
	* Modify recording Hisilicon common error data
	* Support cpu fault isolation for recoverable errors
	* Support cpu fault isolation for corrected errors
	* Use XSI version of strerror_r on non glibc systems
	* Use the new block_rq_error tracepoint
	* Fix bank limit types check
	* Properly handle localtime() failure
	* Fix for a memory out-of-bounds issue and optimized code to remove duplicate function.
	* Fix possible but unlikely file descriptor leak
	* Fix bashisms

2022-04-12 Mauro Carvalho Chehab <mchehab@kernel.org>
	- Version 0.6.8
	* Fix some issues related to sysconfigdir
	* Some fixes for hisi boards
	* Update ras-mc-ctl manpage to match current options
	* Fix ras-mc-ctl when parsing some dimm sizes
	* New asrock x570 motherboard label
	* New Supermicro labels
	* Support MCE for AMD CPU family 19h
	* Add new SMCA bank types with error decoding
	* Add error handling for Ampere-specific errors.
	* Add support for multi-arch builds

2021-05-26 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
	- Version 0.6.7
	* Support for Ice Lake and Sapphire Rapids
	* Support for HiSilicon Kunpeng9xx
	* Support for Ampere
	* Support for memory failure events
	* Support for ARM processor error information
	* Support for decoding for new SMCA Load Store bank type
	* Add 8 channel decoding for SMCA systems
	* Improvements at the page isolation logic
	* New labels: A2SDi-8C-HLN4F, A2SDi-8C+-HLN4F, ASUS PRIME X570-PRO
	* New labels: Supermicro X10SRA-F and H8DGU
	* Added support to specify SYSCONFDEFDIR
	* RASSTATEDIR is now created at runtime
	* Use a linked list for non-standard error decoding interface
	* PCIe AER now displays PCIe dev name
	* Fixed a memory leak
	* Several fixes
	* Added ppc64le to travis build

2020-07-21 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
	- Version 0.6.6
	* Support for new AMD SMCA bank types
	* Add decoders for more hip08 events
	* Add support for memory Corrected Error predictive failure analysis
	* Some bugs fixed

2019-11-20 Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
	- Version 0.6.5
	* Several fixes for error handling logic
	* Alter tables on SQL in case of errors during update
	* store PCIe dev name and TLP header for the aer event

2019-10-10 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.6.4
	* Change DB for hip08 in order to better handle some OEM data
	* Fix an issue of sqlite3 integer bind parameter mismatch
	* Update instructions about sending patches
	* Fix URLs to git.kernel.org repositories in README file
	* Fix file descriptor leak in ras-report.c:setup_report_socket()
	* Initialize record.cpu before pevent_print_event().
	* Flush trace buffer immediately, not on next call
	* Replace whitespaces by tabs
	* Fix build with musl

2019-08-23 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.6.3
	* Added support for ARM Scalable MCA
	* Added support for HiSilicon HIP08
	* Added support for Hygon Dhyana family 18h processor
	* Added support for disk I/O error monitoring
	* Added devlink events
	* Integrate rasdaemon build tests with Travis CI
	* Fixed rasdaemon high CPU usage when some CPUs are offline
	* Fixed mcgstatus message print
	* Some other minor fixes

2018-08-14 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.6.2
	* Update INSTALL from the auto-tools generated one
	* Reorder this ChangeLog to place new stuff at the beginning
	* add option to show error counts at ras-mc-ctl
	* Do some new gcc 8.1 warning cleanups
	* Use separate string array for PCIe AER error status
	* Fix PCIe AER error type

2018-04-25 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.6.1
	* Update DIMM labels for 2-socket servers
	* Add Skylake Xeon MSCOD values
	* ARM: fully initialize ras_arm_event
	* Update my email

2017-10-14 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.6.0
	* Added support for non-standard CPER error sections
	* Added support for Hisilicon HIP07 SAS HW module
	* Added support for ARM events
	* Updated DIMM labels for Intel Skylake servers

2016-06-08 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.5.9
	* Add Knights Mill and updated DELL labels
	* Configure now reports enabled options

2016-04-15 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.5.8
	* Add Broadwell EP/EX MSCOD and Broadwell DE MSCOD values

2016-02-05 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.5.7
	* Add model numbers for Broadwell-EP/EX and -DE
	* Add support for Knights Landing processor

2015-07-03 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.5.6
	* Add internal errors of IA32_MC4_STATUS for Haswell
	* Use MCA error msg as error_msg
	* Unnecessary comma for empty mc_location string
	* Remove a space from mcgstatus_msg
	* Add support to log Local Machine Check Exception (LMCE)

2015-06-03 Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
	- Version 0.5.5
	* Improve INSTALL summary instructions
	* Add support to match the machine by system's product name
	* Add support for Haswell/Broadwell/Knights Landing
	* Some bug fixes on some MCE handlers

2014-08-15 Mauro Carvalho Chehab <m.chehab@samsung.com>
	- Version 0.5.4
	* Fix a bug while parsing dimm labels on amd64
	* Enable database recording by default on systemd service file
	* Correct range while parsing top, middle and lower layers

2014-08-10 Mauro Carvalho Chehab <m.chehab@samsung.com>
	- Version 0.5.3
	* Add support for extlog trace events
	* Some fixes affecting sqlite handling
	* Handle failures of snprintf()
	* Fix mce numfield decoded error

2014-04-03 Mauro Carvalho Chehab <m.chehab@samsung.com>
	- Version 0.5.2
	* Some fixes for ABRT report support

2014-03-25 Mauro Carvalho Chehab <m.chehab@samsung.com>
	- Version 0.5.1
	* Fix patches at *.service files
	* Some fixes and documentation for --record option

2014-02-16 Mauro Carvalho Chehab <m.chehab@samsung.com>
	- Version 0.5.0
	* Initial ABRT support

2013-09-10 Mauro Carvalho Chehab <m.chehab@samsung.com>
	- Version 0.4.2
	* Fixes ras-mc-ctl layout

2013-05-29 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
	- Version 0.4.1
	* Some fixes, mostly at sqlite3 code
	* Add support at ras-mc-ctl to query database

2013-05-28 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
	- Version 0.4.0
	* Several fixes
	* Get rid of pthreads, to avoid troubles with sqlite3 (requires Kernel 3.10 or later)
	* Add memory error decoding on MCE traces

2013-05-20 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
	- Version 0.3.0
	* Several fixes
	* Add support for MCE traces
	* Add support for PCI AER traces
	* Add a target to build it on rpm-based distros

2013-05-08 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
	- Version 0.2.0
	* Add support to log via syslog
	* Add ras-mc-ctl script to handle dimm labels
	* Add a rpm spec file
	* Make sqlite3 code experimental
	* Add manpages and systemd services
	* Update to take advantage of tracing features on Kernel 3.10

2013-03-12 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
	- Version 0.1.0
	* Initial version

0707010000000A000081A400000000000000000000000165C04BE400003D96000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/INSTALL
Installation Instructions ************************* Copyright (C) 1994-1996, 1999-2002, 2004-2017, 2020-2021 Free Software Foundation, Inc. Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without warranty of any kind. Basic Installation ================== Briefly, the shell command './configure && make && make install' should configure, build, and install this package. The following more-detailed instructions are generic; see the 'README' file for instructions specific to this package. Some packages provide this 'INSTALL' file but do not implement all of the features documented below. The lack of an optional feature in a given package is not necessarily a bug. More recommendations for GNU packages can be found in *note Makefile Conventions: (standards)Makefile Conventions. The 'configure' shell script attempts to guess correct values for various system-dependent variables used during compilation. It uses those values to create a 'Makefile' in each directory of the package. It may also create one or more '.h' files containing system-dependent definitions.
Finally, it creates a shell script 'config.status' that you can run in the future to recreate the current configuration, and a file 'config.log' containing compiler output (useful mainly for debugging 'configure'). It can also use an optional file (typically called 'config.cache' and enabled with '--cache-file=config.cache' or simply '-C') that saves the results of its tests to speed up reconfiguring. Caching is disabled by default to prevent problems with accidental use of stale cache files. If you need to do unusual things to compile the package, please try to figure out how 'configure' could check whether to do them, and mail diffs or instructions to the address given in the 'README' so they can be considered for the next release. If you are using the cache, and at some point 'config.cache' contains results you don't want to keep, you may remove or edit it. The file 'configure.ac' (or 'configure.in') is used to create 'configure' by a program called 'autoconf'. You need 'configure.ac' if you want to change it or regenerate 'configure' using a newer version of 'autoconf'. The simplest way to compile this package is: 1. 'cd' to the directory containing the package's source code and type './configure' to configure the package for your system. Running 'configure' might take a while. While running, it prints some messages telling which features it is checking for. 2. Type 'make' to compile the package. 3. Optionally, type 'make check' to run any self-tests that come with the package, generally using the just-built uninstalled binaries. 4. Type 'make install' to install the programs and any data files and documentation. When installing into a prefix owned by root, it is recommended that the package be configured and built as a regular user, and only the 'make install' phase executed with root privileges. 5. Optionally, type 'make installcheck' to repeat any self-tests, but this time using the binaries in their final installed location. 
This target does not install anything. Running this target as a regular user, particularly if the prior 'make install' required root privileges, verifies that the installation completed correctly. 6. You can remove the program binaries and object files from the source code directory by typing 'make clean'. To also remove the files that 'configure' created (so you can compile the package for a different kind of computer), type 'make distclean'. There is also a 'make maintainer-clean' target, but that is intended mainly for the package's developers. If you use it, you may have to get all sorts of other programs in order to regenerate files that came with the distribution. 7. Often, you can also type 'make uninstall' to remove the installed files again. In practice, not all packages have tested that uninstallation works correctly, even though it is required by the GNU Coding Standards. 8. Some packages, particularly those that use Automake, provide 'make distcheck', which can be used by developers to test that all other targets like 'make install' and 'make uninstall' work correctly. This target is generally not run by end users. Compilers and Options ===================== Some systems require unusual options for compilation or linking that the 'configure' script does not know about. Run './configure --help' for details on some of the pertinent environment variables. You can give 'configure' initial values for configuration parameters by setting variables in the command line or in the environment. Here is an example: ./configure CC=c99 CFLAGS=-g LIBS=-lposix *Note Defining Variables::, for more details. Compiling For Multiple Architectures ==================================== You can compile the package for more than one kind of computer at the same time, by placing the object files for each architecture in their own directory. To do this, you can use GNU 'make'.
'cd' to the directory where you want the object files and executables to go and run the 'configure' script. 'configure' automatically checks for the source code in the directory that 'configure' is in and in '..'. This is known as a "VPATH" build. With a non-GNU 'make', it is safer to compile the package for one architecture at a time in the source code directory. After you have installed the package for one architecture, use 'make distclean' before reconfiguring for another architecture. On MacOS X 10.5 and later systems, you can create libraries and executables that work on multiple system types--known as "fat" or "universal" binaries--by specifying multiple '-arch' options to the compiler but only a single '-arch' option to the preprocessor. Like this: ./configure CC="gcc -arch i386 -arch x86_64 -arch ppc -arch ppc64" \ CXX="g++ -arch i386 -arch x86_64 -arch ppc -arch ppc64" \ CPP="gcc -E" CXXCPP="g++ -E" This is not guaranteed to produce working output in all cases; you may have to build one architecture at a time and combine the results using the 'lipo' tool if you have problems. Installation Names ================== By default, 'make install' installs the package's commands under '/usr/local/bin', include files under '/usr/local/include', etc. You can specify an installation prefix other than '/usr/local' by giving 'configure' the option '--prefix=PREFIX', where PREFIX must be an absolute file name. You can specify separate installation prefixes for architecture-specific files and architecture-independent files. If you pass the option '--exec-prefix=PREFIX' to 'configure', the package uses PREFIX as the prefix for installing programs and libraries. Documentation and other data files still use the regular prefix. In addition, if you use an unusual directory layout you can give options like '--bindir=DIR' to specify different values for particular kinds of files.
Run 'configure --help' for a list of the directories you can set and what kinds of files go in them. In general, the default for these options is expressed in terms of '${prefix}', so that specifying just '--prefix' will affect all of the other directory specifications that were not explicitly provided. The most portable way to affect installation locations is to pass the correct locations to 'configure'; however, many packages provide one or both of the following shortcuts of passing variable assignments to the 'make install' command line to change installation locations without having to reconfigure or recompile. The first method involves providing an override variable for each affected directory. For example, 'make install prefix=/alternate/directory' will choose an alternate location for all directory configuration variables that were expressed in terms of '${prefix}'. Any directories that were specified during 'configure', but not in terms of '${prefix}', must each be overridden at install time for the entire installation to be relocated. The approach of makefile variable overrides for each directory variable is required by the GNU Coding Standards, and ideally causes no recompilation. However, some platforms have known limitations with the semantics of shared libraries that end up requiring recompilation when using this method, particularly noticeable in packages that use GNU Libtool. The second method involves providing the 'DESTDIR' variable. For example, 'make install DESTDIR=/alternate/directory' will prepend '/alternate/directory' before all installation names. The approach of 'DESTDIR' overrides is not required by the GNU Coding Standards, and does not work on platforms that have drive letters. On the other hand, it does better at avoiding recompilation issues, and works well even when some directory options were not specified in terms of '${prefix}' at 'configure' time. 
Optional Features ================= If the package supports it, you can cause programs to be installed with an extra prefix or suffix on their names by giving 'configure' the option '--program-prefix=PREFIX' or '--program-suffix=SUFFIX'. Some packages pay attention to '--enable-FEATURE' options to 'configure', where FEATURE indicates an optional part of the package. They may also pay attention to '--with-PACKAGE' options, where PACKAGE is something like 'gnu-as' or 'x' (for the X Window System). The 'README' should mention any '--enable-' and '--with-' options that the package recognizes. For packages that use the X Window System, 'configure' can usually find the X include and library files automatically, but if it doesn't, you can use the 'configure' options '--x-includes=DIR' and '--x-libraries=DIR' to specify their locations. Some packages offer the ability to configure how verbose the execution of 'make' will be. For these packages, running './configure --enable-silent-rules' sets the default to minimal output, which can be overridden with 'make V=1'; while running './configure --disable-silent-rules' sets the default to verbose, which can be overridden with 'make V=0'. Particular systems ================== On HP-UX, the default C compiler is not ANSI C compatible. If GNU CC is not installed, it is recommended to use the following options in order to use an ANSI C compiler: ./configure CC="cc -Ae -D_XOPEN_SOURCE=500" and if that doesn't work, install pre-built binaries of GCC for HP-UX. HP-UX 'make' updates targets which have the same timestamps as their prerequisites, which makes it generally unusable when shipped generated files such as 'configure' are involved. Use GNU 'make' instead. On OSF/1 a.k.a. Tru64, some versions of the default C compiler cannot parse its '<wchar.h>' header file. The option '-nodtk' can be used as a workaround. 
If GNU CC is not installed, it is therefore recommended to try ./configure CC="cc" and if that doesn't work, try ./configure CC="cc -nodtk" On Solaris, don't put '/usr/ucb' early in your 'PATH'. This directory contains several dysfunctional programs; working variants of these programs are available in '/usr/bin'. So, if you need '/usr/ucb' in your 'PATH', put it _after_ '/usr/bin'. On Haiku, software installed for all users goes in '/boot/common', not '/usr/local'. It is recommended to use the following options: ./configure --prefix=/boot/common Specifying the System Type ========================== There may be some features 'configure' cannot figure out automatically, but needs to determine by the type of machine the package will run on. Usually, assuming the package is built to be run on the _same_ architectures, 'configure' can figure that out, but if it prints a message saying it cannot guess the machine type, give it the '--build=TYPE' option. TYPE can either be a short name for the system type, such as 'sun4', or a canonical name which has the form: CPU-COMPANY-SYSTEM where SYSTEM can have one of these forms: OS KERNEL-OS See the file 'config.sub' for the possible values of each field. If 'config.sub' isn't included in this package, then this package doesn't need to know the machine type. If you are _building_ compiler tools for cross-compiling, you should use the option '--target=TYPE' to select the type of system they will produce code for. If you want to _use_ a cross compiler, that generates code for a platform different from the build platform, you should specify the "host" platform (i.e., that on which the generated programs will eventually be run) with '--host=TYPE'. Sharing Defaults ================ If you want to set default values for 'configure' scripts to share, you can create a site shell script called 'config.site' that gives default values for variables like 'CC', 'cache_file', and 'prefix'. 
'configure' looks for 'PREFIX/share/config.site' if it exists, then 'PREFIX/etc/config.site' if it exists. Or, you can set the 'CONFIG_SITE' environment variable to the location of the site script. A warning: not all 'configure' scripts look for a site script. Defining Variables ================== Variables not defined in a site shell script can be set in the environment passed to 'configure'. However, some packages may run configure again during the build, and the customized values of these variables may be lost. In order to avoid this problem, you should set them in the 'configure' command line, using 'VAR=value'. For example: ./configure CC=/usr/local2/bin/gcc causes the specified 'gcc' to be used as the C compiler (unless it is overridden in the site shell script). Unfortunately, this technique does not work for 'CONFIG_SHELL' due to an Autoconf limitation. Until the limitation is lifted, you can use this workaround: CONFIG_SHELL=/bin/bash ./configure CONFIG_SHELL=/bin/bash 'configure' Invocation ====================== 'configure' recognizes the following options to control how it operates. '--help' '-h' Print a summary of all of the options to 'configure', and exit. '--help=short' '--help=recursive' Print a summary of the options unique to this package's 'configure', and exit. The 'short' variant lists options used only in the top level, while the 'recursive' variant lists options also present in any nested packages. '--version' '-V' Print the version of Autoconf used to generate the 'configure' script, and exit. '--cache-file=FILE' Enable the cache: use and save the results of the tests in FILE, traditionally 'config.cache'. FILE defaults to '/dev/null' to disable caching. '--config-cache' '-C' Alias for '--cache-file=config.cache'. '--quiet' '--silent' '-q' Do not print messages saying which checks are being made. To suppress all normal output, redirect it to '/dev/null' (any error messages will still be shown). 
'--srcdir=DIR' Look for the package's source code in directory DIR. Usually 'configure' can determine that directory automatically. '--prefix=DIR' Use DIR as the installation prefix. *note Installation Names:: for more details, including other options available for fine-tuning the installation locations. '--no-create' '-n' Run the configure checks, but stop before creating any output files. 'configure' also accepts some other, not widely useful, options. Run 'configure --help' for more details. 0707010000000B000081A400000000000000000000000165C04BE400001163000000000000000000000000000000000000002B00000000rasdaemon-0.8.0.49.git+f9cb13b/Makefile.amAM_DISTCHECK_CONFIGURE_FLAGS = --enable-all ACLOCAL_AMFLAGS=-I m4 SUBDIRS = util man SYSTEMD_SERVICES_IN = misc/rasdaemon.service.in misc/ras-mc-ctl.service.in SYSTEMD_SERVICES = $(SYSTEMD_SERVICES_IN:.service.in=.service) EXTRA_DIST = $(SYSTEMD_SERVICES_IN) misc/rasdaemon.env CLEANFILES= \ misc/ras-mc-ctl.service \ misc/rasdaemon.service DISTCLEANFILES = misc/rasdaemon.spec # This rule is needed because \@sbindir\@ is expanded to \${exec_prefix\}/sbin # during ./configure phase, therefore it is not possible to add .service.in # files to AC_CONFIG_FILES in configure.ac SUFFIXES = .service.in .service .service.in.service: sed -e s,\@sbindir\@,$(sbindir),g -e s,\@SYSCONFDEFDIR\@,@SYSCONFDEFDIR@,g $< > $@ # This rule is needed because the service files must be generated on target # system after ./configure phase all-local: $(SYSTEMD_SERVICES) sbin_PROGRAMS = rasdaemon rasdaemon_SOURCES = rasdaemon.c ras-events.c ras-mc-handler.c \ bitfield.c if WITH_SQLITE3 rasdaemon_SOURCES += ras-record.c endif if WITH_AER rasdaemon_SOURCES += ras-aer-handler.c endif if WITH_NON_STANDARD rasdaemon_SOURCES += ras-non-standard-handler.c endif if WITH_ARM rasdaemon_SOURCES += ras-arm-handler.c endif if WITH_MCE rasdaemon_SOURCES += ras-mce-handler.c mce-intel.c mce-amd.c \ mce-intel-p4-p6.c mce-intel-nehalem.c \ mce-intel-dunnington.c 
mce-intel-tulsa.c \ mce-intel-sb.c mce-intel-ivb.c mce-intel-haswell.c \ mce-intel-knl.c mce-intel-broadwell-de.c \ mce-intel-broadwell-epex.c mce-intel-skylake-xeon.c \ mce-amd-k8.c mce-amd-smca.c mce-intel-i10nm.c endif if WITH_EXTLOG rasdaemon_SOURCES += ras-extlog-handler.c endif if WITH_DEVLINK rasdaemon_SOURCES += ras-devlink-handler.c endif if WITH_DISKERROR rasdaemon_SOURCES += ras-diskerror-handler.c endif if WITH_MEMORY_FAILURE rasdaemon_SOURCES += ras-memory-failure-handler.c endif if WITH_ABRT_REPORT rasdaemon_SOURCES += ras-report.c endif if WITH_HISI_NS_DECODE rasdaemon_SOURCES += non-standard-hisi_hip08.c non-standard-hisilicon.c endif if WITH_MEMORY_CE_PFA rasdaemon_SOURCES += rbtree.c ras-page-isolation.c endif if WITH_AMP_NS_DECODE rasdaemon_SOURCES += non-standard-ampere.c endif if WITH_CPU_FAULT_ISOLATION rasdaemon_SOURCES += ras-cpu-isolation.c queue.c endif if WITH_CXL rasdaemon_SOURCES += ras-cxl-handler.c endif if WITH_YITIAN_NS_DECODE rasdaemon_SOURCES += non-standard-yitian.c endif if WITH_JAGUAR_NS_DECODE rasdaemon_SOURCES += non-standard-jaguarmicro.c endif rasdaemon_LDADD = -lpthread $(SQLITE3_LIBS) $(LIBTRACEEVENT_LIBS) rasdaemon_CFLAGS = $(SQLITE3_CFLAGS) $(LIBTRACEEVENT_CFLAGS) include_HEADERS = config.h ras-events.h ras-logger.h ras-mc-handler.h \ ras-aer-handler.h ras-mce-handler.h ras-record.h bitfield.h ras-report.h \ ras-extlog-handler.h ras-arm-handler.h ras-non-standard-handler.h \ ras-devlink-handler.h ras-diskerror-handler.h rbtree.h ras-page-isolation.h \ non-standard-hisilicon.h non-standard-ampere.h ras-memory-failure-handler.h \ ras-cxl-handler.h ras-cpu-isolation.h queue.h non-standard-yitian.h \ non-standard-jaguarmicro.h # This rule can't be called with more than one Makefile job (like make -j8) # I can't figure out a way to fix that dist-rpm: dist-bzip2 if [ ! 
-d "`rpm --eval %{_topdir}`/SOURCES/" ]; then mkdir "`rpm --eval %{_topdir}`/SOURCES/"; fi cp @PACKAGE@-@PACKAGE_VERSION@.tar.bz2 `rpm --eval %{_topdir}`/SOURCES/ rpmbuild -ba misc/@PACKAGE@.spec cp `rpm --eval %{_topdir}`/SRPMS/@PACKAGE@-@PACKAGE_VERSION@*.src.rpm . srpm: dist-bzip2 if [ ! -d "`rpm --eval %{_topdir}`/SOURCES/" ]; then mkdir "`rpm --eval %{_topdir}`/SOURCES/"; fi cp @PACKAGE@-@PACKAGE_VERSION@.tar.bz2 `rpm --eval %{_topdir}`/SOURCES/ rpmbuild -bs misc/@PACKAGE@.spec mock: srpm mock --resultdir="./SRPMS" `rpm --eval %{_topdir}`/SRPMS/@PACKAGE@-@PACKAGE_VERSION@*.src.rpm rpmlint: rpmlint misc/@PACKAGE@.spec `rpm --eval %{_topdir}`/SRPMS/@PACKAGE@-@PACKAGE_VERSION@*.src.rpm `rpm --eval %{_topdir}`/RPMS/*/@PACKAGE@-@PACKAGE_VERSION@*.rpm upload: scp `rpm --eval %{_topdir}`/SRPMS/@PACKAGE@-@PACKAGE_VERSION@*.src.rpm @PACKAGE@-@PACKAGE_VERSION@.tar.bz2 misc/rasdaemon.spec www.infradead.org:public_html/rasdaemon # custom target install-data-local: $(install_sh) -d "$(DESTDIR)@sysconfdir@/ras/dimm_labels.d" if WITH_MEMORY_CE_PFA $(install_sh) @abs_srcdir@/misc/rasdaemon.env "$(DESTDIR)@SYSCONFDEFDIR@/rasdaemon" endif 0707010000000C000081A400000000000000000000000165C04BE400000B3B000000000000000000000000000000000000002400000000rasdaemon-0.8.0.49.git+f9cb13b/NEWSRAS DAEMON ========== In Kernel 3.5 we've started to address the long-discussed need of having a better way to handle platform Reliability, Availability and Serviceability (RAS). Basically, a tracepoint event called ras:mc_event, which handles memory errors, was added there, together with the HERM/EDAC version 3.0 patches. In Kernel 3.8, a new event was added to handle PCIe AER events (ras:aer_event) [1]. In kernel 3.9, a new driver was added to report hardware memory errors that come from the BIOS via ras:mc_event (the new ghes_edac driver). It is still on my TODO list to add a RAS trace event for non-memory related errors that come via the MCA machine check handler (mcelog).
While progress was made on the Kernel infrastructure, the needed userspace tools were still lacking. So, I decided to start materializing the userspace counterpart for what was informally named rasdaemon in some discussions. The rasdaemon tool is available at: http://git.infradead.org/users/mchehab/rasdaemon.git The current version is at a very early stage, and it contains a copy of the library that Steven Rostedt is writing for the trace-cmd tool. The plan is to use the trace-cmd library once it starts to be packaged as a separate library. I'd like to thank Steven for the help he gave me in writing this initial version. The current version of the tool enables the ras:mc_event log and reads it via the raw trace debugfs node: /sys/kernel/debug/tracing/per_cpu/cpu*/trace_pipe_raw It also has code that allows recording the errors in an sqlite3 database. The long-term plan is to provide a tool that will catch and handle all ras:* error events that come from the Kernel tracing infrastructure, logging them, providing tools to report them, and being able to detect burst errors (like the ones caused by a solar storm hitting memories) or sparse errors, in a way that would give users a clue about the root cause of the error. Of course, there is much to do there. A natural evolution of the tool is to add support for the ras:aer_event traces that can come from PCIe AER. While it currently works with Kernels since 3.5, there are a number of interesting tracing changes that are planned to be merged for Kernel 3.10: - poll() support for per_cpu trace_pipe_raw; - a timestamp that can be more easily associated with the machine's uptime; - support for a separate ring buffer for RAS. So, the plan is for the final version (v1.0) to require at least kernel 3.10. This is currently at a very early stage.
Help is needed ;) So, please send us suggestions, patches, etc. to the EDAC mailing list: linux-edac@vger.kernel.org Thanks, Mauro Carvalho Chehab 2013-03-14 - [1] Currently, ras:mc_event is at include/ras/. It is on my todo list to move it to be together with ras:aer_event, at include/trace/events/ras.h. 0707010000000D000081A400000000000000000000000165C04BE400002696000000000000000000000000000000000000002900000000rasdaemon-0.8.0.49.git+f9cb13b/README.mdRAS Daemon ========== These tools provide a way to get Platform Reliability, Availability and Serviceability (RAS) reports made via the Kernel tracing events. The main repository for rasdaemon is at: - <http://git.infradead.org/users/mchehab/rasdaemon.git> And two mirrors are available: - <https://github.com/mchehab/rasdaemon> - <https://gitlab.com/mchehab_kernel/rasdaemon> Tarballs for each release can be found at: - <http://www.infradead.org/~mchehab/rasdaemon/> GOALS ===== Its initial goal is to replace the edac-tools that bitrotted after the HERM (Hardware Events Report Method) patches[^1] were added to the EDAC Kernel drivers. [^1]: <http://lkml.indiana.edu/hypermail/linux/kernel/1205.1/02075.html> Its long-term goal is to be the userspace tool that will collect all hardware error events reported by the Linux Kernel from several sources (EDAC, MCE, PCI, ...) into one common framework. It is not meant to provide tools for doing error injection, as there are other tools already covering it, like: <git://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/mce-test.git> Yet, a few testing scripts are provided under the /contrib dir. When the final version of the HERM patches was merged upstream, it was decided not to expose the memory error counters to userspace. This is one of the differences from what was provided by edac-utils, as EDAC 2.0.0 exports errors via a set of sysfs nodes that sum the number of errors per DIMM, per memory channel and per memory controller.
However, those counters only ever increase, and there's no way to tell whether the errors are very sparse in time, whether their occurrence is increasing over time, or whether they're due to some burst, perhaps caused by a solar storm hitting the ionosphere. In other words, the rationale for not exposing such information is that the counters: 1. can easily be accounted for in userspace; 2. are not really meaningful. E.g., one system with, let's say, 10 corrected errors can be fine, while another one with the same number of errors can have problems, as the error counters don't take into account things like system uptime, memory error bursts (that could be caused by a solar storm, for example), etc. So, the idea since then has been to make the kernel-userspace interface simpler and move the policy to the userspace daemon. It is up to the userspace daemon to correlate the data about the RAS events and provide the system administrator with a comprehensive report, giving a better hint about whether to contact the hardware vendor to replace a component that is working in a degraded state, or to simply discard the error. So, the approach taken here is to allow storing those errors in an SQLite database, so that the data can be mined later. Sophisticated data mining analysis is currently out of scope, as it would require enough statistical data about hardware MTBF. In other words, an abnormal component that needs to be replaced should be statistically compared with a similar component that operates under normal conditions. To do such checks, the analysis tool would need to know the probability density function (p.d.f.) of that component and its relevant parameters (like mean and standard deviation, if the p.d.f. is a Normal distribution). While this tool works since Kernel 3.5 (where the HERM patches were added), Kernel 3.10 or later is needed to get its full benefit.
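The point about counters can be made concrete with a small illustration. This is not code from rasdaemon. It assumes the error events are available as `time_t` timestamps sorted in ascending order (for example, as they might be read back from a database). The hypothetical helper `max_errors_in_window()` uses a two-pointer sliding window to find the largest number of errors inside any interval of `window` seconds. The window length and burst threshold are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper (not part of rasdaemon): given n error timestamps
 * sorted in ascending order, return the largest number of errors that
 * fall inside any interval [t, t + window), using a sliding window.
 */
static size_t max_errors_in_window(const time_t *ts, size_t n, time_t window)
{
	size_t best = 0, start = 0;

	assert(ts != NULL || n == 0);
	assert(window > 0);

	for (size_t end = 0; end < n; end++) {
		/* Drop events too old to share a window with ts[end]. */
		while (ts[end] - ts[start] >= window)
			start++;
		if (end - start + 1 > best)
			best = end - start + 1;
	}
	return best;
}

/* Flag a burst when at least `threshold` errors share one window. */
static int is_burst(const time_t *ts, size_t n, time_t window,
		    size_t threshold)
{
	return max_errors_in_window(ts, n, window) >= threshold;
}
```

Consider two hosts that each logged 10 corrected errors. On one host the errors are spread one per day; on the other they arrive six seconds apart. With a one-hour window, the first series yields 1 and the second yields 10, even though a monotonic counter would read 10 for both. Choosing the window and threshold is exactly the kind of policy the text leaves to userspace.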
COMPILING AND INSTALLING ======================== autoconf, automake, libtool, a C compiler and libtraceevent need to be installed, as well as sqlite3 if errors will be stored in a database. On Fedora, this is done by installing the following packages: ``` make gcc autoconf automake libtool libtraceevent-devel tar sqlite-devel (if sqlite3 will be used) perl-DBD-SQLite (if sqlite3 will be used) ``` To install them on Fedora, run: ``` $ dnf install -y make gcc autoconf automake libtool tar \ libtraceevent-devel ``` Or, if an sqlite3 database will be used to store data: ``` $ dnf install -y make gcc autoconf automake libtool tar sqlite-devel \ perl-DBD-SQLite libtraceevent-devel ``` Several features are optional and must be enabled via ./configure parameters (run ./configure --help for the full list). For instance: ``` --enable-all enable all features --enable-sqlite3 enable storing data at SQL lite database (currently experimental) --enable-aer enable PCIe AER events (currently experimental) --enable-mce enable MCE events (currently experimental) ``` In order to compile it, run: ``` $ autoreconf -vfi $ ./configure [parameters] $ make ``` So, for example, to enable AER and MCE events but not sqlite3: ``` $ autoreconf -vfi && ./configure --enable-aer --enable-mce && make ``` After compiling, run, as root: ``` # make install ``` RPM-based compilation ===================== If the distribution is rpm-based, an alternative method is to run: ``` $ autoreconf -vfi && ./configure ``` The above procedure will generate a file at misc/rasdaemon.spec. You may edit it to add/remove the --enable-\[option\] parameters. To generate the rpm files, do: ``` # make mock ``` To install the rpm files, run, as root: ``` # rpm -i $(ls SRPMS/rasdaemon-*.rpm|tail -1) ``` RUNNING ======= The daemon generally requires root permission in order to read the needed debugfs trace nodes, which need to be mounted beforehand. rasdaemon checks /proc/mounts to find where debugfs is mounted and uses that location while running. To run rasdaemon in the background, just call it without any parameters: ``` # rasdaemon ``` The output will be available via syslog.
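The debugfs lookup through /proc/mounts can be sketched with the standard getmntent(3) interface. This is an illustration only and should not be read as rasdaemon's actual code. `find_debugfs_mount()` is a hypothetical helper. It scans a mount table in /proc/mounts format and copies out the directory of the first entry whose filesystem type is `debugfs`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not rasdaemon's implementation): scan a mount
 * table such as /proc/mounts and copy the mount point of the first
 * "debugfs" entry into buf. Returns 0 on success, or -1 if the table
 * can't be read, no debugfs entry exists, or buf is too small.
 */
static int find_debugfs_mount(const char *table, char *buf, size_t len)
{
	FILE *fp;
	struct mntent *ent;
	int ret = -1;

	assert(buf != NULL && len > 0);

	fp = setmntent(table, "r");
	if (!fp)
		return -1;

	while ((ent = getmntent(fp)) != NULL) {
		if (strcmp(ent->mnt_type, "debugfs") == 0) {
			/* snprintf always NUL-terminates; reject truncated paths. */
			int n = snprintf(buf, len, "%s", ent->mnt_dir);

			if (n >= 0 && (size_t)n < len)
				ret = 0;
			break;
		}
	}
	endmntent(fp);
	return ret;
}
```

A caller would normally pass "/proc/mounts" and then build trace paths under `<mount point>/tracing/` from the result. Taking the table path as a parameter makes the function easy to test against a sample file.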
To run it in the foreground and see the logs on the console, run it as: ``` # rasdaemon -f ``` or, if you also want to record errors in the database (--enable-sqlite3 is required): ``` # rasdaemon -f -r ``` To post-process and decode received MCA errors on AMD SMCA systems, run: ``` # rasdaemon -p --status <STATUS_reg> --ipid <IPID_reg> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM> ``` The Status and IPID register values (in hex) are mandatory. The `smca` flag, together with `family` and `model`, is required if not decoding locally. The `bank` parameter is optional. You may also start it via systemd: ``` # systemctl start rasdaemon ``` rasdaemon will then output the messages to journald. TESTING ======= A script is provided under /contrib to test the daemon's EDAC handler. While the daemon is running, just run: ``` # contrib/edac-fake-inject ``` The script requires a Kernel compiled with CONFIG_EDAC_DEBUG and a running EDAC driver. MCE error handling can be tested with mce-inject: <https://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git> For it to work, the Kernel mce-inject module must be compiled and loaded. APEI error injection can use this tool: <https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/mce-test.git/> AER error injection can use this tool: <https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/> # SUBMITTING PATCHES If you want to help improve this tool, be my guest! We try to follow the Kernel's CodingStyle and submission rules as a reference. To contribute to rasdaemon, please send a pull request via the GitHub repository at: - <https://github.com/mchehab/rasdaemon> Or, alternatively, send a merge request against the GitLab repository at: - <https://gitlab.com/mchehab_kernel/rasdaemon> It is also recommended to send patches to <linux-edac@vger.kernel.org> with a copy to: - Mauro Carvalho Chehab \<<mchehab@kernel.org>\> Please note that GitHub is the preferred way.
If you're not using it, please be kind enough to open an issue there so that we can track the patch series. Don't forget to add a description of the patch in the body of the email, with a Signed-off-by: line at the end of the patch description (before the unified diff with the patch). We use Signed-off-by the same way as in the Kernel, so I'm transcribing below the same text found in the Kernel's Documentation/SubmittingPatches: ``` "To improve tracking of who did what, especially with patches that can percolate to their final resting place in the kernel through several layers of maintainers, we've introduced a "sign-off" procedure on patches that are being emailed around. The sign-off is a simple line at the end of the explanation for the patch, which certifies that you wrote it or otherwise have the right to pass it on as an open-source patch. The rules are pretty simple: if you can certify the below: Developer's Certificate of Origin 1.1 By making a contribution to this project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
then you just add a line saying Signed-off-by: Random J Developer <random@developer.example.org> using your real name (sorry, no pseudonyms or anonymous contributions.)" ``` 0707010000000E000081A400000000000000000000000165C04BE4000000F3000000000000000000000000000000000000002400000000rasdaemon-0.8.0.49.git+f9cb13b/TODO1) Handle signals. 2) Better handle error conditions to be sure that events won't be lost. 3) Test support for PCIe AER trace records. 4) Better parse mce trace records. 5) Make it work fine with offline CPUs. 6) Handle CPU hotplugs. 0707010000000F000081A400000000000000000000000165C04BE400000A63000000000000000000000000000000000000002A00000000rasdaemon-0.8.0.49.git+f9cb13b/bitfield.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * The code below were adapted from Andi Kleen/Intel/SuSe mcelog code, * released under GNU Public General License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* Write the names of the bits set in status into buf, separated by ", " */ unsigned int bitfield_msg(char *buf, size_t len, const char **bitarray, unsigned int array_len, unsigned int bit_offset, unsigned int ignore_bits, uint64_t status) { unsigned int i; int n; char *p = buf; len--; for (i = 0; i < array_len; i++) { if (status & ignore_bits) continue; /* 1ULL: status is 64 bits wide, and shifting a plain int by 31 or more is undefined */ if (status & (1ULL << (i + bit_offset))) { if (p != buf) { n = snprintf(p, len, ", "); /* stop on error or truncation so that len cannot underflow */ if (n < 0 || (size_t)n >= len) break; len -= n; p += n; } if (!bitarray[i]) n = snprintf(p, len, "BIT%u", i + bit_offset); else n = snprintf(p, len, "%s", bitarray[i]); if (n < 0 || (size_t)n >= len) break; len -= n; p += n; } } *p = 0; return p - buf; } /* Smallest all-ones mask that is >= i */ static uint64_t bitmask(uint64_t i) { uint64_t mask = 1; while (mask < i) mask = (mask << 1) | 1; return mask; } void decode_bitfield(struct mce_event *e, uint64_t status, struct field *fields) { struct field *f; for (f = fields; f->str; f++) { uint64_t v = (status >> f->start_bit) & bitmask(f->stringlen - 1); char *s = NULL; if (v < f->stringlen) s = f->str[v]; if (!s) { if (v == 0) continue; mce_snprintf(e->error_msg, "<%u:%llx>", f->start_bit, (unsigned long long)v); } else mce_snprintf(e->error_msg, "%s", s); } } void decode_numfield(struct mce_event *e, uint64_t status, struct numfield *fields) { struct numfield *f; for (f = fields; f->name; f++) { uint64_t mask = (1ULL << (f->end - f->start + 1)) - 1; uint64_t v = (status >> f->start) & mask; if (v > 0 || f->force) { char fmt[32] = {0}; snprintf(fmt, 32, "%%s: %s\n", f->fmt ? 
f->fmt : "%Lu"); mce_snprintf(e->error_msg, fmt, f->name, v); } } } 07070100000010000081A400000000000000000000000165C04BE400000846000000000000000000000000000000000000002A00000000rasdaemon-0.8.0.49.git+f9cb13b/bitfield.h/* * The code below came from Andi Kleen/Intel/SuSe mcelog code, * released under GNU Public General License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <stdint.h> /* Generic bitfield decoder */ struct field { unsigned start_bit; char **str; unsigned stringlen; }; struct numfield { unsigned start, end; char *name; char *fmt; int force; }; #define FIELD(start_bit, name) { start_bit, name, ARRAY_SIZE(name) } #define FIELD_NULL(start_bit) { start_bit, NULL, 0 } #define SBITFIELD(start_bit, string) { start_bit, ((char * [2]) { NULL, string }), 2 } #define NUMBER(start, end, name) { start, end, name, "%Lu", 0 } #define NUMBERFORCE(start, end, name) { start, end, name, "%Lu", 1 } #define HEXNUMBER(start, end, name) { start, end, name, "%Lx", 0 } #define HEXNUMBERFORCE(start, end, name) { start, end, name, "%Lx", 1 } struct mce_event; void decode_bitfield(struct mce_event *e, uint64_t status, struct field *fields); void decode_numfield(struct mce_event *e, uint64_t status, struct numfield *fields); #define MASK(x) ((1ULL << (1 + (x))) - 1) #define EXTRACT(v, a, b) (((v) >> (a)) & 
MASK((b)-(a))) static inline int test_prefix(int nr, uint32_t value) { return ((value >> nr) == 1); } /* Ancillary routines */ unsigned bitfield_msg(char *buf, size_t len, const char **bitarray, unsigned array_len, unsigned bit_offset, unsigned ignore_bits, uint64_t status); 07070100000011000081A400000000000000000000000165C04BE4000028DC000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/configure.acAC_INIT([RASdaemon],[0.8.0]) AM_SILENT_RULES([yes]) AC_CANONICAL_TARGET AC_CONFIG_MACRO_DIR([m4]) AC_CONFIG_HEADERS([config.h]) AM_INIT_AUTOMAKE([foreign]) AC_PROG_CC AC_PROG_INSTALL LT_INIT X_AC_META AC_CONFIG_FILES([ Makefile man/Makefile man/ras-mc-ctl.8 man/rasdaemon.1 misc/rasdaemon.spec util/Makefile util/ras-mc-ctl ]) AC_ARG_ENABLE([all], AS_HELP_STRING([--enable-all], [enable all features])) AC_ARG_ENABLE([sqlite3], AS_HELP_STRING([--enable-sqlite3], [enable storing data in an SQLite database (currently experimental)])) AS_IF([test "x$enable_sqlite3" = "xyes" || test "x$enable_all" = "xyes"], [ AC_CHECK_LIB(sqlite3, sqlite3_open,[echo "found sqlite3"] , AC_MSG_ERROR([*** Unable to find sqlite3 library]), ) SQLITE3_LIBS="-lsqlite3" AC_DEFINE(HAVE_SQLITE3,1,"have sqlite3") AC_SUBST([WITH_SQLITE3]) ]) AM_CONDITIONAL([WITH_SQLITE3], [test x$enable_sqlite3 = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_SQLITE3], [USE_SQLITE3="yes"], [USE_SQLITE3="no"]) AC_SUBST([SQLITE3_LIBS]) has_libtraceevent_ver=0 dnl check for libtraceevent library PKG_CHECK_MODULES([LIBTRACEEVENT], [libtraceevent], [has_libtraceevent_ver=1]) AS_IF([test "$has_libtraceevent_ver" -eq 0], [ AC_MSG_ERROR([libtraceevent is required but was not found]) ]) AC_ARG_ENABLE([aer], AS_HELP_STRING([--enable-aer], [enable PCIe AER events (currently experimental)])) AS_IF([test "x$enable_aer" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_AER,1,"have PCIe AER events collect") AC_SUBST([WITH_AER]) ]) AM_CONDITIONAL([WITH_AER], [test x$enable_aer = xyes || test
x$enable_all = xyes]) AM_COND_IF([WITH_AER], [USE_AER="yes"], [USE_AER="no"]) AC_ARG_ENABLE([non_standard], AS_HELP_STRING([--enable-non-standard], [enable NON_STANDARD events (currently experimental)])) AS_IF([test "x$enable_non_standard" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_NON_STANDARD,1,"have UNKNOWN_SEC events collect") AC_SUBST([WITH_NON_STANDARD]) ]) AM_CONDITIONAL([WITH_NON_STANDARD], [test x$enable_non_standard = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_NON_STANDARD], [USE_NON_STANDARD="yes"], [USE_NON_STANDARD="no"]) AC_ARG_ENABLE([arm], AS_HELP_STRING([--enable-arm], [enable ARM events (currently experimental)])) AS_IF([test "x$enable_arm" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_ARM,1,"have ARM events collect") AC_SUBST([WITH_ARM]) ]) AM_CONDITIONAL([WITH_ARM], [test x$enable_arm = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_ARM], [USE_ARM="yes"], [USE_ARM="no"]) AC_ARG_ENABLE([mce], AS_HELP_STRING([--enable-mce], [enable MCE events (currently experimental)])) AS_IF([test "x$enable_mce" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_MCE,1,"have MCE events collect") AC_SUBST([WITH_MCE]) ]) AM_CONDITIONAL([WITH_MCE], [test x$enable_mce = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_MCE], [USE_MCE="yes"], [USE_MCE="no"]) AC_ARG_ENABLE([extlog], AS_HELP_STRING([--enable-extlog], [enable EXTLOG events (currently experimental)])) AS_IF([test "x$enable_extlog" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_EXTLOG,1,"have EXTLOG events collect") AC_SUBST([WITH_EXTLOG]) ]) AM_CONDITIONAL([WITH_EXTLOG], [test x$enable_extlog = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_EXTLOG], [USE_EXTLOG="yes"], [USE_EXTLOG="no"]) AC_ARG_ENABLE([devlink], AS_HELP_STRING([--enable-devlink], [enable devlink health events (currently experimental)])) AS_IF([test "x$enable_devlink" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_DEVLINK,1,"have devlink health
events collect") AC_SUBST([WITH_DEVLINK]) ]) AM_CONDITIONAL([WITH_DEVLINK], [test x$enable_devlink = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_DEVLINK], [USE_DEVLINK="yes"], [USE_DEVLINK="no"]) AC_ARG_ENABLE([diskerror], AS_HELP_STRING([--enable-diskerror], [enable disk I/O error events (currently experimental)])) AS_IF([test "x$enable_diskerror" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_DISKERROR,1,"have disk I/O errors collect") AC_SUBST([WITH_DISKERROR]) ]) AM_CONDITIONAL([WITH_DISKERROR], [test x$enable_diskerror = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_DISKERROR], [USE_DISKERROR="yes"], [USE_DISKERROR="no"]) AC_ARG_ENABLE([memory_failure], AS_HELP_STRING([--enable-memory-failure], [enable memory failure events (currently experimental)])) AS_IF([test "x$enable_memory_failure" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_MEMORY_FAILURE,1,"have memory failure events collect") AC_SUBST([WITH_MEMORY_FAILURE]) ]) AM_CONDITIONAL([WITH_MEMORY_FAILURE], [test x$enable_memory_failure = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_MEMORY_FAILURE], [USE_MEMORY_FAILURE="yes"], [USE_MEMORY_FAILURE="no"]) AC_ARG_ENABLE([cxl], AS_HELP_STRING([--enable-cxl], [enable CXL events (currently experimental)])) AS_IF([test "x$enable_cxl" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_CXL,1,"have CXL events collect") AC_SUBST([WITH_CXL]) ]) AM_CONDITIONAL([WITH_CXL], [test x$enable_cxl = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_CXL], [USE_CXL="yes"], [USE_CXL="no"]) AC_ARG_ENABLE([abrt_report], AS_HELP_STRING([--enable-abrt-report], [enable report event to ABRT (currently experimental)])) AS_IF([test "x$enable_abrt_report" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_ABRT_REPORT,1,"have report event to ABRT") AC_SUBST([WITH_ABRT_REPORT]) ]) AM_CONDITIONAL([WITH_ABRT_REPORT], [test x$enable_abrt_report = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_ABRT_REPORT],
[USE_ABRT_REPORT="yes"], [USE_ABRT_REPORT="no"]) AC_ARG_ENABLE([hisi_ns_decode], AS_HELP_STRING([--enable-hisi-ns-decode], [enable HISI_NS_DECODE events (currently experimental)])) AS_IF([test "x$enable_hisi_ns_decode" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_HISI_NS_DECODE,1,"have HISI UNKNOWN_SEC events decode") AC_SUBST([WITH_HISI_NS_DECODE]) ]) AM_CONDITIONAL([WITH_HISI_NS_DECODE], [test x$enable_hisi_ns_decode = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_HISI_NS_DECODE], [USE_HISI_NS_DECODE="yes"], [USE_HISI_NS_DECODE="no"]) AC_ARG_ENABLE([memory_ce_pfa], AS_HELP_STRING([--enable-memory-ce-pfa], [enable memory Corrected Error predictive failure analysis])) AS_IF([test "x$enable_memory_ce_pfa" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_MEMORY_CE_PFA,1,"have memory corrected error predictive failure analysis") AC_SUBST([WITH_MEMORY_CE_PFA]) ]) AM_CONDITIONAL([WITH_MEMORY_CE_PFA], [test x$enable_memory_ce_pfa = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_MEMORY_CE_PFA], [USE_MEMORY_CE_PFA="yes"], [USE_MEMORY_CE_PFA="no"]) AC_ARG_ENABLE([amp_ns_decode], AS_HELP_STRING([--enable-amp-ns-decode], [enable AMP_NS_DECODE events (currently experimental)])) AS_IF([test "x$enable_amp_ns_decode" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_AMP_NS_DECODE,1,"have AMP UNKNOWN_SEC events decode") AC_SUBST([WITH_AMP_NS_DECODE]) ]) AM_CONDITIONAL([WITH_AMP_NS_DECODE], [test x$enable_amp_ns_decode = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_AMP_NS_DECODE], [USE_AMP_NS_DECODE="yes"], [USE_AMP_NS_DECODE="no"]) AC_ARG_ENABLE([jaguar_ns_decode], AS_HELP_STRING([--enable-jaguar-ns-decode], [enable JAGUAR_NS_DECODE events (currently experimental)])) AS_IF([test "x$enable_jaguar_ns_decode" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_JAGUAR_NS_DECODE,1,"have JaguarMicro UNKNOWN_SEC events decode") AC_SUBST([WITH_JAGUAR_NS_DECODE]) ]) AM_CONDITIONAL([WITH_JAGUAR_NS_DECODE], [test 
x$enable_jaguar_ns_decode = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_JAGUAR_NS_DECODE], [USE_JAGUAR_NS_DECODE="yes"], [USE_JAGUAR_NS_DECODE="no"]) AC_ARG_ENABLE([cpu_fault_isolation], AS_HELP_STRING([--enable-cpu-fault-isolation], [enable cpu online fault isolation])) AS_IF([test "x$enable_cpu_fault_isolation" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_CPU_FAULT_ISOLATION,1,"have cpu online fault isolation") AC_SUBST([WITH_CPU_FAULT_ISOLATION]) ]) AM_CONDITIONAL([WITH_CPU_FAULT_ISOLATION], [test x$enable_cpu_fault_isolation = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_CPU_FAULT_ISOLATION], [USE_CPU_FAULT_ISOLATION="yes"], [USE_CPU_FAULT_ISOLATION="no"]) AC_ARG_ENABLE([yitian_ns_decode], AS_HELP_STRING([--enable-yitian-ns-decode], [enable YITIAN_NS_DECODE events (currently experimental)])) AS_IF([test "x$enable_yitian_ns_decode" = "xyes" || test "x$enable_all" = "xyes"], [ AC_DEFINE(HAVE_YITIAN_NS_DECODE,1,"have YITIAN UNKNOWN_SEC events decode") AC_SUBST([WITH_YITIAN_NS_DECODE]) ]) AM_CONDITIONAL([WITH_YITIAN_NS_DECODE], [test x$enable_yitian_ns_decode = xyes || test x$enable_all = xyes]) AM_COND_IF([WITH_YITIAN_NS_DECODE], [USE_YITIAN_NS_DECODE="yes"], [USE_YITIAN_NS_DECODE="no"]) test "$sysconfdir" = '${prefix}/etc' && sysconfdir=/etc CFLAGS="$CFLAGS -Wall -Wmissing-prototypes -Wstrict-prototypes" AC_SUBST([rasstatedir], [$localstatedir/lib/rasdaemon]) AC_DEFINE_DIR([RASSTATEDIR], [rasstatedir], [rasdaemon db store state dir]) AC_SUBST([RASSTATEDIR]) AC_ARG_WITH(sysconfdefdir, AS_HELP_STRING([--with-sysconfdefdir=DIR],[rasdaemon environment file dir]), [SYSCONFDEFDIR=$withval], [SYSCONFDEFDIR=/etc/sysconfig]) AC_SUBST([SYSCONFDEFDIR]) AC_DEFINE([RAS_DB_FNAME], ["ras-mc_event.db"], [ras events database]) AC_SUBST([RAS_DB_FNAME], ["ras-mc_event.db"]) AC_OUTPUT dnl --------------------------------------------------------------------- dnl compile time options summary cat <<EOF compile time options summary
============================ Sqlite3 : $USE_SQLITE3 AER : $USE_AER MCE : $USE_MCE EXTLOG : $USE_EXTLOG CPER non-standard : $USE_NON_STANDARD ABRT report : $USE_ABRT_REPORT HISI Kunpeng errors : $USE_HISI_NS_DECODE ARM events : $USE_ARM DEVLINK : $USE_DEVLINK Disk I/O errors : $USE_DISKERROR Memory Failure : $USE_MEMORY_FAILURE CXL events : $USE_CXL Memory CE PFA : $USE_MEMORY_CE_PFA AMP RAS errors : $USE_AMP_NS_DECODE CPU fault isolation : $USE_CPU_FAULT_ISOLATION YITIAN RAS errors : $USE_YITIAN_NS_DECODE JAGUAR RAS errors : $USE_JAGUAR_NS_DECODE EOF 07070100000012000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/contrib07070100000013000081ED00000000000000000000000165C04BE400000551000000000000000000000000000000000000003800000000rasdaemon-0.8.0.49.git+f9cb13b/contrib/edac-fake-inject#!/bin/bash MC="$(ls -d /sys/devices/system/edac/mc/mc? |sed -e s,.*/mc,,)" SYSFS="$(cat /proc/mounts|grep debugfs|cut -d' ' -f 2)" if [ ! -e $SYSFS/edac/ ]; then echo "$SYSFS/edac not found." 
echo " It seems that your kernel was not compiled with CONFIG_EDAC_DEBUG."; exit -1; fi for i in $MC; do LAYER1=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d ' ' -f 1) LAYER2=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d ' ' -f 3) LAYER3=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d ' ' -f 5) DEBUGFS=$SYSFS/edac/mc$i/ MAX=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d' ' -f 2) for j in `seq 0 $MAX`; do echo $j > $DEBUGFS/fake_inject_$LAYER1 if [ "$LAYER2" == "" ]; then echo "Injecting errors at mc#$i $j" echo > $DEBUGFS/fake_inject else MAX=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d' ' -f 4) for k in `seq 0 $MAX`; do echo $k > $DEBUGFS/fake_inject_$LAYER2 if [ "$LAYER3" == "" ]; then echo "Injecting errors at mc#$i $j:$k" echo > $DEBUGFS/fake_inject else MAX=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d' ' -f 6) for l in `seq 0 $MAX`; do echo $l > $DEBUGFS/fake_inject_$LAYER3 echo "Injecting errors at mc#$i $j:$k:$l" echo > $DEBUGFS/fake_inject done fi done fi done done 07070100000014000081ED00000000000000000000000165C04BE400000953000000000000000000000000000000000000003200000000rasdaemon-0.8.0.49.git+f9cb13b/contrib/edac-tests#!/bin/bash FILE=new-$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").txt run() { echo "# $@" >> $FILE $@ 2>> $FILE >> $FILE ERR=$? # save the command's exit status before any test overwrites $? if [ "$ERR" != "0" ]; then echo "Error on $0# $@" echo "Error $ERR" >> $FILE tail -f $FILE exit -1 fi } run_noerror() { echo "# $@" >> $FILE $@ 2>> $FILE >> $FILE } DRIVER=$(lsmod|grep edac_core|cut -b 33-) echo "RUNNING driver $DRIVER on kernel `uname -r`, hostname `hostname`, at `date`" >$FILE mount -t debugfs debugfs /sys/kernel/debug/ run ras-mc-ctl --layout run ras-mc-ctl --guess-labels run free -l run dmidecode run_noerror tree /sys/devices/system/edac/ run_noerror grep . /sys/devices/system/edac/mc/mc?/dimm*/* /sys/devices/system/edac/mc/mc?/rank*/* run_noerror grep .
/sys/devices/system/edac/mc/mc?/csrow*/* run_noerror dmesg for i in /sys/devices/system/edac/mc/mc?/reset_counters; do echo 1 >$i done echo "ras:*" >/sys/kernel/debug/tracing/set_event run echo "Enabled events: " run cat /sys/kernel/debug/tracing/set_event MC="$(ls -d /sys/devices/system/edac/mc/mc? |sed -e s,.*/mc,,)" for i in $MC; do LAYER1=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d ' ' -f 1) LAYER2=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d ' ' -f 3) LAYER3=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d ' ' -f 5) DEBUGFS=/sys/kernel/debug/edac/mc$i/ MAX=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d' ' -f 2) for j in `seq 0 $MAX`; do MAX=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d' ' -f 4) for k in `seq 0 $MAX`; do MAX=$(cat /sys/devices/system/edac/mc/mc$i/max_location |cut -d' ' -f 6) echo $j > $DEBUGFS/fake_inject_$LAYER1 echo $k > $DEBUGFS/fake_inject_$LAYER2 if [ "$MAX" == "" ]; then echo "Injecting errors at mc#$i $j:$k" echo "Injecting errors at mc#$i $j:$k" >> $FILE echo > $DEBUGFS/fake_inject dmesg |tail -3 >> $FILE else for l in `seq 0 $MAX`; do echo "Injecting errors at mc#$i $j:$k:$l" echo "Injecting errors at mc#$i $j:$k:$l" >> $FILE echo $l > $DEBUGFS/fake_inject_$LAYER3 echo > $DEBUGFS/fake_inject dmesg |tail -3 >> $FILE done fi done done done run grep . /sys/devices/system/edac/mc/mc?/*e_* run cat /sys/kernel/debug/tracing/trace # FIXME: need to add some logic there to check if the proper error # counts are incremented, without producing a very long log 07070100000015000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000002600000000rasdaemon-0.8.0.49.git+f9cb13b/labels07070100000016000081A400000000000000000000000165C04BE4000001E9000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/labels/apple# RASDAEMON Motherboard DIMM labels Database file. 
# # Vendor-name and model-name are found from the program 'dmidecode' # labels are found from the silk screen on the motherboard. # #Vendor: <vendor-name> # Model: <model-name> # <label>: <mc>.<branch>.<channel>.<slot> # Vendor: Apple Inc. Model: Mac-F42C88C8 DIMM1_RA: 0.1.0.0; DIMM2_RA: 0.1.1.0; DIMM3_RA: 0.1.0.1; DIMM4_RA: 0.1.1.1; DIMM1_RB: 0.0.0.0; DIMM2_RB: 0.0.1.0; DIMM3_RB: 0.0.0.1; DIMM4_RB: 0.0.1.1; 07070100000017000081A400000000000000000000000165C04BE40000031D000000000000000000000000000000000000002D00000000rasdaemon-0.8.0.49.git+f9cb13b/labels/asrock# RASDAEMON Motherboard DIMM labels Database file. # # Vendor-name and model-name are found from the program 'dmidecode' # labels are found from the silk screen on the motherboard. # #Vendor: <vendor-name> # Product: <product-name> # Model: <model-name> # <label>: <mc>.<top>.<mid>.<low> # # #Vendor: <vendor-name> # Model: <model-name> # <label>: <mc>.<row>.<channel> # Vendor: ASRock Model: X570 Phantom Gaming X DIMM_A1: 0.0.1, 0.1.1; DIMM_A2: 0.2.1, 0.3.1; DIMM_B1: 0.0.0, 0.1.0; DIMM_B2: 0.2.0, 0.3.0; Vendor: ASRockRack Model: X399D8A-2T DIMM_A1: 0.2.0, 0.3.0; DIMM_A2: 0.0.0, 0.1.0; DIMM_B1: 0.2.1, 0.3.1; DIMM_B2: 0.0.1, 0.1.1; DIMM_C1: 2.2.0, 2.3.0; DIMM_C2: 2.0.0, 2.1.0; DIMM_D1: 2.2.1, 2.3.1; DIMM_D2: 2.0.1, 2.1.1; 07070100000018000081A400000000000000000000000165C04BE400000473000000000000000000000000000000000000002B00000000rasdaemon-0.8.0.49.git+f9cb13b/labels/asus# RASDAEMON Motherboard DIMM labels Database file. # # Vendor-name and model-name are found from the program 'dmidecode' # labels are found from the silk screen on the motherboard. # #Vendor: <vendor-name> # Product: <product-name> # Model: <model-name> # <label>: <mc>.<top>.<mid>.<low> # # #Vendor: <vendor-name> # Model: <model-name> # <label>: <mc>.<row>.<channel> # Vendor: ASUSTeK COMPUTER INC. 
Model: PRIME X570-PRO DIMM_A1: 0.0.1, 0.1.1; DIMM_A2: 0.2.1, 0.3.1; DIMM_B1: 0.0.0, 0.1.0; DIMM_B2: 0.2.0, 0.3.0; Model: TUF GAMING B450-PLUS II DIMM_A1: 0.0.1, 0.1.1; DIMM_A2: 0.2.1, 0.3.1; DIMM_B1: 0.0.0, 0.1.0; DIMM_B2: 0.2.0, 0.3.0; Vendor: ASUSTeK COMPUTER INC. Model: Z9PH-D16 Series CPU1_DIMM_A1: 0.0.0 CPU1_DIMM_A2: 0.0.1 CPU1_DIMM_B1: 0.1.0 CPU1_DIMM_B2: 0.1.1 CPU1_DIMM_C1: 0.2.0 CPU1_DIMM_C2: 0.2.1 CPU1_DIMM_D1: 0.3.0 CPU1_DIMM_D2: 0.3.1 CPU2_DIMM_E1: 1.0.0 CPU2_DIMM_E2: 1.0.1 CPU2_DIMM_F1: 1.1.0 CPU2_DIMM_F2: 1.1.1 CPU2_DIMM_G1: 1.2.0 CPU2_DIMM_G2: 1.2.1 CPU2_DIMM_H1: 1.3.0 CPU2_DIMM_H2: 1.3.1 07070100000019000081A400000000000000000000000165C04BE400001A85000000000000000000000000000000000000002B00000000rasdaemon-0.8.0.49.git+f9cb13b/labels/dell# RASDAEMON Motherboard DIMM labels Database file. # # Vendor-name and model-name are found from the program 'dmidecode' # labels are found from the silk screen on the motherboard. # #Vendor: <vendor-name> # Product: <product-name> # Model: <model-name> # <label>: <mc>.<top>.<mid>.<low> # Vendor: Dell Inc. 
# 1-socket Product: PowerEdge R220, PowerEdge R330, PowerEdge T330, PowerEdge R230, PowerEdge T130, PowerEdge T30 DIMM_A1: 0.0.0; DIMM_A2: 0.0.1; DIMM_A3: 0.1.0; DIMM_A4: 0.1.1; Product: PowerEdge T110 II, PowerEdge T20 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_B1: 0.0.1; DIMM_B2: 0.1.1; Product: PowerEdge R320, PowerEdge T320 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_A3: 0.2.0; DIMM_A4: 0.0.1; DIMM_A5: 0.1.1; DIMM_A6: 0.2.1; # 2-socket Product: PowerEdge R610 DIMM_A1: 0.0.0; DIMM_A2: 0.0.1; DIMM_A3: 0.0.2; DIMM_A4: 0.1.0; DIMM_A5: 0.1.1; DIMM_A6: 0.1.2; DIMM_B1: 1.0.0; DIMM_B2: 1.0.1; DIMM_B3: 1.0.2; DIMM_B4: 1.1.0; DIMM_B5: 1.1.1; DIMM_B6: 1.1.2; Product: PowerEdge T710, PowerEdge R710 DIMM_A3: 0.0.0; DIMM_A2: 0.1.0; DIMM_A1: 0.2.0; DIMM_A6: 0.0.1; DIMM_A5: 0.1.1; DIMM_A4: 0.2.1; DIMM_A9: 0.0.2; DIMM_A8: 0.1.2; DIMM_A7: 0.2.2; DIMM_B3: 1.0.0; DIMM_B2: 1.1.0; DIMM_B1: 1.2.0; DIMM_B6: 1.0.1; DIMM_B5: 1.1.1; DIMM_B4: 1.2.1; DIMM_B9: 1.0.2; DIMM_B8: 1.1.2; DIMM_B7: 1.2.2; Product: PowerEdge R620, PowerEdge T620, PowerEdge R720xd, PowerEdge R730xd, PowerEdge T630, PowerEdge R730, PowerEdge R630, PowerEdge T620, PowerEdge M620, PowerEdge FC620, PowerEdge M630, PowerEdge FC630 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_A3: 0.2.0; DIMM_A4: 0.3.0; DIMM_A5: 0.0.1; DIMM_A6: 0.1.1; DIMM_A7: 0.2.1; DIMM_A8: 0.3.1; DIMM_A9: 0.0.2; DIMM_A10: 0.1.2; DIMM_A11: 0.2.2; DIMM_A12: 0.3.2; DIMM_B1: 1.0.0; DIMM_B2: 1.1.0; DIMM_B3: 1.2.0; DIMM_B4: 1.3.0; DIMM_B5: 1.0.1; DIMM_B6: 1.1.1; DIMM_B7: 1.2.1; DIMM_B8: 1.3.1; DIMM_B9: 1.0.2; DIMM_B10: 1.1.2; DIMM_B11: 1.2.2; DIMM_B12: 1.3.2; Product: PowerEdge R640, PowerEdge R740, PowerEdge R740xd, PowerEdge T640 A1: 0.0.0; A2: 0.1.0; A3: 0.2.0; A4: 1.0.0; A5: 1.1.0; A6: 1.2.0; A7: 0.0.1; A8: 0.1.1; A9: 0.2.1; A10: 1.0.1; A11: 1.1.1; A12: 1.2.1; B1: 2.0.0; B2: 2.1.0; B3: 2.2.0; B4: 3.0.0; B5: 3.1.0; B6: 3.2.0; B7: 2.0.1; B8: 2.1.1; B9: 2.2.1; B10: 3.0.1; B11: 3.1.1; B12: 3.2.1; Product: PowerEdge M520, PowerEdge R420, PowerEdge T420 DIMM_A1: 0.1.0; DIMM_A2: 
0.2.0; DIMM_A3: 0.3.0; DIMM_A4: 0.1.1; DIMM_A5: 0.2.1; DIMM_A6: 0.3.1; DIMM_B1: 1.1.0; DIMM_B2: 1.2.0; DIMM_B3: 1.3.0; DIMM_B4: 1.1.1; DIMM_B5: 1.2.1; DIMM_B6: 1.3.1; Product: PowerEdge FC420, PowerEdge M420 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_A3: 0.2.0; DIMM_B1: 1.0.0; DIMM_B2: 1.1.0; DIMM_B3: 1.2.0; Product: PowerEdge C6320, PowerEdge C4130 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_A3: 0.2.0; DIMM_A4: 0.3.0; DIMM_A5: 0.0.1; DIMM_A6: 0.1.1; DIMM_A7: 0.2.1; DIMM_A8: 0.3.1; DIMM_B1: 1.0.0; DIMM_B2: 1.1.0; DIMM_B3: 1.2.0; DIMM_B4: 1.3.0; DIMM_B5: 1.0.1; DIMM_B6: 1.1.1; DIMM_B7: 1.2.1; DIMM_B8: 1.3.1; Product: PowerEdge C6320p A1: 0.0.0; B1: 0.1.0; C1: 0.2.0; D1: 1.0.0; E1: 1.1.0; F1: 1.2.0; Product: PowerEdge C6420 A1: 0.0.0; A2: 0.1.0; A3: 0.2.0; A4: 1.0.0; A5: 1.1.0; A6: 1.2.0; A7: 0.0.1; A8: 1.0.1; B1: 2.0.0; B2: 2.1.0; B3: 2.2.0; B4: 3.0.0; B5: 3.1.0; B6: 3.2.0; B7: 2.0.1; B8: 3.0.1; Product: PowerEdge R430, PowerEdge T430, PowerEdge R530 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_A3: 0.2.0; DIMM_A4: 0.3.0; DIMM_A5: 0.0.1; DIMM_A6: 0.1.1; DIMM_A7: 0.2.1; DIMM_A8: 0.3.1; DIMM_B1: 1.0.0; DIMM_B2: 1.1.0; DIMM_B3: 1.2.0; DIMM_B4: 1.3.0; Product: PowerEdge FC430 DIMM_A1: 0.1.0; DIMM_A2: 0.0.0; DIMM_A3: 0.2.0; DIMM_A4: 0.3.0; DIMM_B1: 1.1.0; DIMM_B2: 1.0.0; DIMM_B3: 1.2.0; DIMM_B4: 1.3.0; # 4-socket Product: PowerEdge M820, PowerEdge R830, PowerEdge M830, PowerEdge R930, PowerEdge FC830 DIMM_A1: 0.0.0; DIMM_A2: 0.1.0; DIMM_A3: 0.2.0; DIMM_A4: 0.3.0; DIMM_A5: 0.0.1; DIMM_A6: 0.1.1; DIMM_A7: 0.2.1; DIMM_A8: 0.3.1; DIMM_A9: 0.0.2; DIMM_A10: 0.1.2; DIMM_A11: 0.2.2; DIMM_A12: 0.3.2; DIMM_B1: 1.0.0; DIMM_B2: 1.1.0; DIMM_B3: 1.2.0; DIMM_B4: 1.3.0; DIMM_B5: 1.0.1; DIMM_B6: 1.1.1; DIMM_B7: 1.2.1; DIMM_B8: 1.3.1; DIMM_B9: 1.0.2; DIMM_B10: 1.1.2; DIMM_B11: 1.2.2; DIMM_B12: 1.3.2; DIMM_C1: 2.0.0; DIMM_C2: 2.1.0; DIMM_C3: 2.2.0; DIMM_C4: 2.3.0; DIMM_C5: 2.0.1; DIMM_C6: 2.1.1; DIMM_C7: 2.2.1; DIMM_C8: 2.3.1; DIMM_C9: 2.0.2; DIMM_C10: 2.1.2; DIMM_C11: 2.2.2; DIMM_C12: 2.3.2; DIMM_D1: 3.0.0; 
DIMM_D2: 3.1.0; DIMM_D3: 3.2.0; DIMM_D4: 3.3.0; DIMM_D5: 3.0.1; DIMM_D6: 3.1.1; DIMM_D7: 3.2.1; DIMM_D8: 3.3.1; DIMM_D9: 3.0.2; DIMM_D10: 3.1.2; DIMM_D11: 3.2.2; DIMM_D12: 3.3.2; Product: PowerEdge FM120x4 DIMM_A_A1: 0.1.0; DIMM_A_A2: 0.2.0; DIMM_B_A1: 1.1.0; DIMM_B_A2: 1.2.0; DIMM_C_A1: 2.1.0; DIMM_C_A2: 2.2.0; DIMM_D_A1: 3.1.0; DIMM_D_A2: 3.2.0; Product: PowerEdge R940 A1: 0.0.0; A2: 0.1.0; A3: 0.2.0; A4: 1.0.0; A5: 1.1.0; A6: 1.2.0; A7: 0.0.1; A8: 0.1.1; A9: 0.2.1; A10: 1.0.1; A11: 1.1.1; A12: 1.2.1; B1: 2.0.0; B2: 2.1.0; B3: 2.2.0; B4: 3.0.0; B5: 3.1.0; B6: 3.2.0; B7: 2.0.1; B8: 2.1.1; B9: 2.2.1; B10: 3.0.1; B11: 3.1.1; B12: 3.2.1; C1: 4.0.0; C2: 4.1.0; C3: 4.2.0; C4: 5.0.0; C5: 5.1.0; C6: 5.2.0; C7: 4.0.1; C8: 4.1.1; C9: 4.2.1; C10: 5.0.1; C11: 5.1.1; C12: 5.2.1; D1: 6.0.0; D2: 6.1.0; D3: 6.2.0; D4: 7.0.0; D5: 7.1.0; D6: 7.2.0; D7: 6.0.1; D8: 6.1.1; D9: 6.2.1; D10: 7.0.1; D11: 7.1.1; D12: 7.2.1; Product: PowerEdge R440, PowerEdge R540 A1: 0.0.0; A2: 0.1.0; A3: 0.2.0; A4: 1.0.0; A5: 1.1.0; A6: 1.2.0; A7: 0.0.1; A8: 0.1.1; A9: 1.0.1; A10: 1.1.1; B1: 2.0.0; B2: 2.1.0; B3: 2.2.0; B4: 3.0.0; B5: 3.1.0; B6: 3.2.0; Product: PowerEdge M640, PowerEdge FC640 A1: 0.0.0; A2: 0.1.0; A3: 0.2.0; A4: 1.0.0; A5: 1.1.0; A6: 1.2.0; A7: 0.0.1; A8: 1.0.1; B1: 2.0.0; B2: 2.1.0; B3: 2.2.0; B4: 3.0.0; B5: 3.1.0; B6: 3.2.0; B7: 2.0.1; B8: 3.0.1; 0707010000001A000081A400000000000000000000000165C04BE4000009FF000000000000000000000000000000000000002F00000000rasdaemon-0.8.0.49.git+f9cb13b/labels/gigabyte# vendor: GIGABYTE model: MZ62-HD0-00 # Gigabyte schema: # # <label>: <mc>.<row>.<channel> # ras-mc-ctl --layout # +---------------------------------------------------------------------------------------------------+ # | mc0 | mc1 | # | csrow0 | csrow1 | csrow2 | csrow3 | csrow0 | csrow1 | csrow2 | csrow3 | #----------+---------------------------------------------------------------------------------------------------+ #channel7: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 
MB | 0 MB | #channel6: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #----------+---------------------------------------------------------------------------------------------------+ #channel5: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #channel4: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #----------+---------------------------------------------------------------------------------------------------+ #channel3: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #channel2: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #----------+---------------------------------------------------------------------------------------------------+ #channel1: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #channel0: | 16384 MB | 16384 MB | 0 MB | 0 MB | 16384 MB | 16384 MB | 0 MB | 0 MB | #----------+---------------------------------------------------------------------------------------------------+ DIMM_P0_A0: 0.0.0 DIMM_P0_A1: 0.0.1 DIMM_P0_B0: 0.0.2 DIMM_P0_B1: 0.0.3 DIMM_P0_C0: 0.0.4 DIMM_P0_C1: 0.0.5 DIMM_P0_D0: 0.0.6 DIMM_P0_D1: 0.0.7 DIMM_P0_E0: 0.1.0 DIMM_P0_E1: 0.1.1 DIMM_P0_F0: 0.1.2 DIMM_P0_F1: 0.1.3 DIMM_P0_G0: 0.1.4 DIMM_P0_G1: 0.1.5 DIMM_P0_H0: 0.1.6 DIMM_P0_H1: 0.1.7 DIMM_P1_I0: 1.0.0 DIMM_P1_I1: 1.0.1 DIMM_P1_J0: 1.0.2 DIMM_P1_J1: 1.0.3 DIMM_P1_K0: 1.0.4 DIMM_P1_K1: 1.0.5 DIMM_P1_L0: 1.0.6 DIMM_P1_L1: 1.0.7 DIMM_P1_M0: 1.1.0 DIMM_P1_M1: 1.1.1 DIMM_P1_N0: 1.1.2 DIMM_P1_N1: 1.1.3 DIMM_P1_O0: 1.1.4 DIMM_P1_O1: 1.1.5 DIMM_P1_P0: 1.1.6 DIMM_P1_P1: 1.1.7 0707010000001B000081A400000000000000000000000165C04BE4000010D5000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/labels/supermicro# RASDAEMON Motherboard DIMM labels Database file. # # Vendor-name and model-name are found from the program 'dmidecode' # labels are found from the silk screen on the motherboard.
# #Vendor: <vendor-name> # Product: <product-name> # Model: <model-name> # <label>: <mc>.<top>.<mid>.<low> # Vendor: Supermicro Model: A2SDi-8C-HLN4F, A2SDi-8C+-HLN4F DIMMA1: 0.0.0; DIMMA2: 0.0.1; DIMMB1: 0.1.0; DIMMB2: 0.1.1; Model: X10SRA-F DIMMA1: 0.0.0; DIMMA2: 0.0.1; DIMMB1: 0.1.0; DIMMB2: 0.1.1; DIMMC1: 1.0.0; DIMMC2: 1.0.1; DIMMD1: 1.1.0; DIMMD2: 1.1.1; Model: H8DGU P1_DIMM1A: 0.2.0; P1_DIMM1A: 0.3.0; P2_DIMM1A: 3.2.0; P2_DIMM1A: 3.3.0; P1_DIMM2A: 0.2.1; P1_DIMM2A: 0.3.1; P2_DIMM2A: 3.2.1; P2_DIMM2A: 3.3.1; P1_DIMM3A: 1.2.0; P1_DIMM3A: 1.3.0; P2_DIMM3A: 2.2.0; P2_DIMM3A: 2.3.0; P1_DIMM4A: 1.2.1; P1_DIMM4A: 1.3.1; P2_DIMM4A: 2.2.1; P2_DIMM4A: 2.3.1; P1_DIMM1B: 0.0.0; P1_DIMM1B: 0.1.0; P2_DIMM1B: 3.0.0; P2_DIMM1B: 3.1.0; P1_DIMM2B: 0.0.1; P1_DIMM2B: 0.1.1; P2_DIMM2B: 3.0.1; P2_DIMM2B: 3.1.1; P1_DIMM3B: 1.0.0; P1_DIMM3B: 1.1.0; P2_DIMM3B: 2.0.0; P2_DIMM3B: 2.1.0; P1_DIMM4B: 1.0.1; P1_DIMM4B: 1.1.1; P2_DIMM4B: 2.0.1; P2_DIMM4B: 2.1.1; Model: X11DPH-i, X11DPH-T, X11DPH-TQ P1-DIMMA1: 0.0.0; P1-DIMMA2: 0.0.1; P1-DIMMB1: 0.1.0; P1-DIMMC1: 0.2.0; P1-DIMMD1: 1.0.0; P1-DIMMD2: 1.0.1; P1-DIMME1: 1.1.0; P1-DIMMF1: 1.2.0; P2-DIMMA1: 2.0.0; P2-DIMMA2: 2.0.1; P2-DIMMB1: 2.1.0; P2-DIMMC1: 2.2.0; P2-DIMMD1: 3.0.0; P2-DIMMD2: 3.0.1; P2-DIMME1: 3.1.0; P2-DIMMF1: 3.2.0; Model: X10DRI, X10DRI-T P1-DIMMA1: 0.0.0; P1-DIMMA2: 0.0.1; P1-DIMMB1: 0.1.0; P1-DIMMB2: 0.1.1; P1-DIMMC1: 0.2.0; P1-DIMMC2: 0.2.1; P1-DIMMD1: 0.3.0; P1-DIMMD2: 0.3.1; P2-DIMME1: 1.0.0; P2-DIMME2: 1.0.1; P2-DIMMF1: 1.1.0; P2-DIMMF2: 1.1.1; P2-DIMMG1: 1.2.0; P2-DIMMG2: 1.2.1; P2-DIMMH1: 1.3.0; P2-DIMMH2: 1.3.1; Model: X10DRL-i P1-DIMMA1: 0.0.0; P1-DIMMB1: 0.1.0; P1-DIMMC1: 0.2.0; P1-DIMMD1: 0.3.0; P2-DIMME1: 1.0.0; P2-DIMMF1: 1.1.0; P2-DIMMG1: 1.2.0; P2-DIMMH1: 1.3.0; Model: X11DDW-NT, X11DDW-L P1-DIMMA1: 0.0.0; P1-DIMMB1: 0.1.0; P1-DIMMC1: 0.2.0; P1-DIMMD1: 1.0.0; P1-DIMME1: 1.1.0; P1-DIMMF1: 1.2.0; P2-DIMMA1: 2.0.0; P2-DIMMB1: 2.1.0; P2-DIMMC1: 2.2.0; P2-DIMMD1: 3.0.0; P2-DIMME1: 3.1.0; P2-DIMMF1: 3.2.0; Model:
X11DPi-N, X11DPi-NT P1-DIMMA1: 0.0.0; P1-DIMMA2: 0.0.1; P1-DIMMB1: 0.1.0; P1-DIMMC1: 0.2.0; P1-DIMMD1: 1.0.0; P1-DIMMD2: 1.0.1; P1-DIMME1: 1.1.0; P1-DIMMF1: 1.2.0; P2-DIMMA1: 2.0.0; P2-DIMMA2: 2.0.1; P2-DIMMB1: 2.1.0; P2-DIMMC1: 2.2.0; P2-DIMMD1: 3.0.0; P2-DIMMD2: 3.0.1; P2-DIMME1: 3.1.0; P2-DIMMF1: 3.2.0; Model: X11SPM-F, X11SPM-TF, X11SPM-TPF DIMMA1: 0.0.0; DIMMB1: 0.1.0; DIMMC1: 0.2.0; DIMMD1: 1.0.0; DIMME1: 1.1.0; DIMMF1: 1.2.0; Model: B1DRi P1_DIMMA1: 0.0.0; P1_DIMMB1: 0.1.0; P1_DIMMC1: 0.2.0; P1_DIMMD1: 0.3.0; P2_DIMME1: 1.0.0; P2_DIMMF1: 1.1.0; P2_DIMMG1: 1.2.0; P2_DIMMH1: 1.3.0; Model: X11SCA, X11SCA-F DIMMA1: 0.0.0, 0.1.0; DIMMA2: 0.2.0, 0.3.0; DIMMB1: 0.0.1, 0.1.1; DIMMB2: 0.2.1, 0.3.1; Model: X11SCW-F DIMMA1: 0.1.0; DIMMA2: 0.0.0; DIMMB1: 0.1.1; DIMMB2: 0.0.1; # Intel Ice Lake-SP # Intel 3rd Generation Xeon Scalable CPU: 4 integrated Memory Controllers (iMC), 8 memory channels, and 16 DIMM slots Model: X12DPU-6 P1-DIMMA1: 0.0.0 P1-DIMMA2: 0.0.1 P1-DIMMB1: 0.1.0 P1-DIMMB2: 0.1.1 P1-DIMMC1: 1.0.0 P1-DIMMC2: 1.0.1 P1-DIMMD1: 1.1.0 P1-DIMMD2: 1.1.1 P1-DIMME1: 2.0.0 P1-DIMME2: 2.0.1 P1-DIMMF1: 2.1.0 P1-DIMMF2: 2.1.1 P1-DIMMG1: 3.0.0 P1-DIMMG2: 3.0.1 P1-DIMMH1: 3.1.0 P1-DIMMH2: 3.1.1 P2-DIMMA1: 4.0.0 P2-DIMMA2: 4.0.1 P2-DIMMB1: 4.1.0 P2-DIMMB2: 4.1.1 P2-DIMMC1: 5.0.0 P2-DIMMC2: 5.0.1 P2-DIMMD1: 5.1.0 P2-DIMMD2: 5.1.1 P2-DIMME1: 6.0.0 P2-DIMME2: 6.0.1 P2-DIMMF1: 6.1.0 P2-DIMMF2: 6.1.1 P2-DIMMG1: 7.0.0 P2-DIMMG2: 7.0.1 P2-DIMMH1: 7.1.0 P2-DIMMH2: 7.1.1 0707010000001C000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000002200000000rasdaemon-0.8.0.49.git+f9cb13b/m40707010000001D000081A400000000000000000000000165C04BE400000057000000000000000000000000000000000000002D00000000rasdaemon-0.8.0.49.git+f9cb13b/m4/.gitignore# Autoreconf adds those libtool.m4 lt~obsolete.m4 ltoptions.m4 ltsugar.m4 ltversion.m4 
0707010000001E000081A400000000000000000000000165C04BE4000004A2000000000000000000000000000000000000003300000000rasdaemon-0.8.0.49.git+f9cb13b/m4/ac_define_dir.m4dnl @synopsis AC_DEFINE_DIR(VARNAME, DIR [, DESCRIPTION]) dnl dnl This macro sets VARNAME to the expansion of the DIR variable, dnl taking care of fixing up ${prefix} and such. dnl dnl VARNAME is then offered as both an output variable and a C dnl preprocessor symbol. dnl dnl Example: dnl dnl AC_DEFINE_DIR([DATADIR], [datadir], [Where data are placed to.]) dnl dnl @category Misc dnl @author Stepan Kasal <kasal@ucw.cz> dnl @author Andreas Schwab <schwab@suse.de> dnl @author Guido U. Draheim <guidod@gmx.de> dnl @author Alexandre Oliva dnl @version 2006-10-13 dnl @license AllPermissive AC_DEFUN([AC_DEFINE_DIR], [ prefix_NONE= exec_prefix_NONE= test "x$prefix" = xNONE && prefix_NONE=yes && prefix=$ac_default_prefix test "x$exec_prefix" = xNONE && exec_prefix_NONE=yes && exec_prefix=$prefix dnl In Autoconf 2.60, ${datadir} refers to ${datarootdir}, which in turn dnl refers to ${prefix}. Thus we have to use `eval' twice. eval ac_define_dir="\"[$]$2\"" eval ac_define_dir="\"$ac_define_dir\"" AC_SUBST($1, "$ac_define_dir") AC_DEFINE_UNQUOTED($1, "$ac_define_dir", [$3]) test "$prefix_NONE" && prefix=NONE test "$exec_prefix_NONE" && exec_prefix=NONE ]) 0707010000001F000081A400000000000000000000000165C04BE400000DB5000000000000000000000000000000000000002F00000000rasdaemon-0.8.0.49.git+f9cb13b/m4/x_ac_meta.m4##***************************************************************************** ## $Id: x_ac_meta.m4 416 2005-07-20 17:50:27Z dun $ ##***************************************************************************** # AUTHOR: # Chris Dunlap <cdunlap@llnl.gov> # # SYNOPSIS: # X_AC_META # # DESCRIPTION: # Read metadata from the META file. 
##***************************************************************************** AC_DEFUN([X_AC_META], [ AC_MSG_CHECKING([metadata]) META="$srcdir/META" _x_ac_meta_got_file=no if test -f "$META"; then _x_ac_meta_got_file=yes META_NAME=_X_AC_META_GETVAL([(?:NAME|PROJECT|PACKAGE)]); if test -n "$META_NAME"; then AC_DEFINE_UNQUOTED([META_NAME], ["$META_NAME"], [Define the project name.] ) AC_SUBST([META_NAME]) fi META_VERSION=_X_AC_META_GETVAL([VERSION]); if test -n "$META_VERSION"; then AC_DEFINE_UNQUOTED([META_VERSION], ["$META_VERSION"], [Define the project version.] ) AC_SUBST([META_VERSION]) fi META_RELEASE=_X_AC_META_GETVAL([RELEASE]); if test -n "$META_RELEASE"; then AC_DEFINE_UNQUOTED([META_RELEASE], ["$META_RELEASE"], [Define the project release.] ) AC_SUBST([META_RELEASE]) fi if test -n "$META_NAME" -a -n "$META_VERSION"; then META_ALIAS="$META_NAME-$META_VERSION" test -n "$META_RELEASE" && META_ALIAS="$META_ALIAS-$META_RELEASE" AC_DEFINE_UNQUOTED([META_ALIAS], ["$META_ALIAS"], [Define the project alias string (name-ver or name-ver-rel).] ) AC_SUBST([META_ALIAS]) fi META_DATE=_X_AC_META_GETVAL([DATE]); if test -n "$META_DATE"; then AC_DEFINE_UNQUOTED([META_DATE], ["$META_DATE"], [Define the project release date.] ) AC_SUBST([META_DATE]) fi META_AUTHOR=_X_AC_META_GETVAL([AUTHOR]); if test -n "$META_AUTHOR"; then AC_DEFINE_UNQUOTED([META_AUTHOR], ["$META_AUTHOR"], [Define the project author.] ) AC_SUBST([META_AUTHOR]) fi m4_pattern_allow([^LT_(CURRENT|REVISION|AGE)$]) META_LT_CURRENT=_X_AC_META_GETVAL([LT_CURRENT]); META_LT_REVISION=_X_AC_META_GETVAL([LT_REVISION]); META_LT_AGE=_X_AC_META_GETVAL([LT_AGE]); if test -n "$META_LT_CURRENT" \ -o -n "$META_LT_REVISION" \ -o -n "$META_LT_AGE"; then test -n "$META_LT_CURRENT" || META_LT_CURRENT="0" test -n "$META_LT_REVISION" || META_LT_REVISION="0" test -n "$META_LT_AGE" || META_LT_AGE="0" AC_DEFINE_UNQUOTED([META_LT_CURRENT], ["$META_LT_CURRENT"], [Define the libtool library 'current' version information.] 
) AC_DEFINE_UNQUOTED([META_LT_REVISION], ["$META_LT_REVISION"], [Define the libtool library 'revision' version information.] ) AC_DEFINE_UNQUOTED([META_LT_AGE], ["$META_LT_AGE"], [Define the libtool library 'age' version information.] ) AC_SUBST([META_LT_CURRENT]) AC_SUBST([META_LT_REVISION]) AC_SUBST([META_LT_AGE]) fi fi AC_MSG_RESULT([$_x_ac_meta_got_file]) ] ) AC_DEFUN([_X_AC_META_GETVAL], [`perl -n\ -e "BEGIN { \\$key=shift @ARGV; }"\ -e "next unless s/^\s*\\$key@<:@:=@:>@//i;"\ -e "s/^((?:@<:@^'\"#@:>@*(?:(@<:@'\"@:>@)@<:@^\2@:>@*\2)*)*)#.*/\\@S|@1/;"\ -e "s/^\s+//;"\ -e "s/\s+$//;"\ -e "s/^(@<:@'\"@:>@)(.*)\1/\\@S|@2/;"\ -e "\\$val=\\$_;"\ -e "END { print \\$val if defined \\$val; }"\ '$1' $META`]dnl ) 07070100000020000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000002300000000rasdaemon-0.8.0.49.git+f9cb13b/man07070100000021000081A400000000000000000000000165C04BE400000019000000000000000000000000000000000000002E00000000rasdaemon-0.8.0.49.git+f9cb13b/man/.gitignorerasdaemon.1 ras-mc-ctl.8 07070100000022000081A400000000000000000000000165C04BE400000025000000000000000000000000000000000000002F00000000rasdaemon-0.8.0.49.git+f9cb13b/man/Makefile.amman_MANS = ras-mc-ctl.8 rasdaemon.1 07070100000023000081A400000000000000000000000165C04BE4000013D4000000000000000000000000000000000000003300000000rasdaemon-0.8.0.49.git+f9cb13b/man/ras-mc-ctl.8.in.\"**************************************************************************** .\" $Id$ .\"**************************************************************************** .\"Copyright (c) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> .\"This tool is a modification of the edac-ctl, written as part of the .\"edac-utils: .\" Copyright (C) 2006-2007 The Regents of the University of California. .\" Produced at Lawrence Livermore National Laboratory. .\" Written by Mark Grondona <mgrondona@llnl.gov> .\" UCRL-CODE-230739. 
.\" .\" This is free software; you can redistribute it and/or modify it .\" under the terms of the GNU General Public License as published by .\" the Free Software Foundation; either version 2 of the License, or .\" (at your option) any later version. .\" .\" This is distributed in the hope that it will be useful, but WITHOUT .\" ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or .\" FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License .\" for more details. .\" .\" You should have received a copy of the GNU General Public License along .\" with this program; if not, write to the Free Software Foundation, Inc., .\" 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. .\"**************************************************************************** .TH RAS-MC-CTL 8 "@META_DATE@" "@META_ALIAS@" "RAS memory controller admin utility" .SH NAME ras-mc-ctl \- RAS memory controller admin utility .SH SYNOPSIS .B ras-mc-ctl [\fIOPTION\fR]... .SH DESCRIPTION The \fBras-mc-ctl\fR program is a \fBperl\fR(1) script which performs some useful RAS administration tasks on EDAC (Error Detection and Correction) drivers. .SH OPTIONS .TP .BI "--help" Display a brief usage message. .TP .BI "--mainboard" Print mainboard vendor and model for this hardware, if available. The method used by \fBras-mc-ctl\fR to obtain the mainboard vendor and model information for the current system is described below in the \fIMAINBOARD CONFIGURATION\fR section. .TP .BI "--status" Print the status of EDAC drivers (loaded or unloaded). .TP .BI "--quiet" Be less verbose when executing an operation. .TP .BI "--register-labels" Register motherboard DIMM labels into EDAC driver sysfs files. This option uses the detected mainboard manufacturer and model number in combination with a "labels database" found in any of the files under @sysconfdir@/ras/dimm_labels.d/* or in the labels.db file at @sysconfdir@/ras/dimm_labels.db. 
An entry for the current hardware must exist in the labels database for this option to do anything. .TP .BI "--print-labels" Display the configured labels for the current hardware, as well as the current labels registered with EDAC. .TP .BI "--guess-labels" Print DMI labels when the bank locator is available in the DMI table. This helps to fill in the labels database at @sysconfdir@/ras/dimm_labels.d/. .TP .BI "--labeldb="DB Specify an alternate location for the labels database. .TP .BI "--delay="time Specify a delay of \fBtime\fR seconds before registering DIMM labels. Only meaningful when used together with --register-labels. .TP .BI "--layout" Prints the memory layout as detected by the EDAC driver. Useful to check whether the EDAC driver is properly detecting the memory controller architecture. .TP .BI "--summary" Presents a summary of the logged errors. .TP .BI "--errors" Shows the errors stored in the error database. .TP .BI "--error-count" Shows the corrected and uncorrected error counts using sysfs. .TP .BI "--vendor-errors-summary="platform-id Presents a summary of the vendor-specific logged errors. .TP .BI "--vendor-errors="platform-id Shows the vendor-specific errors stored in the error database. .TP .BI "--vendor-platforms" Shows the supported platforms with platform-ids for the vendor-specific errors. .SH MAINBOARD CONFIGURATION .PP The \fBras-mc-ctl\fR script uses the following method to determine the current system's mainboard vendor and model information: .IP "1." 4 If the config file @sysconfdir@/edac/mainboard exists, then it is parsed by \fBras-mc-ctl\fR. The mainboard config file has the following simple syntax: .nf vendor = <mainboard vendor string> model = <mainboard model string> script = <script to gather mainboard information> .fi Anything after a '#' character on a line is considered a comment. If the keyword \fBscript\fR is specified, then that script or executable is run by \fBras-mc-ctl\fR to gather the mainboard vendor and model information.
The script should write the resulting information on stdout in the same format as the mainboard config file. .IP "2." If no mainboard config file exists, then \fBras-mc-ctl\fR will attempt to read DMI information from the sysfs files .nf /sys/class/dmi/id/board_vendor /sys/class/dmi/id/board_name .fi .IP "3." If the sysfs files above do not exist, then \fBras-mc-ctl\fR will fall back to parsing output of the \fBdmidecode\fR(8) utility. Use of this utility will most often require that \fBras-mc-ctl\fR be run as root. .SH SEE ALSO \fBrasdaemon\fR(1) 07070100000024000081A400000000000000000000000165C04BE4000009D1000000000000000000000000000000000000003200000000rasdaemon-0.8.0.49.git+f9cb13b/man/rasdaemon.1.in.\"**************************************************************************** .\" $Id$ .\"**************************************************************************** .\"Copyright (c) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> .\" .\" This is free software; you can redistribute it and/or modify it .\" under the terms of the GNU General Public License as published by .\" the Free Software Foundation; either version 2 of the License, or .\" (at your option) any later version. .\" .\" This is distributed in the hope that it will be useful, but WITHOUT .\" ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or .\" FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License .\" for more details. .\" .\" You should have received a copy of the GNU General Public License along .\" with this program; if not, write to the Free Software Foundation, Inc., .\" 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. .\"**************************************************************************** .TH RASDAEMON 8 "@META_DATE@" "@META_ALIAS@" "RAS memory controller admin utility" .SH NAME rasdaemon \- RAS daemon to log the RAS events. .SH SYNOPSIS .B rasdaemon [\fIOPTION\fR]... 
.SH DESCRIPTION The \fBrasdaemon\fR program is a daemon which monitors the platform Reliability, Availability and Serviceability (RAS) reports from the Linux kernel trace events. It reads these trace events from /sys/kernel/debug/tracing and reports them via syslog/journald. .SH OPTIONS .TP .BI "--usage" Display a brief usage message and exit. .TP .BI "--help" Display a help message and exit. .TP .BI "--disable" Disable RAS tracing events and exit. .TP .BI "--enable" Enable RAS tracing events and exit. .TP .BI "--foreground" Run in the foreground, printing the events to the console. Useful for testing, and for use by systemd or a Unix System V init respawn entry. If not specified, the program runs in daemon mode. .TP .BI "--record" Record RAS events via Sqlite3. The Sqlite3 database has the benefit of keeping a persistent record of the RAS events. This feature is used with the ras-mc-ctl utility. Note that rasdaemon may be compiled without this feature. .TP .BI "--version" Print the program version and exit. .SH CONFIG FILE The \fBrasdaemon\fR program supports a config file to set rasdaemon systemd service environment variables. By default the config file is read from /etc/sysconfig/rasdaemon. The general format is environmentname=value. .SH SEE ALSO \fBras-mc-ctl\fR(8) 07070100000025000081A400000000000000000000000165C04BE400002080000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-amd-k8.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * The code below was adapted from Andi Kleen/Intel/SuSe mcelog code, * released under the GNU General Public License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version.
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <stdio.h> #include <string.h> #include "ras-mce-handler.h" #include "bitfield.h" #define K8_MCE_THRESHOLD_BASE (MCE_EXTENDED_BANK + 1) /* MCE_AMD */ #define K8_MCE_THRESHOLD_TOP (K8_MCE_THRESHOLD_BASE + 6 * 9) #define K8_MCELOG_THRESHOLD_DRAM_ECC (4 * 9 + 0) #define K8_MCELOG_THRESHOLD_LINK (4 * 9 + 1) #define K8_MCELOG_THRESHOLD_L3_CACHE (4 * 9 + 2) #define K8_MCELOG_THRESHOLD_FBDIMM (4 * 9 + 3) static const char *k8bank[] = { "data cache", "instruction cache", "bus unit", "load/store unit", "northbridge", "fixed-issue reorder" }; static const char *k8threshold[] = { [0 ... K8_MCELOG_THRESHOLD_DRAM_ECC - 1] = "Unknown threshold counter", [K8_MCELOG_THRESHOLD_DRAM_ECC] = "MC4_MISC0 DRAM threshold", [K8_MCELOG_THRESHOLD_LINK] = "MC4_MISC1 Link threshold", [K8_MCELOG_THRESHOLD_L3_CACHE] = "MC4_MISC2 L3 Cache threshold", [K8_MCELOG_THRESHOLD_FBDIMM] = "MC4_MISC3 FBDIMM threshold", [K8_MCELOG_THRESHOLD_FBDIMM + 1 ... K8_MCE_THRESHOLD_TOP - K8_MCE_THRESHOLD_BASE - 1] = "Unknown threshold counter", }; static const char *transaction[] = { "instruction", "data", "generic", "reserved" }; static const char *cachelevel[] = { "0", "1", "2", "generic" }; static const char *memtrans[] = { "generic error", "generic read", "generic write", "data read", "data write", "instruction fetch", "prefetch", "evict", "snoop", "?", "?", "?", "?", "?", "?", "?"
}; static const char *partproc[] = { "local node origin", "local node response", "local node observed", "generic participation" }; static const char *timeout[] = { "request didn't time out", "request timed out" }; static const char *memoryio[] = { "memory", "res.", "i/o", "generic" }; static const char *nbextendederr[] = { "RAM ECC error", "CRC error", "Sync error", "Master abort", "Target abort", "GART error", "RMW error", "Watchdog error", "RAM Chipkill ECC error", "DEV Error", "Link Data Error", "Link Protocol Error", "NB Array Error", "DRAM Parity Error", "Link Retry", "Table Walk Data Error", "L3 Cache Data Error", "L3 Cache Tag Error", "L3 Cache LRU Error" }; static const char *highbits[32] = { [31] = "valid", [30] = "error overflow (multiple errors)", [29] = "error uncorrected", [28] = "error enable", [27] = "misc error valid", [26] = "error address valid", [25] = "processor context corrupt", [24] = "res24", [23] = "res23", /* 22-15 ecc syndrome bits */ [14] = "corrected ecc error", [13] = "uncorrected ecc error", [12] = "res12", [11] = "L3 subcache in error bit 1", [10] = "L3 subcache in error bit 0", [9] = "sublink or DRAM channel", [8] = "error found by scrub", /* 7-4 ht link number of error */ [3] = "err cpu3", [2] = "err cpu2", [1] = "err cpu1", [0] = "err cpu0", }; #define IGNORE_HIGHBITS ((1 << 31) || (1 << 28) || (1 << 26)) static void decode_k8_generic_errcode(struct mce_event *e) { char tmp_buf[4092]; unsigned short errcode = e->status & 0xffff; int n; /* Translate the highest bits */ n = bitfield_msg(tmp_buf, sizeof(tmp_buf), highbits, 32, IGNORE_HIGHBITS, 32, e->status); if (n) mce_snprintf(e->error_msg, "(%s) ", tmp_buf); if ((errcode & 0xfff0) == 0x0010) mce_snprintf(e->error_msg, "TLB error '%s transaction, level %s'", transaction[(errcode >> 2) & 3], cachelevel[errcode & 3]); else if ((errcode & 0xff00) == 0x0100) mce_snprintf(e->error_msg, "memory/cache error '%s mem transaction, %s transaction, level %s'", memtrans[(errcode >> 4) & 0xf],
transaction[(errcode >> 2) & 3], cachelevel[errcode & 3]); else if ((errcode & 0xf800) == 0x0800) mce_snprintf(e->error_msg, "bus error '%s, %s: %s mem transaction, %s access, level %s'", partproc[(errcode >> 9) & 0x3], timeout[(errcode >> 8) & 1], memtrans[(errcode >> 4) & 0xf], memoryio[(errcode >> 2) & 0x3], cachelevel[(errcode & 0x3)]); } static void decode_k8_dc_mc(struct mce_event *e) { unsigned short exterrcode = (e->status >> 16) & 0x0f; unsigned short errcode = e->status & 0xffff; if (e->status & (3ULL << 45)) { mce_snprintf(e->error_msg, "Data cache ECC error (syndrome %x)", (uint32_t)(e->status >> 47) & 0xff); if (e->status & (1ULL << 40)) mce_snprintf(e->error_msg, "found by scrubber"); } if ((errcode & 0xfff0) == 0x0010) mce_snprintf(e->error_msg, "TLB parity error in %s array", (exterrcode == 0) ? "physical" : "virtual"); } static void decode_k8_ic_mc(struct mce_event *e) { unsigned short exterrcode = (e->status >> 16) & 0x0f; unsigned short errcode = e->status & 0xffff; if (e->status & (3ULL << 45)) mce_snprintf(e->error_msg, "Instruction cache ECC error"); if ((errcode & 0xfff0) == 0x0010) mce_snprintf(e->error_msg, "TLB parity error in %s array", (exterrcode == 0) ? "physical" : "virtual"); } static void decode_k8_bu_mc(struct mce_event *e) { unsigned short exterrcode = (e->status >> 16) & 0x0f; if (e->status & (3ULL << 45)) mce_snprintf(e->error_msg, "L2 cache ECC error"); mce_snprintf(e->error_msg, "%s array error", !exterrcode ? 
"Bus or cache" : "Cache tag"); } static void decode_k8_nb_mc(struct mce_event *e, unsigned int *memerr) { unsigned short exterrcode = (e->status >> 16) & 0x0f; mce_snprintf(e->error_msg, "Northbridge %s", nbextendederr[exterrcode]); switch (exterrcode) { case 0: *memerr = 1; mce_snprintf(e->error_msg, "ECC syndrome = %x", (uint32_t)(e->status >> 47) & 0xff); break; case 8: *memerr = 1; mce_snprintf(e->error_msg, "Chipkill ECC syndrome = %x", (uint32_t)((((e->status >> 24) & 0xff) << 8) | ((e->status >> 47) & 0xff))); break; case 1: case 2: case 3: case 4: case 6: mce_snprintf(e->error_msg, "link number = %x", (uint32_t)(e->status >> 36) & 0xf); break; } } static void decode_k8_threashold(struct mce_event *e) { if (e->misc & MCI_THRESHOLD_OVER) mce_snprintf(e->error_msg, "Threshold error count overflow"); } static void bank_name(struct mce_event *e) { const char *s; if (e->bank < ARRAY_SIZE(k8bank)) s = k8bank[e->bank]; else if (e->bank >= K8_MCE_THRESHOLD_BASE && e->bank < K8_MCE_THRESHOLD_TOP) s = k8threshold[e->bank - K8_MCE_THRESHOLD_BASE]; else return; /* Use the generic parser for bank */ mce_snprintf(e->bank_name, "%s (bank=%d)", s, e->bank); } int parse_amd_k8_event(struct ras_events *ras, struct mce_event *e) { unsigned int ismemerr = 0; /* Don't handle GART errors */ if (e->bank == 4) { unsigned short exterrcode = (e->status >> 16) & 0x0f; if (exterrcode == 5 && (e->status & (1ULL << 61))) { return -1; } } bank_name(e); switch (e->bank) { case 0: decode_k8_dc_mc(e); decode_k8_generic_errcode(e); break; case 1: decode_k8_ic_mc(e); decode_k8_generic_errcode(e); break; case 2: decode_k8_bu_mc(e); decode_k8_generic_errcode(e); break; case 3: /* LS */ decode_k8_generic_errcode(e); break; case 4: decode_k8_nb_mc(e, &ismemerr); decode_k8_generic_errcode(e); break; case 5: /* FR */ decode_k8_generic_errcode(e); break; case K8_MCE_THRESHOLD_BASE ... 
K8_MCE_THRESHOLD_TOP: decode_k8_threashold(e); break; default: strcpy(e->error_msg, "Don't know how to decode this bank"); } /* IP doesn't matter on memory errors */ if (ismemerr) e->ip = 0; return 0; } 07070100000026000081A400000000000000000000000165C04BE400007FC0000000000000000000000000000000000000002E00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-amd-smca.c/* * Copyright (c) 2018, AMD, Inc. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #include <stdio.h> #include <string.h> #include "ras-mce-handler.h" #include "bitfield.h" /* MCA_STATUS REGISTER FOR FAMILY 17H *********************** Higher 32-bits ***************************** * 63: VALIDERROR, 62: OVERFLOW, 61: UC, 60: Err ENABLE, * 59: Misc Valid, 58: Addr Valid, 57: PCC, 56: ErrCoreID Valid, * 55: TCC, 54: RES, 53: Syndrome Valid, 52: Transparent, * 51: RES, 50: RES, 49: RES, 48: RES, * 47: RES, 46: CECC, 45: UECC, 44: Deferred, * 43: Poison, 42: RES, 41: RES, 40: RES, * 39: RES, 38: RES, 37: ErrCoreID[5], 36: ErrCoreID[4], * 35: ErrCoreID[3], 34: ErrCoreID[2], 33: ErrCoreID[1], 32: ErrCoreID[0] *********************** Lower 32-bits ****************************** * 31: RES, 30: RES, 29: RES, 28: RES, * 27: RES, 26: RES, 25: RES, 24: RES, * 23: RES, 22: RES, 21: XEC[5], 20: XEC[4], * 19: XEC[3], 18: XEC[2], 17: XEC[1], 16: XEC[0], * 15: EC[15], 14: EC[14], 13: EC[13], 12: EC[12], * 11: EC[11], 10: EC[10], 09: EC[9], 08: EC[8], * 07: EC[7], 06: EC[6], 05: EC[5], 04: EC[4], * 03: EC[3], 02: EC[2], 01: EC[1], 00: EC[0] */ /* MCA_STATUS REGISTER FOR FAMILY 19H * Bits 24 ~ 29 contain AddressLsb *
29: ADDRLS[5], 28: ADDRLS[4], 27: ADDRLS[3], * 26: ADDRLS[2], 25: ADDRLS[1], 24: ADDRLS[0] */ /* These may be used by multiple smca_hwid_mcatypes */ enum smca_bank_types { SMCA_LS = 0, /* Load Store */ SMCA_LS_V2, SMCA_IF, /* Instruction Fetch */ SMCA_L2_CACHE, /* L2 Cache */ SMCA_DE, /* Decoder Unit */ SMCA_RESERVED, /* Reserved */ SMCA_EX, /* Execution Unit */ SMCA_FP, /* Floating Point */ SMCA_L3_CACHE, /* L3 Cache */ SMCA_CS, /* Coherent Slave */ SMCA_CS_V2, SMCA_CS_V2_QUIRK, SMCA_PIE, /* Power, Interrupts, etc. */ SMCA_UMC, /* Unified Memory Controller */ SMCA_UMC_QUIRK, SMCA_UMC_V2, SMCA_MA_LLC, /* Memory Attached Last Level Cache */ SMCA_PB, /* Parameter Block */ SMCA_PSP, /* Platform Security Processor */ SMCA_PSP_V2, SMCA_SMU, /* System Management Unit */ SMCA_SMU_V2, SMCA_MP5, /* Microprocessor 5 Unit */ SMCA_MPDMA, /* MPDMA Unit */ SMCA_NBIO, /* Northbridge IO Unit */ SMCA_PCIE, /* PCI Express Unit */ SMCA_PCIE_V2, SMCA_XGMI_PCS, /* xGMI PCS Unit */ SMCA_NBIF, /* NBIF Unit */ SMCA_SHUB, /* System Hub Unit */ SMCA_SATA, /* SATA Unit */ SMCA_USB, /* USB Unit */ SMCA_USR_DP, /* Ultra Short Reach Data Plane Controller */ SMCA_USR_CP, /* Ultra Short Reach Control Plane Controller */ SMCA_GMI_PCS, /* GMI PCS Unit */ SMCA_XGMI_PHY, /* xGMI PHY Unit */ SMCA_WAFL_PHY, /* WAFL PHY Unit */ SMCA_GMI_PHY, /* GMI PHY Unit */ N_SMCA_BANK_TYPES }; /* Maximum number of MCA banks per CPU. 
*/ #define MAX_NR_BANKS 64 #define MCI_IPID_MCATYPE 0xFFFF0000 #define MCI_IPID_HWID 0xFFF /* Obtain HWID_MCATYPE Tuple on SMCA Systems */ #define HWID_MCATYPE(hwid, mcatype) (((hwid) << 16) | (mcatype)) /* * On Newer heterogeneous systems from AMD with CPU and GPU nodes connected * via xGMI links, the NON CPU Nodes are enumerated from index 8 */ #define NONCPU_NODE_INDEX 8 /* SMCA Extended error strings */ static const char * const smca_ls_mce_desc[] = { "Load queue parity", "Store queue parity", "Miss address buffer payload parity", "L1 TLB parity", "Reserved", "DC tag error type 6", "DC tag error type 1", "Internal error type 1", "Internal error type 2", "Sys Read data error thread 0", "Sys read data error thread 1", "DC tag error type 2", "DC data error type 1 (poison consumption)", "DC data error type 2", "DC data error type 3", "DC tag error type 4", "L2 TLB parity", "PDC parity error", "DC tag error type 3", "DC tag error type 5", "L2 fill data error", }; static const char * const smca_ls2_mce_desc[] = { "An ECC error was detected on a data cache read by a probe or victimization", "An ECC error or L2 poison was detected on a data cache read by a load", "An ECC error was detected on a data cache read-modify-write by a store", "An ECC error or poison bit mismatch was detected on a tag read by a probe or victimization", "An ECC error or poison bit mismatch was detected on a tag read by a load", "An ECC error or poison bit mismatch was detected on a tag read by a store", "An ECC error was detected on an EMEM read by a load", "An ECC error was detected on an EMEM read-modify-write by a store", "A parity error was detected in an L1 TLB entry by any access", "A parity error was detected in an L2 TLB entry by any access", "A parity error was detected in a PWC entry by any access", "A parity error was detected in an STQ entry by any access", "A parity error was detected in an LDQ entry by any access", "A parity error was detected in a MAB entry by any access", "A 
parity error was detected in an SCB entry state field by any access", "A parity error was detected in an SCB entry address field by any access", "A parity error was detected in an SCB entry data field by any access", "A parity error was detected in a WCB entry by any access", "A poisoned line was detected in an SCB entry by any access", "A SystemReadDataError error was reported on read data returned from L2 for a load", "A SystemReadDataError error was reported on read data returned from L2 for an SCB store", "A SystemReadDataError error was reported on read data returned from L2 for a WCB store", "A hardware assertion error was reported", "A parity error was detected in an STLF, SCB EMEM entry, store data mask or SRB store data by any access", }; static const char * const smca_if_mce_desc[] = { "microtag probe port parity error", "IC microtag or full tag multi-hit error", "IC full tag parity", "IC data array parity", "PRQ Parity Error", "L0 ITLB parity error", "L1-TLB parity error", "L2-TLB parity error", "BPQ snoop parity on Thread 0", "BPQ snoop parity on Thread 1", "BP L1-BTB Multi-Hit Error", "BP L2-BTB Multi-Hit Error", "L2 Cache Response Poison error", "L2 Cache Error Response", "Hardware Assertion Error", "L1-TLB Multi-Hit", "L2-TLB Multi-Hit", "BSR Parity Error", "CT MCE", }; static const char * const smca_l2_mce_desc[] = { "L2M Tag Multiple-Way-Hit error", "L2M Tag or State Array ECC Error", "L2M Data Array ECC Error", "Hardware Assert Error", "SDP Read Response Parity Error", }; static const char * const smca_de_mce_desc[] = { "Micro-op cache tag array parity error", "Micro-op cache data array parity error", "IBB Register File parity error", "Micro-op queue parity error", "Instruction dispatch queue parity error", "Fetch address FIFO parity error", "Patch RAM data parity error", "Patch RAM sequencer parity error", "Micro-op buffer parity error", "Hardware Assertion MCA Error", }; static const char * const smca_ex_mce_desc[] = { "Watchdog timeout error", 
"Physical register file parity error", "Flag register file parity error", "Immediate displacement register file parity error", "Address generator payload parity error", "EX payload parity error", "Checkpoint queue parity error", "Retire dispatch queue parity error", "Retire status queue parity error", "Scheduler queue parity error", "Branch buffer queue parity error", "Hardware Assertion error", "Spec Map parity error", "Retire Map parity error", }; static const char * const smca_fp_mce_desc[] = { "Physical register file (PRF) parity error", "Freelist (FL) parity error", "Schedule queue parity error", "NSQ parity error", "Retire queue (RQ) parity error", "Status register file (SRF) parity error", "Hardware assertion", "Physical K mask register file (KRF) parity error", }; static const char * const smca_l3_mce_desc[] = { "Shadow tag macro ECC error", "Shadow tag macro multi-way-hit error", "L3M tag ECC error", "L3M tag multi-way-hit error", "L3M data ECC error", "SDP Parity Error from XI", "L3 victim queue Data Fabric error", "L3 Hardware Assertion", "XI WCB Parity Poison Creation event", }; static const char * const smca_cs_mce_desc[] = { "Illegal request", "Address violation", "Security violation", "Illegal response", "Unexpected response", "Request or Probe Parity Error", "Read Response Parity Error", "Atomic request parity error", "Probe Filter ECC Error", }; static const char * const smca_cs2_mce_desc[] = { "Illegal Request", "Address Violation", "Security Violation", "Illegal Response", "Unexpected Response", "Request or Probe Parity Error", "Read Response Parity Error", "Atomic Request Parity Error", "SDP read response had no match in the CS queue", "Probe Filter Protocol Error", "Probe Filter ECC Error", "SDP read response had an unexpected RETRY error", "Counter overflow error", "Counter underflow error", "Illegal Request on the no data channel", "Address Violation on the no data channel", "Security Violation on the no data channel", "Hardware Assert 
Error", }; /* * Per Genoa's revision guide, erratum 1384, existing bit definitions * are reassigned for SMCA CS bank type. */ static const char * const smca_cs2_quirk_mce_desc[] = { "Illegal Request", "Address Violation", "Security Violation", "Illegal Response", "Unexpected Response", "Request or Probe Parity Error", "Read Response Parity Error", "Atomic Request Parity Error", "SDP read response had no match in the CS queue", "SDP read response had an unexpected RETRY error", "Counter overflow error", "Counter underflow error", "Probe Filter Protocol Error", "Probe Filter ECC Error", "Illegal Request on the no data channel", "Address Violation on the no data channel", "Security Violation on the no data channel", "Hardware Assert Error", }; static const char * const smca_pie_mce_desc[] = { "Hardware assert", "Register security violation", "Link error", "Poison data consumption", "A deferred error was detected in the DF", "Watch Dog Timer", "An SRAM ECC error was detected in the CNLI block", }; static const char * const smca_umc_mce_desc[] = { "DRAM ECC error", "Data poison error on DRAM", "SDP parity error", "Advanced peripheral bus error", "Command/address parity error", "Write data CRC error", "DCQ SRAM ECC error", "AES SRAM ECC error", "ECS Row Error", "ECS Error", "UMC Throttling Error", "Read CRC Error", }; static const char * const smca_umc_quirk_mce_desc[] = { "DRAM On Die ECC error", "Data poison error", "SDP parity error", "Reserved", "Address/Command parity error", "HBM Write data parity error", "Consolidated SRAM ECC error", "Reserved", "Reserved", "Rdb SRAM ECC error", "Thermal throttling", "HBM Read Data Parity error", "Reserved", "UMC FW Error", "SRAM Parity Error", "HBM CRC Error", }; static const char * const smca_umc2_mce_desc[] = { "DRAM ECC error", "Data poison error", "SDP parity error", "Reserved", "Address/Command parity error", "Write data parity error", "DCQ SRAM ECC error", "Reserved", "Read data parity error", "Rdb SRAM ECC error", "RdRsp 
SRAM ECC error", "LM32 MP errors", }; static const char * const smca_mall_mce_desc[] = { "Counter overflow error", "Counter underflow error", "Write Data Parity Error", "Read Response Parity Error", "Cache Tag ECC Error Macro 0", "Cache Tag ECC Error Macro 1", "Cache Data ECC Error" }; static const char * const smca_pb_mce_desc[] = { "An ECC error in the Parameter Block RAM array" }; static const char * const smca_psp_mce_desc[] = { "An ECC or parity error in a PSP RAM instance", }; static const char * const smca_psp2_mce_desc[] = { "High SRAM ECC or parity error", "Low SRAM ECC or parity error", "Instruction Cache Bank 0 ECC or parity error", "Instruction Cache Bank 1 ECC or parity error", "Instruction Tag Ram 0 parity error", "Instruction Tag Ram 1 parity error", "Data Cache Bank 0 ECC or parity error", "Data Cache Bank 1 ECC or parity error", "Data Cache Bank 2 ECC or parity error", "Data Cache Bank 3 ECC or parity error", "Data Tag Bank 0 parity error", "Data Tag Bank 1 parity error", "Data Tag Bank 2 parity error", "Data Tag Bank 3 parity error", "Dirty Data Ram parity error", "TLB Bank 0 parity error", "TLB Bank 1 parity error", "System Hub Read Buffer ECC or parity error", }; static const char * const smca_smu_mce_desc[] = { "An ECC or parity error in an SMU RAM instance", }; static const char * const smca_smu2_mce_desc[] = { "High SRAM ECC or parity error", "Low SRAM ECC or parity error", "Data Cache Bank A ECC or parity error", "Data Cache Bank B ECC or parity error", "Data Tag Cache Bank A ECC or parity error", "Data Tag Cache Bank B ECC or parity error", "Instruction Cache Bank A ECC or parity error", "Instruction Cache Bank B ECC or parity error", "Instruction Tag Cache Bank A ECC or parity error", "Instruction Tag Cache Bank B ECC or parity error", "System Hub Read Buffer ECC or parity error", }; static const char * const smca_mp5_mce_desc[] = { "High SRAM ECC or parity error", "Low SRAM ECC or parity error", "Data Cache Bank A ECC or parity error", 
"Data Cache Bank B ECC or parity error", "Data Tag Cache Bank A ECC or parity error", "Data Tag Cache Bank B ECC or parity error", "Instruction Cache Bank A ECC or parity error", "Instruction Cache Bank B ECC or parity error", "Instruction Tag Cache Bank A ECC or parity error", "Instruction Tag Cache Bank B ECC or parity error", }; static const char * const smca_mpdma_mce_desc[] = { "Main SRAM [31:0] bank ECC or parity error", "Main SRAM [63:32] bank ECC or parity error", "Main SRAM [95:64] bank ECC or parity error", "Main SRAM [127:96] bank ECC or parity error", "Data Cache Bank A ECC or parity error", "Data Cache Bank B ECC or parity error", "Data Tag Cache Bank A ECC or parity error", "Data Tag Cache Bank B ECC or parity error", "Instruction Cache Bank A ECC or parity error", "Instruction Cache Bank B ECC or parity error", "Instruction Tag Cache Bank A ECC or parity error", "Instruction Tag Cache Bank B ECC or parity error", "Data Cache Bank A ECC or parity error", "Data Cache Bank B ECC or parity error", "Data Tag Cache Bank A ECC or parity error", "Data Tag Cache Bank B ECC or parity error", "Instruction Cache Bank A ECC or parity error", "Instruction Cache Bank B ECC or parity error", "Instruction Tag Cache Bank A ECC or parity error", "Instruction Tag Cache Bank B ECC or parity error", "Data Cache Bank A ECC or parity error", "Data Cache Bank B ECC or parity error", "Data Tag Cache Bank A ECC or parity error", "Data Tag Cache Bank B ECC or parity error", "Instruction Cache Bank A ECC or parity error", "Instruction Cache Bank B ECC or parity error", "Instruction Tag Cache Bank A ECC or parity error", "Instruction Tag Cache Bank B ECC or parity error", "System Hub Read Buffer ECC or parity error", "MPDMA TVF DVSEC Memory ECC or parity error", "MPDMA TVF MMIO Mailbox0 ECC or parity error", "MPDMA TVF MMIO Mailbox1 ECC or parity error", "MPDMA TVF Doorbell Memory ECC or parity error", "MPDMA TVF SDP Slave Memory 0 ECC or parity error", "MPDMA TVF SDP Slave 
Memory 1 ECC or parity error", "MPDMA TVF SDP Slave Memory 2 ECC or parity error", "MPDMA TVF SDP Master Memory 0 ECC or parity error", "MPDMA TVF SDP Master Memory 1 ECC or parity error", "MPDMA TVF SDP Master Memory 2 ECC or parity error", "MPDMA TVF SDP Master Memory 3 ECC or parity error", "MPDMA TVF SDP Master Memory 4 ECC or parity error", "MPDMA TVF SDP Master Memory 5 ECC or parity error", "MPDMA TVF SDP Master Memory 6 ECC or parity error", "SDP Watchdog Timer expired", "MPDMA PTE Command FIFO ECC or parity error", "MPDMA PTE Hub Data FIFO ECC or parity error", "MPDMA PTE Internal Data FIFO ECC or parity error", "MPDMA PTE Command Memory DMA ECC or parity error", "MPDMA PTE Command Memory Internal ECC or parity error", }; static const char * const smca_nbio_mce_desc[] = { "ECC or Parity error", "PCIE error", "External SDP ErrEvent error", "SDP Egress Poison error", "Internal Poison error", "Internal system fatal error event", }; static const char * const smca_pcie_mce_desc[] = { "CCIX PER Message logging", "CCIX Read Response with Status: Non-Data Error", "CCIX Write Response with Status: Non-Data Error", "CCIX Read Response with Status: Data Error", "CCIX Non-okay write response with data error", }; static const char * const smca_pcie2_mce_desc[] = { "SDP Data Parity Error logging", }; static const char * const smca_xgmipcs_mce_desc[] = { "Data Loss Error", "Training Error", "Flow Control Acknowledge Error", "Rx Fifo Underflow Error", "Rx Fifo Overflow Error", "CRC Error", "BER Exceeded Error", "Tx Vcid Data Error", "Replay Buffer Parity Error", "Data Parity Error", "Replay Fifo Overflow Error", "Replay Fifo Underflow Error", "Elastic Fifo Overflow Error", "Deskew Error", "Flow Control CRC Error", "Data Startup Limit Error", "FC Init Timeout Error", "Recovery Timeout Error", "Ready Serial Timeout Error", "Ready Serial Attempt Error", "Recovery Attempt Error", "Recovery Relock Attempt Error", "Replay Attempt Error", "Sync Header Error", "Tx Replay Timeout 
Error", "Rx Replay Timeout Error", "LinkSub Tx Timeout Error", "LinkSub Rx Timeout Error", "Rx CMD Pocket Error", }; static const char * const smca_xgmiphy_mce_desc[] = { "RAM ECC Error", "ARC instruction buffer parity error", "ARC data buffer parity error", "PHY APB error", }; static const char * const smca_nbif_mce_desc[] = { "Timeout error from GMI", "SRAM ECC error", "NTB Error Event", "SDP Parity error", }; static const char * const smca_sata_mce_desc[] = { "Parity error for port 0", "Parity error for port 1", "Parity error for port 2", "Parity error for port 3", "Parity error for port 4", "Parity error for port 5", "Parity error for port 6", "Parity error for port 7", }; static const char * const smca_usb_mce_desc[] = { "Parity error or ECC error for S0 RAM0", "Parity error or ECC error for S0 RAM1", "Parity error or ECC error for S0 RAM2", "Parity error for PHY RAM0", "Parity error for PHY RAM1", "AXI Slave Response error", }; static const char * const smca_usrdp_mce_desc[] = { "Mst CMD Error", "Mst Rx FIFO Error", "Mst Deskew Error", "Mst Detect Timeout Error", "Mst FlowControl Error", "Mst DataValid FIFO Error", "Mac LinkState Error", "Deskew Error", "Init Timeout Error", "Init Attempt Error", "Recovery Timeout Error", "Recovery Attempt Error", "Eye Training Timeout Error", "Data Startup Limit Error", "LS0 Exit Error", "PLL powerState Update Timeout Error", "Rx FIFO Error", "Lcu Error", "Conv CECC Error", "Conv UECC Error", "Reserved", "Rx DataLoss Error", "Replay CECC Error", "Replay UECC Error", "CRC Error", "BER Exceeded Error", "FC Init Timeout Error", "FC Init Attempt Error", "Replay Timeout Error", "Replay Attempt Error", "Replay Underflow Error", "Replay Overflow Error", }; static const char * const smca_usrcp_mce_desc[] = { "Packet Type Error", "Rx FIFO Error", "Deskew Error", "Rx Detect Timeout Error", "Data Parity Error", "Data Loss Error", "Lcu Error", "HB1 Handshake Timeout Error", "HB2 Handshake Timeout Error", "Clk Sleep Rsp Timeout Error", 
"Clk Wake Rsp Timeout Error", "Reset Attack Error", "Remote Link Fatal Error", }; static const char * const smca_gmipcs_mce_desc[] = { "Data Loss Error", "Training Error", "Replay Parity Error", "Rx Fifo Underflow Error", "Rx Fifo Overflow Error", "CRC Error", "BER Exceeded Error", "Tx Fifo Underflow Error", "Replay Buffer Parity Error", "Tx Overflow Error", "Replay Fifo Overflow Error", "Replay Fifo Underflow Error", "Elastic Fifo Overflow Error", "Deskew Error", "Offline Error", "Data Startup Limit Error", "FC Init Timeout Error", "Recovery Timeout Error", "Ready Serial Timeout Error", "Ready Serial Attempt Error", "Recovery Attempt Error", "Recovery Relock Attempt Error", "Deskew Abort Error", "Rx Buffer Error", "Rx LFDS Fifo Overflow Error", "Rx LFDS Fifo Underflow Error", "LinkSub Tx Timeout Error", "LinkSub Rx Timeout Error", "Rx CMD Packet Error", "LFDS Training Timeout Error", "LFDS FC Init Timeout Error", "Data Loss Error", }; struct smca_mce_desc { const char * const *descs; unsigned int num_descs; }; static struct smca_mce_desc smca_mce_descs[] = { [SMCA_LS] = { smca_ls_mce_desc, ARRAY_SIZE(smca_ls_mce_desc) }, [SMCA_LS_V2] = { smca_ls2_mce_desc, ARRAY_SIZE(smca_ls2_mce_desc) }, [SMCA_IF] = { smca_if_mce_desc, ARRAY_SIZE(smca_if_mce_desc) }, [SMCA_L2_CACHE] = { smca_l2_mce_desc, ARRAY_SIZE(smca_l2_mce_desc) }, [SMCA_DE] = { smca_de_mce_desc, ARRAY_SIZE(smca_de_mce_desc) }, [SMCA_EX] = { smca_ex_mce_desc, ARRAY_SIZE(smca_ex_mce_desc) }, [SMCA_FP] = { smca_fp_mce_desc, ARRAY_SIZE(smca_fp_mce_desc) }, [SMCA_L3_CACHE] = { smca_l3_mce_desc, ARRAY_SIZE(smca_l3_mce_desc) }, [SMCA_CS] = { smca_cs_mce_desc, ARRAY_SIZE(smca_cs_mce_desc) }, [SMCA_CS_V2] = { smca_cs2_mce_desc, ARRAY_SIZE(smca_cs2_mce_desc) }, [SMCA_CS_V2_QUIRK] = { smca_cs2_quirk_mce_desc, ARRAY_SIZE(smca_cs2_quirk_mce_desc)}, [SMCA_PIE] = { smca_pie_mce_desc, ARRAY_SIZE(smca_pie_mce_desc) }, [SMCA_UMC] = { smca_umc_mce_desc, ARRAY_SIZE(smca_umc_mce_desc) }, [SMCA_UMC_QUIRK] = { 
smca_umc_quirk_mce_desc, ARRAY_SIZE(smca_umc_quirk_mce_desc) }, [SMCA_UMC_V2] = { smca_umc2_mce_desc, ARRAY_SIZE(smca_umc2_mce_desc) }, [SMCA_MA_LLC] = { smca_mall_mce_desc, ARRAY_SIZE(smca_mall_mce_desc) }, [SMCA_PB] = { smca_pb_mce_desc, ARRAY_SIZE(smca_pb_mce_desc) }, [SMCA_PSP] = { smca_psp_mce_desc, ARRAY_SIZE(smca_psp_mce_desc) }, [SMCA_PSP_V2] = { smca_psp2_mce_desc, ARRAY_SIZE(smca_psp2_mce_desc)}, [SMCA_SMU] = { smca_smu_mce_desc, ARRAY_SIZE(smca_smu_mce_desc) }, [SMCA_SMU_V2] = { smca_smu2_mce_desc, ARRAY_SIZE(smca_smu2_mce_desc)}, [SMCA_MP5] = { smca_mp5_mce_desc, ARRAY_SIZE(smca_mp5_mce_desc) }, [SMCA_MPDMA] = { smca_mpdma_mce_desc, ARRAY_SIZE(smca_mpdma_mce_desc) }, [SMCA_NBIO] = { smca_nbio_mce_desc, ARRAY_SIZE(smca_nbio_mce_desc)}, [SMCA_PCIE] = { smca_pcie_mce_desc, ARRAY_SIZE(smca_pcie_mce_desc)}, [SMCA_PCIE_V2] = { smca_pcie2_mce_desc, ARRAY_SIZE(smca_pcie2_mce_desc) }, [SMCA_XGMI_PCS] = { smca_xgmipcs_mce_desc, ARRAY_SIZE(smca_xgmipcs_mce_desc) }, /* NBIF and SHUB have the same error descriptions, for now. */ [SMCA_NBIF] = { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc) }, [SMCA_SHUB] = { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc) }, [SMCA_SATA] = { smca_sata_mce_desc, ARRAY_SIZE(smca_sata_mce_desc) }, [SMCA_USB] = { smca_usb_mce_desc, ARRAY_SIZE(smca_usb_mce_desc) }, [SMCA_USR_DP] = { smca_usrdp_mce_desc, ARRAY_SIZE(smca_usrdp_mce_desc) }, [SMCA_USR_CP] = { smca_usrcp_mce_desc, ARRAY_SIZE(smca_usrcp_mce_desc) }, [SMCA_GMI_PCS] = { smca_gmipcs_mce_desc, ARRAY_SIZE(smca_gmipcs_mce_desc) }, /* All the PHY bank types have the same error descriptions, for now. 
*/ [SMCA_XGMI_PHY] = { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc) }, [SMCA_WAFL_PHY] = { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc) }, [SMCA_GMI_PHY] = { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc) }, }; struct smca_hwid { unsigned int bank_type; /* Use with smca_bank_types for easy indexing.*/ uint32_t mcatype_hwid; /* mcatype,hwid bit 63-32 in MCx_IPID Register*/ }; static struct smca_hwid smca_hwid_mcatypes[] = { /* { bank_type, mcatype_hwid } */ /* ZN Core (HWID=0xB0) MCA types */ { SMCA_LS, HWID_MCATYPE(0xB0, 0x0) }, { SMCA_LS_V2, HWID_MCATYPE(0xB0, 0x10) }, { SMCA_IF, HWID_MCATYPE(0xB0, 0x1) }, { SMCA_L2_CACHE, HWID_MCATYPE(0xB0, 0x2) }, { SMCA_DE, HWID_MCATYPE(0xB0, 0x3) }, /* HWID 0xB0 MCATYPE 0x4 is Reserved */ { SMCA_EX, HWID_MCATYPE(0xB0, 0x5) }, { SMCA_FP, HWID_MCATYPE(0xB0, 0x6) }, { SMCA_L3_CACHE, HWID_MCATYPE(0xB0, 0x7) }, /* Data Fabric MCA types */ { SMCA_CS, HWID_MCATYPE(0x2E, 0x0) }, { SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) }, { SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) }, { SMCA_CS_V2_QUIRK, HWID_MCATYPE(0x0, 0x1) }, /* Unified Memory Controller MCA type */ { SMCA_UMC, HWID_MCATYPE(0x96, 0x0) }, { SMCA_UMC_QUIRK, HWID_MCATYPE(0x0, 0x2) }, /* Heterogeneous systems may have both UMC and UMC_v2 types on the same node. 
*/ { SMCA_UMC_V2, HWID_MCATYPE(0x96, 0x1) }, /* Memory Attached Last Level Cache */ { SMCA_MA_LLC, HWID_MCATYPE(0x2E, 0x4) }, /* Parameter Block MCA type */ { SMCA_PB, HWID_MCATYPE(0x05, 0x0) }, /* Platform Security Processor MCA type */ { SMCA_PSP, HWID_MCATYPE(0xFF, 0x0) }, { SMCA_PSP_V2, HWID_MCATYPE(0xFF, 0x1) }, /* System Management Unit MCA type */ { SMCA_SMU, HWID_MCATYPE(0x01, 0x0) }, { SMCA_SMU_V2, HWID_MCATYPE(0x01, 0x1) }, /* Microprocessor 5 Unit MCA type */ { SMCA_MP5, HWID_MCATYPE(0x01, 0x2) }, /* MPDMA MCA Type */ { SMCA_MPDMA, HWID_MCATYPE(0x01, 0x3) }, /* Northbridge IO Unit MCA type */ { SMCA_NBIO, HWID_MCATYPE(0x18, 0x0) }, /* PCI Express Unit MCA type */ { SMCA_PCIE, HWID_MCATYPE(0x46, 0x0) }, { SMCA_PCIE_V2, HWID_MCATYPE(0x46, 0x1) }, /* Ext Global Memory Interconnect PCS MCA type */ { SMCA_XGMI_PCS, HWID_MCATYPE(0x50, 0x0) }, { SMCA_NBIF, HWID_MCATYPE(0x6C, 0x0) }, { SMCA_SHUB, HWID_MCATYPE(0x80, 0x0) }, { SMCA_SATA, HWID_MCATYPE(0xA8, 0x0) }, { SMCA_USB, HWID_MCATYPE(0xAA, 0x0) }, /* Ultra Short Reach Data and Control Plane Controller */ { SMCA_USR_DP, HWID_MCATYPE(0x170, 0x0) }, { SMCA_USR_CP, HWID_MCATYPE(0x180, 0x0) }, { SMCA_GMI_PCS, HWID_MCATYPE(0x241, 0x0) }, /* Ext Global Memory Interconnect PHY MCA type */ { SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0) }, /* WAFL PHY MCA type */ { SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0) }, { SMCA_GMI_PHY, HWID_MCATYPE(0x269, 0x0) }, }; struct smca_bank_name { const char *name; }; static struct smca_bank_name smca_names[] = { [SMCA_LS ... SMCA_LS_V2] = { "Load Store Unit" }, [SMCA_IF] = { "Instruction Fetch Unit" }, [SMCA_L2_CACHE] = { "L2 Cache" }, [SMCA_DE] = { "Decode Unit" }, [SMCA_RESERVED] = { "Reserved" }, [SMCA_EX] = { "Execution Unit" }, [SMCA_FP] = { "Floating Point Unit" }, [SMCA_L3_CACHE] = { "L3 Cache" }, [SMCA_CS ... SMCA_CS_V2_QUIRK] = { "Coherent Slave" }, [SMCA_PIE] = { "Power, Interrupts, etc." }, [SMCA_UMC ... 
SMCA_UMC_QUIRK] = { "Unified Memory Controller" },
	[SMCA_UMC_V2] = { "Unified Memory Controller V2" },
	[SMCA_MA_LLC] = { "Memory Attached Last Level Cache" },
	[SMCA_PB] = { "Parameter Block" },
	[SMCA_PSP ... SMCA_PSP_V2] = { "Platform Security Processor" },
	[SMCA_SMU ... SMCA_SMU_V2] = { "System Management Unit" },
	[SMCA_MP5] = { "Microprocessor 5 Unit" },
	[SMCA_MPDMA] = { "MPDMA Unit" },
	[SMCA_NBIO] = { "Northbridge IO Unit" },
	[SMCA_PCIE ... SMCA_PCIE_V2] = { "PCI Express Unit" },
	[SMCA_XGMI_PCS] = { "Ext Global Memory Interconnect PCS Unit" },
	[SMCA_NBIF] = { "NBIF Unit" },
	[SMCA_SHUB] = { "System Hub Unit" },
	[SMCA_SATA] = { "SATA Unit" },
	[SMCA_USB] = { "USB Unit" },
	[SMCA_USR_DP] = { "Ultra Short Reach Data Plane Controller" },
	[SMCA_USR_CP] = { "Ultra Short Reach Control Plane Controller" },
	[SMCA_GMI_PCS] = { "Global Memory Interconnect PCS Unit" },
	[SMCA_XGMI_PHY] = { "Ext Global Memory Interconnect PHY Unit" },
	[SMCA_WAFL_PHY] = { "WAFL PHY Unit" },
	[SMCA_GMI_PHY] = { "Global Memory Interconnect PHY Unit" },
};

void amd_decode_errcode(struct mce_event *e)
{
	decode_amd_errcode(e);

	if (e->status & MCI_STATUS_POISON)
		mce_snprintf(e->mcistatus_msg, "Poison consumed");

	if (e->status & MCI_STATUS_TCC)
		mce_snprintf(e->mcistatus_msg, "Task_context_corrupt");
}

/*
 * To find the UMC channel represented by this bank we need to match on its
 * instance_id. The instance_id of a bank is held in the lower 32 bits of its
 * IPID.
 */
static int find_umc_channel(struct mce_event *e)
{
	return EXTRACT(e->ipid, 0, 31) >> 20;
}

/*
 * On a non-CPU node, the HBM channel managed by a UMCCH is derived from
 * bits [15:12] of the IPID. Channels behind odd-numbered UMCs are offset
 * by 4.
 */
static int find_hbm_channel(struct mce_event *e)
{
	int umc, tmp;

	umc = EXTRACT(e->ipid, 0, 31) >> 20;

	/* IPID bits [15:12] select the HBM channel within this UMC */
	tmp = ((e->ipid >> 12) & 0xf);

	return (umc % 2) ? tmp + 4 : tmp;
}

static inline void fixup_hwid(struct mce_priv *m, uint32_t *hwid_mcatype)
{
	if (m->family == 0x19) {
		switch (m->model) {
		/*
		 * Per Genoa's revision guide, erratum 1384, some SMCA Extended
		 * Error Codes and SMCA Control bits are incorrect for SMCA CS
		 * bank type.
		 */
		case 0x10 ... 0x1F:
		case 0x60 ... 0x7B:
		case 0xA0 ... 0xAF:
			if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
				*hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
			break;
		case 0x90 ... 0x9F:
			if (*hwid_mcatype == HWID_MCATYPE(0x96, 0x0))
				*hwid_mcatype = HWID_MCATYPE(0x0, 0x2);
			break;
		default:
			break;
		}
	} else if (m->family == 0x1A) {
		switch (m->model) {
		case 0x40 ... 0x4F:
			if (*hwid_mcatype == HWID_MCATYPE(0x2E, 0x2))
				*hwid_mcatype = HWID_MCATYPE(0x0, 0x1);
			break;
		default:
			break;
		}
	}
}

/* Decode extended errors according to Scalable MCA specification */
void decode_smca_error(struct mce_event *e, struct mce_priv *m)
{
	enum smca_bank_types bank_type;
	const char *ip_name;
	uint32_t mcatype_hwid = 0;
	unsigned short xec = (e->status >> 16) & 0x3f;
	const struct smca_hwid *s_hwid;
	uint32_t ipid_high = EXTRACT(e->ipid, 32, 63);
	uint8_t mcatype_instancehi = EXTRACT(e->ipid, 44, 47);
	unsigned int csrow = -1, channel = -1;
	unsigned int i;

	mcatype_hwid = HWID_MCATYPE(ipid_high & MCI_IPID_HWID,
				    (ipid_high & MCI_IPID_MCATYPE) >> 16);

	fixup_hwid(m, &mcatype_hwid);

	for (i = 0; i < ARRAY_SIZE(smca_hwid_mcatypes); i++) {
		s_hwid = &smca_hwid_mcatypes[i];
		if (mcatype_hwid == s_hwid->mcatype_hwid) {
			bank_type = s_hwid->bank_type;
			break;
		}
	}

	/*
	 * No table entry matched: banks on non-CPU nodes are decoded as
	 * UMC_V2, anything else cannot be identified.
	 */
	if (i >= ARRAY_SIZE(smca_hwid_mcatypes)) {
		if (mcatype_instancehi >= NONCPU_NODE_INDEX) {
			bank_type = SMCA_UMC_V2;
		} else {
			strcpy(e->mcastatus_msg, "Couldn't find bank type with IPID");
			return;
		}
	}

	if (bank_type >= N_SMCA_BANK_TYPES) {
		strcpy(e->mcastatus_msg, "Don't know how to decode this bank");
		return;
	}

	if (bank_type == SMCA_RESERVED) {
		strcpy(e->mcastatus_msg, "Bank 4 is reserved.\n");
		return;
	}

	ip_name = smca_names[bank_type].name;

	mce_snprintf(e->bank_name, "%s (bank=%d)", ip_name, e->bank);

	/* Only print the description of a valid extended error code */
	if (xec < smca_mce_descs[bank_type].num_descs)
		mce_snprintf(e->mcastatus_msg, "%s. Ext Err Code: %d",
			     smca_mce_descs[bank_type].descs[xec], xec);

	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_QUIRK) && xec == 0) {
		if ((m->family == 0x19) && (m->model >= 0x90 && m->model <= 0x9f)) {
			/* MCA_IPID[InstanceIdHi] gives the AMD Node Die ID */
			mce_snprintf(e->mc_location, "memory_die_id=%d",
				     mcatype_instancehi / 4);
		} else {
			channel = find_umc_channel(e);
			csrow = e->synd & 0x7; /* Bits 0-2 */
			mce_snprintf(e->mc_location, "memory_channel=%d,csrow=%d",
				     channel, csrow);
		}
	}

	if (bank_type == SMCA_UMC_V2 && xec == 0) {
		/* On non-CPU nodes, the UMCPHY is reported as the csrow */
		csrow = find_umc_channel(e) / 2;
		/* The UMCCH manages the HBM memory */
		channel = find_hbm_channel(e);
		mce_snprintf(e->mc_location, "memory_channel=%d,csrow=%d",
			     channel, csrow);
	}

	if (e->vdata_len) {
		uint64_t smca_config = e->vdata[2];

		/*
		 * Bit 9 of the CONFIG register of some SMCA bank types indicates
		 * the presence of FRU text in the SYND1/SYND2 registers.
		 */
		if (smca_config & BIT(9))
			memcpy(e->frutext, e->vdata, 16);
	}
}

int parse_amd_smca_event(struct ras_events *ras, struct mce_event *e)
{
	uint64_t mcgstatus = e->mcgstatus;

	mce_snprintf(e->mcgstatus_msg, "mcgstatus=%lld",
		     (long long)e->mcgstatus);

	if (mcgstatus & MCG_STATUS_RIPV)
		mce_snprintf(e->mcgstatus_msg, "RIPV");
	if (mcgstatus & MCG_STATUS_EIPV)
		mce_snprintf(e->mcgstatus_msg, "EIPV");
	if (mcgstatus & MCG_STATUS_MCIP)
		mce_snprintf(e->mcgstatus_msg, "MCIP");

	decode_smca_error(e, ras->mce_priv);
	amd_decode_errcode(e);
	return 0;
}
07070100000027000081A400000000000000000000000165C04BE400000F5A000000000000000000000000000000000000002900000000rasdaemon-0.8.0.49.git+f9cb13b/mce-amd.c
/*
 * Copyright (c) 2018, The AMD, Inc. All rights reserved.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 and
 * only version 2 as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 */

#include <stdio.h>
#include <string.h>

#include "ras-mce-handler.h"

/* Error Code Types */
#define TLB_ERROR(x)	(((x) & 0xFFF0) == 0x0010)
#define MEM_ERROR(x)	(((x) & 0xFF00) == 0x0100)
#define BUS_ERROR(x)	(((x) & 0xF800) == 0x0800)
#define INT_ERROR(x)	(((x) & 0xF4FF) == 0x0400)

/* Error code: transaction type (TT) */
static char *transaction[] = {
	"instruction", "data", "generic", "reserved"
};

/* Error codes: cache level (LL) */
static char *cachelevel[] = {
	"reserved", "L1", "L2", "L3/generic"
};

/* Error codes: memory transaction type (RRRR) */
static char *memtrans[] = {
	"generic", "generic read", "generic write", "data read",
	"data write", "instruction fetch", "prefetch", "evict", "snoop",
	"?", "?", "?", "?", "?", "?", "?"
};

/* Participation Processor */
static char *partproc[] = {
	"local node origin", "local node response",
	"local node observed", "generic participation"
};

/* Timeout */
static char *timeout[] = {
	"request didn't time out",
	"request timed out"
};

/* internal unclassified error code */
static char *internal[] = {
	"reserved", "reserved", "hardware assert", "reserved"
};

#define TT(x)		(((x) >> 2) & 0x3)	/* bits 2-3 */
#define TT_MSG(x)	transaction[TT(x)]
#define LL(x)		((x) & 0x3)		/* bits 0-1 */
#define LL_MSG(x)	cachelevel[LL(x)]
#define R4(x)		(((x) >> 4) & 0xF)	/* bits 4-7 */
#define R4_MSG(x)	((R4(x) < 9) ? memtrans[R4(x)] : "Wrong R4!")
#define TO(x)		(((x) >> 8) & 0x1)	/* bit 8 */
#define TO_MSG(x)	timeout[TO(x)]
#define PP(x)		(((x) >> 9) & 0x3)	/* bits 9-10 */
#define PP_MSG(x)	partproc[PP(x)]
#define UU(x)		(((x) >> 8) & 0x3)	/* bits 8-9 */
#define UU_MSG(x)	internal[UU(x)]

void decode_amd_errcode(struct mce_event *e)
{
	uint16_t ec = e->status & 0xffff;
	uint16_t ecc = (e->status >> 45) & 0x3;

	/* The severity classes are mutually exclusive: pick exactly one */
	if (e->status & MCI_STATUS_UC) {
		if (e->status & MCI_STATUS_PCC)
			strcpy(e->error_msg, "System Fatal error.");
		else if (e->mcgstatus & MCG_STATUS_RIPV)
			strcpy(e->error_msg, "Uncorrected, software restartable error.");
		else
			strcpy(e->error_msg, "Uncorrected, software containable error.");
	} else if (e->status & MCI_STATUS_DEFERRED) {
		strcpy(e->error_msg, "Deferred error, no action required.");
	} else {
		strcpy(e->error_msg, "Corrected error, no action required.");
	}

	if (!(e->status & MCI_STATUS_VAL))
		mce_snprintf(e->mcistatus_msg, "MCE_INVALID");

	if (e->status & MCI_STATUS_OVER)
		mce_snprintf(e->mcistatus_msg, "Error_overflow");

	if (e->status & MCI_STATUS_PCC)
		mce_snprintf(e->mcistatus_msg, "Processor_context_corrupt");

	if (ecc)
		mce_snprintf(e->mcistatus_msg, "%sECC", ((ecc == 2) ? "C" : "U"));

	if (INT_ERROR(ec)) {
		mce_snprintf(e->mcastatus_msg, "Internal '%s'", UU_MSG(ec));
		return;
	}

	if (TLB_ERROR(ec))
		mce_snprintf(e->mcastatus_msg, "TLB Error 'tx: %s, level: %s'",
			     TT_MSG(ec), LL_MSG(ec));
	else if (MEM_ERROR(ec))
		mce_snprintf(e->mcastatus_msg,
			     "Memory Error 'mem-tx: %s, tx: %s, level: %s'",
			     R4_MSG(ec), TT_MSG(ec), LL_MSG(ec));
	else if (BUS_ERROR(ec))
		mce_snprintf(e->mcastatus_msg,
			     "Bus Error '%s, %s, mem-tx: %s, level: %s'",
			     PP_MSG(ec), TO_MSG(ec), R4_MSG(ec), LL_MSG(ec));
}
07070100000028000081A400000000000000000000000165C04BE400001157000000000000000000000000000000000000003800000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-broadwell-de.c
/*
 * The code below came from Tony Luck's mcelog code,
 * released under GNU Public General License, v.2
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* See IA32 SDM Vol3B Table 16-24 */ static char *pcu_1[] = { [0x00] = "No Error", [0x09] = "MC_MESSAGE_CHANNEL_TIMEOUT", [0x13] = "MC_DMI_TRAINING_TIMEOUT", [0x15] = "MC_DMI_CPU_RESET_ACK_TIMEOUT", [0x1E] = "MC_VR_ICC_MAX_LT_FUSED_ICC_MAX", [0x25] = "MC_SVID_COMMAN_TIMEOUT", [0x26] = "MCA_PKGC_DIRECT_WAKE_RING_TIMEOUT", [0x29] = "MC_VR_VOUT_MAC_LT_FUSED_SVID", [0x2B] = "MC_PKGC_WATCHDOG_HANG_CBZ_DOWN", [0x2C] = "MC_PKGC_WATCHDOG_HANG_CBZ_UP", [0x44] = "MC_CRITICAL_VR_FAILED", [0x46] = "MC_VID_RAMP_DOWN_FAILED", [0x49] = "MC_SVID_WRITE_REG_VOUT_MAX_FAILED", [0x4B] = "MC_BOOT_VID_TIMEOUT_DRAM_0", [0x4F] = "MC_SVID_COMMAND_ERROR", [0x52] = "MC_FIVR_CATAS_OVERVOL_FAULT", [0x53] = "MC_FIVR_CATAS_OVERCUR_FAULT", [0x57] = "MC_SVID_PKGC_REQUEST_FAILED", [0x58] = "MC_SVID_IMON_REQUEST_FAILED", [0x59] = "MC_SVID_ALERT_REQUEST_FAILED", [0x62] = "MC_INVALID_PKGS_RSP_QPI", [0x64] = "MC_INVALID_PKG_STATE_CONFIG", [0x67] = "MC_HA_IMC_RW_BLOCK_ACK_TIMEOUT", [0x6A] = "MC_MSGCH_PMREQ_CMP_TIMEOUT", [0x72] = "MC_WATCHDOG_TIMEOUT_PKGS_MASTER", [0x81] = "MC_RECOVERABLE_DIE_THERMAL_TOO_HOT" }; static struct field pcu_mc4[] = { FIELD(24, pcu_1), {} }; /* See IA32 SDM Vol3B Table 16-18 */ static struct field memctrl_mc9[] = { SBITFIELD(16, "Address parity error"), SBITFIELD(17, "HA Wrt buffer Data parity error"), SBITFIELD(18, "HA Wrt byte enable parity error"), SBITFIELD(19, "Corrected patrol scrub error"), SBITFIELD(20, "Uncorrected patrol scrub error"), SBITFIELD(21, "Corrected spare error"), SBITFIELD(22, "Uncorrected spare error"), SBITFIELD(23, "Corrected memory read error"), SBITFIELD(24, "iMC, WDB, parity errors"), {} }; void broadwell_de_decode_model(struct ras_events *ras, struct 
mce_event *e)
{
	uint64_t status = e->status;
	uint32_t mca = status & 0xffff;
	unsigned int rank0 = -1, rank1 = -1, chan;

	switch (e->bank) {
	case 4:
		switch (EXTRACT(status, 0, 15) & ~(1ull << 12)) {
		case 0x402: case 0x403:
			mce_snprintf(e->mcastatus_msg, "Internal errors ");
			break;
		case 0x406:
			mce_snprintf(e->mcastatus_msg, "Intel TXT errors ");
			break;
		case 0x407:
			mce_snprintf(e->mcastatus_msg, "Other UBOX Internal errors ");
			break;
		}
		if (EXTRACT(status, 16, 19) & 3)
			mce_snprintf(e->mcastatus_msg, "PCU internal error ");
		if (EXTRACT(status, 20, 23) & 4)
			mce_snprintf(e->mcastatus_msg, "Ubox error ");
		decode_bitfield(e, status, pcu_mc4);
		break;
	case 9: case 10:
		mce_snprintf(e->mcastatus_msg, "MemCtrl: ");
		decode_bitfield(e, status, memctrl_mc9);
		break;
	}

	/*
	 * Memory-controller-specific decoding. Return early if the error
	 * did not come from a memory controller.
	 */

	/* Check if the error is at the memory controller */
	if ((mca >> 7) != 1)
		return;

	/* Ignore unless this is a corrected extended error from an iMC bank */
	if (e->bank < 9 || e->bank > 16 || (status & MCI_STATUS_UC) ||
	    !test_prefix(7, status & 0xefff))
		return;

	/*
	 * Parse the reported channel and ranks
	 */

	chan = EXTRACT(status, 0, 3);
	if (chan == 0xf)
		return;

	mce_snprintf(e->mc_location, "memory_channel=%d", chan);

	if (EXTRACT(e->misc, 62, 62)) {
		rank0 = EXTRACT(e->misc, 46, 50);
		if (EXTRACT(e->misc, 63, 63))
			rank1 = EXTRACT(e->misc, 51, 55);
	}

	/*
	 * FIXME: Converting a rank to a DIMM requires parsing the
	 * DMI tables and calling failrank2dimm().
	 */
	if (rank0 != -1 && rank1 != -1)
		mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1);
	else if (rank0 != -1)
		mce_snprintf(e->mc_location, "rank=%d", rank0);
}
07070100000029000081A400000000000000000000000165C04BE4000017F0000000000000000000000000000000000000003A00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-broadwell-epex.c
/*
 * The code below came from Tony Luck's mcelog code,
 * released under GNU Public General License, v.2
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

#include <string.h>
#include <stdio.h>

#include "ras-mce-handler.h"
#include "bitfield.h"

/* See IA32 SDM Vol3B Table 16-20 */

static char *pcu_1[] = {
	[0x00] = "No Error",
	[0x09] = "MC_MESSAGE_CHANNEL_TIMEOUT",
	[0x0D] = "MC_IMC_FORCE_SR_S3_TIMEOUT",
	[0x0E] = "MC_CPD_UNCPD_SD_TIMEOUT",
	[0x13] = "MC_DMI_TRAINING_TIMEOUT",
	[0x15] = "MC_DMI_CPU_RESET_ACK_TIMEOUT",
	[0x1E] = "MC_VR_ICC_MAX_LT_FUSED_ICC_MAX",
	[0x25] = "MC_SVID_COMMAN_TIMEOUT",
	[0x29] = "MC_VR_VOUT_MAC_LT_FUSED_SVID",
	[0x2B] = "MC_PKGC_WATCHDOG_HANG_CBZ_DOWN",
	[0x2C] = "MC_PKGC_WATCHDOG_HANG_CBZ_UP",
	[0x39] = "MC_PKGC_WATCHDOG_HANG_C3_UP_SF",
	[0x44] = "MC_CRITICAL_VR_FAILED",
	[0x45] = "MC_ICC_MAX_NOTSUPPORTED",
	[0x46] = "MC_VID_RAMP_DOWN_FAILED",
	[0x47] = "MC_EXCL_MODE_NO_PMREQ_CMP",
	[0x48] = "MC_SVID_READ_REG_ICC_MAX_FAILED",
	[0x49] = "MC_SVID_WRITE_REG_VOUT_MAX_FAILED",
	[0x4B]
= "MC_BOOT_VID_TIMEOUT_DRAM_0", [0x4C] = "MC_BOOT_VID_TIMEOUT_DRAM_1", [0x4D] = "MC_BOOT_VID_TIMEOUT_DRAM_2", [0x4E] = "MC_BOOT_VID_TIMEOUT_DRAM_3", [0x4F] = "MC_SVID_COMMAND_ERROR", [0x52] = "MC_FIVR_CATAS_OVERVOL_FAULT", [0x53] = "MC_FIVR_CATAS_OVERCUR_FAULT", [0x57] = "MC_SVID_PKGC_REQUEST_FAILED", [0x58] = "MC_SVID_IMON_REQUEST_FAILED", [0x59] = "MC_SVID_ALERT_REQUEST_FAILED", [0x60] = "MC_INVALID_PKGS_REQ_PCH", [0x61] = "MC_INVALID_PKGS_REQ_QPI", [0x62] = "MC_INVALID_PKGS_RSP_QPI", [0x63] = "MC_INVALID_PKGS_RSP_PCH", [0x64] = "MC_INVALID_PKG_STATE_CONFIG", [0x67] = "MC_HA_IMC_RW_BLOCK_ACK_TIMEOUT", [0x68] = "MC_IMC_RW_SMBUS_TIMEOUT", [0x69] = "MC_HA_FAILSTS_CHANGE_DETECTED", [0x6A] = "MC_MSGCH_PMREQ_CMP_TIMEOUT", [0x70] = "MC_WATCHDOG_TIMEOUT_PKGC_SLAVE", [0x71] = "MC_WATCHDOG_TIMEOUT_PKGC_MASTER", [0x72] = "MC_WATCHDOG_TIMEOUT_PKGS_MASTER", [0x7C] = "MC_BIOS_RST_CPL_INVALID_SEQ", [0x7D] = "MC_MORE_THAN_ONE_TXT_AGENT", [0x81] = "MC_RECOVERABLE_DIE_THERMAL_TOO_HOT" }; static struct field pcu_mc4[] = { FIELD(24, pcu_1), {} }; /* See IA32 SDM Vol3B Table 16-21 */ static char *qpi[] = { [0x02] = "Intel QPI physical layer detected drift buffer alarm", [0x03] = "Intel QPI physical layer detected latency buffer rollover", [0x10] = "Intel QPI link layer detected control error from R3QPI", [0x11] = "Rx entered LLR abort state on CRC error", [0x12] = "Unsupported or undefined packet", [0x13] = "Intel QPI link layer control error", [0x15] = "RBT used un-initialized value", [0x20] = "Intel QPI physical layer detected a QPI in-band reset but aborted initialization", [0x21] = "Link failover data self healing", [0x22] = "Phy detected in-band reset (no width change)", [0x23] = "Link failover clock failover", [0x30] = "Rx detected CRC error - successful LLR after Phy re-init", [0x31] = "Rx detected CRC error - successful LLR without Phy re-init", }; static struct field qpi_mc[] = { FIELD(16, qpi), {} }; /* See IA32 SDM Vol3B Table 16-26 */ static struct field memctrl_mc9[] = { 
	SBITFIELD(16, "DDR3 address parity error"),
	SBITFIELD(17, "Uncorrected HA write data error"),
	SBITFIELD(18, "Uncorrected HA data byte enable error"),
	SBITFIELD(19, "Corrected patrol scrub error"),
	SBITFIELD(20, "Uncorrected patrol scrub error"),
	SBITFIELD(21, "Corrected spare error"),
	SBITFIELD(22, "Uncorrected spare error"),
	SBITFIELD(24, "iMC write data buffer parity error"),
	SBITFIELD(25, "DDR4 command address parity error"),
	{}
};

void broadwell_epex_decode_model(struct ras_events *ras, struct mce_event *e)
{
	uint64_t status = e->status;
	uint32_t mca = status & 0xffff;
	unsigned int rank0 = -1, rank1 = -1, chan;

	switch (e->bank) {
	case 4:
		switch (EXTRACT(status, 0, 15) & ~(1ull << 12)) {
		case 0x402: case 0x403:
			mce_snprintf(e->mcastatus_msg, "Internal errors ");
			break;
		case 0x406:
			mce_snprintf(e->mcastatus_msg, "Intel TXT errors ");
			break;
		case 0x407:
			mce_snprintf(e->mcastatus_msg, "Other UBOX Internal errors ");
			break;
		}
		if (EXTRACT(status, 16, 19))
			mce_snprintf(e->mcastatus_msg, "PCU internal error ");
		decode_bitfield(e, status, pcu_mc4);
		break;
	case 5:
	case 20:
	case 21:
		mce_snprintf(e->mcastatus_msg, "QPI: ");
		decode_bitfield(e, status, qpi_mc);
		break;
	case 9: case 10: case 11: case 12:
	case 13: case 14: case 15: case 16:
		mce_snprintf(e->mcastatus_msg, "MemCtrl: ");
		decode_bitfield(e, status, memctrl_mc9);
		break;
	}

	/*
	 * Memory-controller-specific decoding. Return early if the error
	 * did not come from a memory controller.
	 */

	/* Check if the error is at the memory controller */
	if ((mca >> 7) != 1)
		return;

	/* Ignore unless this is a corrected extended error from an iMC bank */
	if (e->bank < 9 || e->bank > 16 || (status & MCI_STATUS_UC) ||
	    !test_prefix(7, status & 0xefff))
		return;

	/*
	 * Parse the reported channel and ranks
	 */

	chan = EXTRACT(status, 0, 3);
	if (chan == 0xf)
		return;

	mce_snprintf(e->mc_location, "memory_channel=%d", chan);

	if (EXTRACT(e->misc, 62, 62)) {
		rank0 = EXTRACT(e->misc, 46, 50);
		if (EXTRACT(e->misc, 63, 63))
			rank1 = EXTRACT(e->misc, 51, 55);
	}

	/*
	 * FIXME: Converting a rank to a DIMM requires parsing the
	 * DMI tables and calling failrank2dimm().
	 */
	if (rank0 != -1 && rank1 != -1)
		mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1);
	else if (rank0 != -1)
		mce_snprintf(e->mc_location, "rank=%d", rank0);
}
0707010000002A000081A400000000000000000000000165C04BE400000D9B000000000000000000000000000000000000003600000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-dunnington.c
/*
 * The code below came from Andi Kleen/Intel/SuSe mcelog code,
 * released under GNU Public General License, v.2
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* Follows Intel IA32 SDM 3b Appendix E.2.1 ++ */ static struct field dunnington_bus_status[] = { SBITFIELD(16, "Parity error detected during FSB request phase"), FIELD_NULL(17), SBITFIELD(20, "Hard Failure response received for a local transaction"), SBITFIELD(21, "Parity error on FSB response field detected"), SBITFIELD(22, "Parity data error on inbound data detected"), FIELD_NULL(23), FIELD_NULL(25), FIELD_NULL(28), FIELD_NULL(31), {} }; static char *dnt_front_error[0xf] = { [0x1] = "Inclusion error from core 0", [0x2] = "Inclusion error from core 1", [0x3] = "Write Exclusive error from core 0", [0x4] = "Write Exclusive error from core 1", [0x5] = "Inclusion error from FSB", [0x6] = "SNP stall error from FSB", [0x7] = "Write stall error from FSB", [0x8] = "FSB Arbiter Timeout error", [0xA] = "Inclusion error from core 2", [0xB] = "Write exclusive error from core 2", }; static char *dnt_int_error[0xf] = { [0x2] = "Internal timeout error", [0x3] = "Internal timeout error", [0x4] = "Intel Cache Safe Technology Queue full error\n" "or disabled ways in a set overflow", [0x5] = "Quiet cycle timeout error (correctable)", }; struct field dnt_int_status[] = { FIELD(8, dnt_int_error), {} }; struct field dnt_front_status[] = { FIELD(0, dnt_front_error), {} }; struct field dnt_cecc[] = { SBITFIELD(1, "Correctable ECC event on outgoing core 0 data"), SBITFIELD(2, "Correctable ECC event on outgoing core 1 data"), SBITFIELD(3, "Correctable ECC event on outgoing core 2 data"), {} }; struct field dnt_uecc[] = { SBITFIELD(1, "Uncorrectable ECC event on outgoing core 0 data"), SBITFIELD(2, "Uncorrectable ECC event on outgoing core 1 data"), SBITFIELD(3, "Uncorrectable ECC event on 
outgoing core 2 data"), {} }; static void dunnington_decode_bus(struct mce_event *e, uint64_t status) { decode_bitfield(e, status, dunnington_bus_status); } static void dunnington_decode_internal(struct mce_event *e, uint64_t status) { uint32_t mca = (status >> 16) & 0xffff; if ((mca & 0xfff0) == 0) decode_bitfield(e, mca, dnt_front_status); else if ((mca & 0xf0ff) == 0) decode_bitfield(e, mca, dnt_int_status); else if ((mca & 0xfff0) == 0xc000) decode_bitfield(e, mca, dnt_cecc); else if ((mca & 0xfff0) == 0xe000) decode_bitfield(e, mca, dnt_uecc); } void dunnington_decode_model(struct mce_event *e) { uint64_t status = e->status; if ((status & 0xffff) == 0xe0f) dunnington_decode_bus(e, status); else if ((status & 0xffff) == (1 << 10)) dunnington_decode_internal(e, status); } 0707010000002B000081A400000000000000000000000165C04BE4000017D1000000000000000000000000000000000000003300000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-haswell.c/* * The code below came from Tony Luck's mcelog code, * released under the GNU General Public License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details.
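`dunnington_decode_model()` dispatches on the low 16 bits of MCi_STATUS (MCACOD): 0xe0f selects the FSB bus-status bitfields, and 0x400 (`1 << 10`) selects `dunnington_decode_internal()`, which then tests the upper 16 bits in a fixed order. A hedged sketch of that classification follows; the `enum dnt_class` labels and `dnt_classify()` are hypothetical, and only the masks and their order come from the code.

```c
#include <stdint.h>

/* Hypothetical labels for the decode paths chosen by dunnington_decode_model(). */
enum dnt_class {
	DNT_OTHER,    /* neither path applies */
	DNT_BUS,      /* MCACOD 0xe0f: bits 16+ decoded with dunnington_bus_status */
	DNT_FRONT,    /* MSCOD bits 3:0 index dnt_front_error[] */
	DNT_INTERNAL, /* MSCOD bits 11:8 index dnt_int_error[] */
	DNT_CECC,     /* MSCOD 0xc00x: correctable ECC on outgoing core data */
	DNT_UECC,     /* MSCOD 0xe00x: uncorrectable ECC on outgoing core data */
};

static enum dnt_class dnt_classify(uint64_t status)
{
	uint32_t mcacod = (uint32_t)(status & 0xffff);
	uint32_t mscod = (uint32_t)((status >> 16) & 0xffff);

	if (mcacod == 0xe0f)
		return DNT_BUS;
	if (mcacod != (1u << 10))
		return DNT_OTHER;

	/* Same order of tests as dunnington_decode_internal(). */
	if ((mscod & 0xfff0) == 0)
		return DNT_FRONT;
	if ((mscod & 0xf0ff) == 0)
		return DNT_INTERNAL;
	if ((mscod & 0xfff0) == 0xc000)
		return DNT_CECC;
	if ((mscod & 0xfff0) == 0xe000)
		return DNT_UECC;
	return DNT_OTHER;
}
```

The order matters: an MSCOD of 0 satisfies both of the first two masks, and the original code resolves it as a front-end error.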
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* See IA32 SDM Vol3B Table 16-20 */ static char *pcu_1[] = { [0x00] = "No Error", [0x09] = "MC_MESSAGE_CHANNEL_TIMEOUT", [0x0D] = "MC_IMC_FORCE_SR_S3_TIMEOUT", [0x0E] = "MC_CPD_UNCPD_SD_TIMEOUT", [0x13] = "MC_DMI_TRAINING_TIMEOUT", [0x15] = "MC_DMI_CPU_RESET_ACK_TIMEOUT", [0x1E] = "MC_VR_ICC_MAX_LT_FUSED_ICC_MAX", [0x25] = "MC_SVID_COMMAN_TIMEOUT", [0x29] = "MC_VR_VOUT_MAC_LT_FUSED_SVID", [0x2B] = "MC_PKGC_WATCHDOG_HANG_CBZ_DOWN", [0x2C] = "MC_PKGC_WATCHDOG_HANG_CBZ_UP", [0x39] = "MC_PKGC_WATCHDOG_HANG_C3_UP_SF", [0x44] = "MC_CRITICAL_VR_FAILED", [0x45] = "MC_ICC_MAX_NOTSUPPORTED", [0x46] = "MC_VID_RAMP_DOWN_FAILED", [0x47] = "MC_EXCL_MODE_NO_PMREQ_CMP", [0x48] = "MC_SVID_READ_REG_ICC_MAX_FAILED", [0x49] = "MC_SVID_WRITE_REG_VOUT_MAX_FAILED", [0x4B] = "MC_BOOT_VID_TIMEOUT_DRAM_0", [0x4C] = "MC_BOOT_VID_TIMEOUT_DRAM_1", [0x4D] = "MC_BOOT_VID_TIMEOUT_DRAM_2", [0x4E] = "MC_BOOT_VID_TIMEOUT_DRAM_3", [0x4F] = "MC_SVID_COMMAND_ERROR", [0x52] = "MC_FIVR_CATAS_OVERVOL_FAULT", [0x53] = "MC_FIVR_CATAS_OVERCUR_FAULT", [0x57] = "MC_SVID_PKGC_REQUEST_FAILED", [0x58] = "MC_SVID_IMON_REQUEST_FAILED", [0x59] = "MC_SVID_ALERT_REQUEST_FAILED", [0x60] = "MC_INVALID_PKGS_REQ_PCH", [0x61] = "MC_INVALID_PKGS_REQ_QPI", [0x62] = "MC_INVALID_PKGS_RSP_QPI", [0x63] = "MC_INVALID_PKGS_RSP_PCH", [0x64] = "MC_INVALID_PKG_STATE_CONFIG", [0x67] = "MC_HA_IMC_RW_BLOCK_ACK_TIMEOUT", [0x68] = "MC_IMC_RW_SMBUS_TIMEOUT", [0x69] = "MC_HA_FAILSTS_CHANGE_DETECTED", [0x6A] = "MC_MSGCH_PMREQ_CMP_TIMEOUT", [0x70] = "MC_WATCHDOG_TIMEOUT_PKGC_SLAVE", [0x71] = "MC_WATCHDOG_TIMEOUT_PKGC_MASTER", [0x72] = "MC_WATCHDOG_TIMEOUT_PKGS_MASTER", [0x7C] = "MC_BIOS_RST_CPL_INVALID_SEQ", [0x7D] = 
"MC_MORE_THAN_ONE_TXT_AGENT", [0x81] = "MC_RECOVERABLE_DIE_THERMAL_TOO_HOT" }; static struct field pcu_mc4[] = { FIELD(24, pcu_1), {} }; /* See IA32 SDM Vol3B Table 16-21 */ static char *qpi[] = { [0x02] = "Intel QPI physical layer detected drift buffer alarm", [0x03] = "Intel QPI physical layer detected latency buffer rollover", [0x10] = "Intel QPI link layer detected control error from R3QPI", [0x11] = "Rx entered LLR abort state on CRC error", [0x12] = "Unsupported or undefined packet", [0x13] = "Intel QPI link layer control error", [0x15] = "RBT used un-initialized value", [0x20] = "Intel QPI physical layer detected a QPI in-band reset but aborted initialization", [0x21] = "Link failover data self healing", [0x22] = "Phy detected in-band reset (no width change)", [0x23] = "Link failover clock failover", [0x30] = "Rx detected CRC error - successful LLR after Phy re-init", [0x31] = "Rx detected CRC error - successful LLR without Phy re-init", }; static struct field qpi_mc[] = { FIELD(16, qpi), {} }; /* See IA32 SDM Vol3B Table 16-22 */ static struct field memctrl_mc9[] = { SBITFIELD(16, "DDR3 address parity error"), SBITFIELD(17, "Uncorrected HA write data error"), SBITFIELD(18, "Uncorrected HA data byte enable error"), SBITFIELD(19, "Corrected patrol scrub error"), SBITFIELD(20, "Uncorrected patrol scrub error"), SBITFIELD(21, "Corrected spare error"), SBITFIELD(22, "Uncorrected spare error"), SBITFIELD(23, "Corrected memory read error"), SBITFIELD(24, "iMC write data buffer parity error"), SBITFIELD(25, "DDR4 command address parity error"), {} }; void hsw_decode_model(struct ras_events *ras, struct mce_event *e) { uint64_t status = e->status; uint32_t mca = status & 0xffff; unsigned int rank0 = -1, rank1 = -1, chan; switch (e->bank) { case 4: switch (EXTRACT(status, 0, 15) & ~(1ull << 12)) { case 0x402: case 0x403: mce_snprintf(e->mcastatus_msg, "PCU Internal Errors"); break; case 0x406: mce_snprintf(e->mcastatus_msg, "Intel TXT Errors"); break; case 0x407: 
mce_snprintf(e->mcastatus_msg, "Other UBOX Internal Errors"); break; } if (EXTRACT(status, 16, 17) && !EXTRACT(status, 18, 19)) mce_snprintf(e->error_msg, "PCU Internal error"); decode_bitfield(e, status, pcu_mc4); break; case 5: case 20: case 21: decode_bitfield(e, status, qpi_mc); break; case 9: case 10: case 11: case 12: case 13: case 14: case 15: case 16: decode_bitfield(e, status, memctrl_mc9); break; } /* * Memory error specific code. Returns if the error is not an MC one */ /* Check if the error is at the memory controller */ if ((mca >> 7) != 1) return; /* Ignore unless this is a corrected extended error from an iMC bank */ if (e->bank < 9 || e->bank > 16 || (status & MCI_STATUS_UC) || !test_prefix(7, status & 0xefff)) return; /* * Parse the reported channel and ranks */ chan = EXTRACT(status, 0, 3); if (chan == 0xf) return; mce_snprintf(e->mc_location, "memory_channel=%d", chan); if (EXTRACT(e->misc, 62, 62)) { rank0 = EXTRACT(e->misc, 46, 50); if (EXTRACT(e->misc, 63, 63)) rank1 = EXTRACT(e->misc, 51, 55); } /* * FIXME: The conversion from rank to dimm requires parsing the * DMI tables and calling failrank2dimm(). */ if (rank0 != -1 && rank1 != -1) mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1); else if (rank0 != -1) mce_snprintf(e->mc_location, "rank=%d", rank0); } 0707010000002C000081A400000000000000000000000165C04BE40000386A000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-i10nm.c/* * The code below came from Tony Luck's mcelog code, * released under the GNU General Public License, v.2 * * Copyright (C) 2019 Intel Corporation * Decode Intel 10nm specific machine check errors. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version.
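In mce-intel-i10nm.c, `i10nm_memerr_misc()` turns a bank number and the MCACOD channel field into a global channel number: banks 12-27 come in groups of four (one M2M bank followed by three iMC channel banks), and the result is `imc * 3 + chan`. The sketch below computes the same value arithmetically instead of with a `switch`. `i10nm_global_channel()` is a hypothetical name, and it assumes `test_prefix(7, v)` means `(v >> 7) == 1`, as in mcelog.

```c
#include <stdint.h>

/* Returns the global memory channel, or -1 where i10nm_memerr_misc() would bail out. */
static int i10nm_global_channel(unsigned int bank, uint64_t status)
{
	uint32_t mcacod = (uint32_t)(status & 0xefff); /* MCACOD with bit 12 cleared */
	unsigned int chan = (unsigned int)(status & 0xf);

	/* Equivalent of !test_prefix(7, status & 0xefff): not a memory error */
	if ((mcacod >> 7) != 1)
		return -1;
	if (chan == 0xf) /* channel not specified */
		return -1;
	if (bank < 12 || bank > 27) /* not an M2M/iMC bank */
		return -1;

	/* Banks 12-15 -> IMC 0, 16-19 -> IMC 1, 20-23 -> IMC 2, 24-27 -> IMC 3 */
	return (int)((bank - 12) / 4 * 3 + chan);
}
```

For example, a corrected error with channel field 2 reported by bank 22 (IMC 2) maps to global channel 8.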
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <inttypes.h> #include <stdio.h> #include <string.h> #include "ras-mce-handler.h" #include "bitfield.h" static char *pcu_1[] = { [0x0D] = "MCA_LLC_BIST_ACTIVE_TIMEOUT", [0x0E] = "MCA_DMI_TRAINING_TIMEOUT", [0x0F] = "MCA_DMI_STRAP_SET_ARRIVAL_TIMEOUT", [0x10] = "MCA_DMI_CPU_RESET_ACK_TIMEOUT", [0x11] = "MCA_MORE_THAN_ONE_LT_AGENT", [0x14] = "MCA_INCOMPATIBLE_PCH_TYPE", [0x1E] = "MCA_BIOS_RST_CPL_INVALID_SEQ", [0x1F] = "MCA_BIOS_INVALID_PKG_STATE_CONFIG", [0x2D] = "MCA_PCU_PMAX_CALIB_ERROR", [0x2E] = "MCA_TSC100_SYNC_TIMEOUT", [0x3A] = "MCA_GPSB_TIMEOUT", [0x3B] = "MCA_PMSB_TIMEOUT", [0x3E] = "MCA_IOSFSB_PMREQ_CMP_TIMEOUT", [0x40] = "MCA_SVID_VCCIN_VR_ICC_MAX_FAILURE", [0x42] = "MCA_SVID_VCCIN_VR_VOUT_FAILURE", [0x43] = "MCA_SVID_CPU_VR_CAPABILITY_ERROR", [0x44] = "MCA_SVID_CRITICAL_VR_FAILED", [0x45] = "MCA_SVID_SA_ITD_ERROR", [0x46] = "MCA_SVID_READ_REG_FAILED", [0x47] = "MCA_SVID_WRITE_REG_FAILED", [0x4A] = "MCA_SVID_PKGC_REQUEST_FAILED", [0x4B] = "MCA_SVID_IMON_REQUEST_FAILED", [0x4C] = "MCA_SVID_ALERT_REQUEST_FAILED", [0x4D] = "MCA_SVID_MCP_VR_RAMP_ERROR", [0x56] = "MCA_FIVR_PD_HARDERR", [0x58] = "MCA_WATCHDOG_TIMEOUT_PKGC_SLAVE", [0x59] = "MCA_WATCHDOG_TIMEOUT_PKGC_MASTER", [0x5A] = "MCA_WATCHDOG_TIMEOUT_PKGS_MASTER", [0x5B] = "MCA_WATCHDOG_TIMEOUT_MSG_CH_FSM", [0x5C] = "MCA_WATCHDOG_TIMEOUT_BULK_CR_FSM", [0x5D] = "MCA_WATCHDOG_TIMEOUT_IOSFSB_FSM", [0x60] = "MCA_PKGS_SAFE_WP_TIMEOUT", [0x61] = "MCA_PKGS_CPD_UNCPD_TIMEOUT", [0x62] = "MCA_PKGS_INVALID_REQ_PCH", [0x63] = "MCA_PKGS_INVALID_REQ_INTERNAL", 
[0x64] = "MCA_PKGS_INVALID_RSP_INTERNAL", [0x65 ... 0x7A] = "MCA_PKGS_RESET_PREP_TIMEOUT", [0x7B] = "MCA_PKGS_SMBUS_VPP_PAUSE_TIMEOUT", [0x7C] = "MCA_PKGS_SMBUS_MCP_PAUSE_TIMEOUT", [0x7D] = "MCA_PKGS_SMBUS_SPD_PAUSE_TIMEOUT", [0x80] = "MCA_PKGC_DISP_BUSY_TIMEOUT", [0x81] = "MCA_PKGC_INVALID_RSP_PCH", [0x83] = "MCA_PKGC_WATCHDOG_HANG_CBZ_DOWN", [0x84] = "MCA_PKGC_WATCHDOG_HANG_CBZ_UP", [0x87] = "MCA_PKGC_WATCHDOG_HANG_C2_BLKMASTER", [0x88] = "MCA_PKGC_WATCHDOG_HANG_C2_PSLIMIT", [0x89] = "MCA_PKGC_WATCHDOG_HANG_SETDISP", [0x8B] = "MCA_PKGC_ALLOW_L1_ERROR", [0x90] = "MCA_RECOVERABLE_DIE_THERMAL_TOO_HOT", [0xA0] = "MCA_ADR_SIGNAL_TIMEOUT", [0xA1] = "MCA_BCLK_FREQ_OC_ABOVE_THRESHOLD", [0xB0] = "MCA_DISPATCHER_RUN_BUSY_TIMEOUT", }; static char *pcu_2[] = { [0x04] = "Clock/power IP response timeout", [0x05] = "SMBus controller raised SMI", [0x09] = "PM controller received invalid transaction", }; static char *pcu_3[] = { [0x01] = "Instruction address out of valid space", [0x02] = "Double bit RAM error on Instruction Fetch", [0x03] = "Invalid OpCode seen", [0x04] = "Stack Underflow", [0x05] = "Stack Overflow", [0x06] = "Data address out of valid space", [0x07] = "Double bit RAM error on Data Fetch", }; static struct field pcu1[] = { FIELD(0, pcu_1), {} }; static struct field pcu2[] = { FIELD(0, pcu_2), {} }; static struct field pcu3[] = { FIELD(0, pcu_3), {} }; static struct field upi1[] = { SBITFIELD(22, "Phy Control Error"), SBITFIELD(23, "Unexpected Retry.Ack flit"), SBITFIELD(24, "Unexpected Retry.Req flit"), SBITFIELD(25, "RF parity error"), SBITFIELD(26, "Routeback Table error"), SBITFIELD(27, "Unexpected Tx Protocol flit (EOP, Header or Data)"), SBITFIELD(28, "Rx Header-or-Credit BGF credit overflow/underflow"), SBITFIELD(29, "Link Layer Reset still in progress when Phy enters L0"), SBITFIELD(30, "Link Layer reset initiated while protocol traffic not idle"), SBITFIELD(31, "Link Layer Tx Parity Error"), {} }; static char *upi_2[] = { [0x00] = "Phy Initialization 
Failure (NumInit)", [0x01] = "Phy Detected Drift Buffer Alarm", [0x02] = "Phy Detected Latency Buffer Rollover", [0x10] = "LL Rx detected CRC error: unsuccessful LLR (entered Abort state)", [0x11] = "LL Rx Unsupported/Undefined packet", [0x12] = "LL or Phy Control Error", [0x13] = "LL Rx Parameter Exception", [0x1F] = "LL Detected Control Error", [0x20] = "Phy Initialization Abort", [0x21] = "Phy Inband Reset", [0x22] = "Phy Lane failure, recovery in x8 width", [0x23] = "Phy L0c error corrected without Phy reset", [0x24] = "Phy L0c error triggering Phy reset", [0x25] = "Phy L0p exit error corrected with reset", [0x30] = "LL Rx detected CRC error: successful LLR without Phy Reinit", [0x31] = "LL Rx detected CRC error: successful LLR with Phy Reinit", [0x32] = "Tx received LLR", }; static struct field upi2[] = { FIELD(0, upi_2), {} }; static struct field m2m[] = { SBITFIELD(16, "MC read data error"), SBITFIELD(17, "Reserved"), SBITFIELD(18, "MC partial write data error"), SBITFIELD(19, "Full write data error"), SBITFIELD(20, "M2M clock-domain-crossing buffer (BGF) error"), SBITFIELD(21, "M2M time out"), SBITFIELD(22, "M2M tracker parity error"), SBITFIELD(23, "fatal Bucket1 error"), {} }; static char *imc_0[] = { [0x01] = "Address parity error", [0x02] = "Data parity error", [0x03] = "Data ECC error", [0x04] = "Data byte enable parity error", [0x07] = "Transaction ID parity error", [0x08] = "Corrected patrol scrub error", [0x10] = "Uncorrected patrol scrub error", [0x20] = "Corrected spare error", [0x40] = "Uncorrected spare error", [0x80] = "Corrected read error", [0xA0] = "Uncorrected read error", [0xC0] = "Uncorrected metadata", }; static char *imc_1[] = { [0x00] = "WDB read parity error", [0x03] = "RPA parity error", [0x06] = "DDR_T_DPPP data BE error", [0x07] = "DDR_T_DPPP data error", [0x08] = "DDR link failure", [0x11] = "PCLS CAM error", [0x12] = "PCLS data error", }; static char *imc_2[] = { [0x00] = "DDR4 command / address parity error", [0x20] = "HBM 
command / address parity error", [0x21] = "HBM data parity error", }; static char *imc_4[] = { [0x00] = "RPQ parity (primary) error", }; static char *imc_8[] = { [0x00] = "DDR-T bad request", [0x01] = "DDR Data response to an invalid entry", [0x02] = "DDR data response to an entry not expecting data", [0x03] = "DDR4 completion to an invalid entry", [0x04] = "DDR-T completion to an invalid entry", [0x05] = "DDR data/completion FIFO overflow", [0x06] = "DDR-T ERID correctable parity error", [0x07] = "DDR-T ERID uncorrectable error", [0x08] = "DDR-T interrupt received while outstanding interrupt was not ACKed", [0x09] = "ERID FIFO overflow", [0x0A] = "DDR-T error on FNV write credits", [0x0B] = "DDR-T error on FNV read credits", [0x0C] = "DDR-T scheduler error", [0x0D] = "DDR-T FNV error event", [0x0E] = "DDR-T FNV thermal event", [0x0F] = "CMI packet while idle", [0x10] = "DDR_T_RPQ_REQ_PARITY_ERR", [0x11] = "DDR_T_WPQ_REQ_PARITY_ERR", [0x12] = "2LM_NMFILLWR_CAM_ERR", [0x13] = "CMI_CREDIT_OVERSUB_ERR", [0x14] = "CMI_CREDIT_TOTAL_ERR", [0x15] = "CMI_CREDIT_RSVD_POOL_ERR", [0x16] = "DDR_T_RD_ERROR", [0x17] = "WDB_FIFO_ERR", [0x18] = "CMI_REQ_FIFO_OVERFLOW", [0x19] = "CMI_REQ_FIFO_UNDERFLOW", [0x1A] = "CMI_RSP_FIFO_OVERFLOW", [0x1B] = "CMI_RSP_FIFO_UNDERFLOW", [0x1C] = "CMI_MISC_MC_CRDT_ERRORS", [0x1D] = "CMI_MISC_MC_ARB_ERRORS", [0x1E] = "DDR_T_WR_CMPL_FIFO_OVERFLOW", [0x1F] = "DDR_T_WR_CMPL_FIFO_UNDERFLOW", [0x20] = "CMI_RD_CPL_FIFO_OVERFLOW", [0x21] = "CMI_RD_CPL_FIFO_UNDERFLOW", [0x22] = "TME_KEY_PAR_ERR", [0x23] = "TME_CMI_MISC_ERR", [0x24] = "TME_CMI_OVFL_ERR", [0x25] = "TME_CMI_UFL_ERR", [0x26] = "TME_TEM_SECURE_ERR", [0x27] = "TME_UFILL_PAR_ERR", [0x29] = "INTERNAL_ERR", [0x2A] = "TME_INTEGRITY_ERR", [0x2B] = "TME_TDX_ERR", [0x2C] = "TME_UFILL_TEM_SECURE_ERR", [0x2D] = "TME_KEY_POISON_ERR", [0x2E] = "TME_SECURITY_ENGINE_ERR", }; static char *imc_10[] = { [0x08] = "CORR_PATSCRUB_MIRR2ND_ERR", [0x10] = "UC_PATSCRUB_MIRR2ND_ERR", [0x20] =
"COR_SPARE_MIRR2ND_ERR", [0x40] = "UC_SPARE_MIRR2ND_ERR", [0x80] = "HA_RD_MIRR2ND_ERR", [0xA0] = "HA_UNCORR_RD_MIRR2ND_ERR", }; static struct field imc0[] = { FIELD(0, imc_0), {} }; static struct field imc1[] = { FIELD(0, imc_1), {} }; static struct field imc2[] = { FIELD(0, imc_2), {} }; static struct field imc4[] = { FIELD(0, imc_4), {} }; static struct field imc8[] = { FIELD(0, imc_8), {} }; static struct field imc10[] = { FIELD(0, imc_10), {} }; static void i10nm_imc_misc(struct mce_event *e) { uint32_t column = EXTRACT(e->misc, 9, 18) << 2; uint32_t row = EXTRACT(e->misc, 19, 39); uint32_t bank = EXTRACT(e->misc, 42, 43); uint32_t bankgroup = EXTRACT(e->misc, 40, 41) | (EXTRACT(e->misc, 44, 44) << 2); uint32_t fdevice = EXTRACT(e->misc, 46, 51); uint32_t subrank = EXTRACT(e->misc, 52, 55); uint32_t rank = EXTRACT(e->misc, 56, 58); uint32_t eccmode = EXTRACT(e->misc, 59, 62); uint32_t transient = EXTRACT(e->misc, 63, 63); mce_snprintf(e->error_msg, "bank: 0x%x bankgroup: 0x%x row: 0x%x column: 0x%x", bank, bankgroup, row, column); if (!transient && !EXTRACT(e->status, 61, 61)) mce_snprintf(e->error_msg, "failed device: 0x%x", fdevice); mce_snprintf(e->error_msg, "rank: 0x%x subrank: 0x%x", rank, subrank); mce_snprintf(e->error_msg, "ecc mode: "); switch (eccmode) { case 0: mce_snprintf(e->error_msg, "SDDC memory mode"); break; case 1: mce_snprintf(e->error_msg, "SDDC"); break; case 4: mce_snprintf(e->error_msg, "ADDDC memory mode"); break; case 5: mce_snprintf(e->error_msg, "ADDDC"); break; case 8: mce_snprintf(e->error_msg, "DDRT read"); break; default: mce_snprintf(e->error_msg, "unknown"); break; } if (transient) mce_snprintf(e->error_msg, "transient"); } enum banktype { BT_UNKNOWN, BT_PCU, BT_UPI, BT_M2M, BT_IMC, }; static enum banktype icelake[32] = { [4] = BT_PCU, [5] = BT_UPI, [7 ... 8] = BT_UPI, [12] = BT_M2M, [16] = BT_M2M, [20] = BT_M2M, [24] = BT_M2M, [13 ... 15] = BT_IMC, [17 ... 19] = BT_IMC, [21 ... 23] = BT_IMC, [25 ... 
27] = BT_IMC, }; static enum banktype icelake_de[32] = { [4] = BT_PCU, [12] = BT_M2M, [16] = BT_M2M, [13 ... 15] = BT_IMC, [17 ... 19] = BT_IMC, }; static enum banktype tremont[32] = { [4] = BT_PCU, [12] = BT_M2M, [13 ... 15] = BT_IMC, }; static enum banktype sapphire[32] = { [4] = BT_PCU, [5] = BT_UPI, [12] = BT_M2M, [13 ... 20] = BT_IMC, }; void i10nm_memerr_misc(struct mce_event *e, int *channel); void i10nm_decode_model(enum cputype cputype, struct ras_events *ras, struct mce_event *e) { enum banktype banktype; uint64_t f, status = e->status; uint32_t mca = status & 0xffff; int channel = -1; switch (cputype) { case CPU_ICELAKE_XEON: banktype = icelake[e->bank]; break; case CPU_ICELAKE_DE: banktype = icelake_de[e->bank]; break; case CPU_TREMONT_D: banktype = tremont[e->bank]; break; case CPU_SAPPHIRERAPIDS: case CPU_EMERALDRAPIDS: banktype = sapphire[e->bank]; break; default: return; } switch (banktype) { case BT_UNKNOWN: break; case BT_PCU: mce_snprintf(e->error_msg, "PCU: "); f = EXTRACT(status, 24, 31); if (f) decode_bitfield(e, f, pcu1); f = EXTRACT(status, 20, 23); if (f) decode_bitfield(e, f, pcu2); f = EXTRACT(status, 16, 19); if (f) decode_bitfield(e, f, pcu3); break; case BT_UPI: mce_snprintf(e->error_msg, "UPI: "); f = EXTRACT(status, 22, 31); if (f) decode_bitfield(e, status, upi1); f = EXTRACT(status, 16, 21); decode_bitfield(e, f, upi2); break; case BT_M2M: mce_snprintf(e->error_msg, "M2M: "); f = EXTRACT(status, 24, 25); mce_snprintf(e->error_msg, "MscodDDRType=0x%" PRIx64, f); f = EXTRACT(status, 26, 31); mce_snprintf(e->error_msg, "MscodMiscErrs=0x%" PRIx64, f); decode_bitfield(e, status, m2m); break; case BT_IMC: mce_snprintf(e->error_msg, "MemCtrl: "); f = EXTRACT(status, 16, 23); switch (EXTRACT(status, 24, 31)) { case 0: decode_bitfield(e, f, imc0); break; case 1: decode_bitfield(e, f, imc1); break; case 2: decode_bitfield(e, f, imc2); break; case 4: decode_bitfield(e, f, imc4); break; case 8: decode_bitfield(e, f, imc8); break; case 0x10: 
decode_bitfield(e, f, imc10); break; } i10nm_imc_misc(e); break; } /* * Memory error specific code. Returns if the error is not an MC one */ /* Check if the error is at the memory controller */ if ((mca >> 7) != 1) return; /* Ignore unless this is a corrected extended error from an iMC bank */ if (banktype != BT_IMC || (status & MCI_STATUS_UC)) return; /* * Parse the reported channel */ i10nm_memerr_misc(e, &channel); if (channel == -1) return; mce_snprintf(e->mc_location, "memory_channel=%d", channel); } /* * There isn't enough information to identify the DIMM. But * we can derive the channel from the bank number. * There can be four memory controllers with three channels each. */ void i10nm_memerr_misc(struct mce_event *e, int *channel) { uint64_t status = e->status; unsigned int chan, imc; /* Check this is a memory error */ if (!test_prefix(7, status & 0xefff)) return; chan = EXTRACT(status, 0, 3); if (chan == 0xf) return; switch (e->bank) { case 12: /* M2M 0 */ case 13: /* IMC 0, Channel 0 */ case 14: /* IMC 0, Channel 1 */ case 15: /* IMC 0, Channel 2 */ imc = 0; break; case 16: /* M2M 1 */ case 17: /* IMC 1, Channel 0 */ case 18: /* IMC 1, Channel 1 */ case 19: /* IMC 1, Channel 2 */ imc = 1; break; case 20: /* M2M 2 */ case 21: /* IMC 2, Channel 0 */ case 22: /* IMC 2, Channel 1 */ case 23: /* IMC 2, Channel 2 */ imc = 2; break; case 24: /* M2M 3 */ case 25: /* IMC 3, Channel 0 */ case 26: /* IMC 3, Channel 1 */ case 27: /* IMC 3, Channel 2 */ imc = 3; break; default: return; } channel[0] = imc * 3 + chan; } 0707010000002D000081A400000000000000000000000165C04BE400001345000000000000000000000000000000000000002F00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-ivb.c/* * The code below came from Tony Luck's mcelog code, * released under the GNU General Public License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the
License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* See IA32 SDM Vol3B Table 16-17 */ static char *pcu_1[] = { [0] = "No error", [1] = "Non_IMem_Sel", [2] = "I_Parity_Error", [3] = "Bad_OpCode", [4] = "I_Stack_Underflow", [5] = "I_Stack_Overflow", [6] = "D_Stack_Underflow", [7] = "D_Stack_Overflow", [8] = "Non-DMem_Sel", [9] = "D_Parity_Error" }; static char *pcu_2[] = { [0x00] = "No Error", [0x0D] = "MC_IMC_FORCE_SR_S3_TIMEOUT", [0x0E] = "MC_MC_CPD_UNCPD_ST_TIMEOUT", [0x0F] = "MC_PKGS_SAFE_WP_TIMEOUT", [0x43] = "MC_PECI_MAILBOX_QUIESCE_TIMEOUT", [0x44] = "MC_CRITICAL_VR_FAILED", [0x45] = "MC_ICC_MAX-NOTSUPPORTED", [0x5C] = "MC_MORE_THAN_ONE_LT_AGENT", [0x60] = "MC_INVALID_PKGS_REQ_PCH", [0x61] = "MC_INVALID_PKGS_REQ_QPI", [0x62] = "MC_INVALID_PKGS_RES_QPI", [0x63] = "MC_INVALID_PKGC_RES_PCH", [0x64] = "MC_INVALID_PKG_STATE_CONFIG", [0x70] = "MC_WATCHDG_TIMEOUT_PKGC_SLAVE", [0x71] = "MC_WATCHDG_TIMEOUT_PKGC_MASTER", [0x72] = "MC_WATCHDG_TIMEOUT_PKGS_MASTER", [0x7A] = "MC_HA_FAILSTS_CHANGE_DETECTED", [0x7B] = "MC_PCIE_R2PCIE-RW_BLOCK_ACK_TIMEOUT", [0x81] = "MC_RECOVERABLE_DIE_THERMAL_TOO_HOT", }; static struct field pcu_mc4[] = { FIELD(16, pcu_1), FIELD(24, pcu_2), {} }; /* See IA32 SDM Vol3B Table 16-18 */ static char *memctrl_1[] = { [0x001] = "Address parity error", [0x002] = "HA Wrt buffer Data parity error", [0x004] = "HA Wrt byte enable parity error", [0x008] = "Corrected patrol scrub error", [0x010] = "Uncorrected patrol 
scrub error", [0x020] = "Corrected spare error", [0x040] = "Uncorrected spare error", [0x080] = "Corrected memory read error", [0x100] = "iMC, WDB, parity errors", }; static struct field memctrl_mc9[] = { FIELD(16, memctrl_1), {} }; void ivb_decode_model(struct ras_events *ras, struct mce_event *e) { struct mce_priv *mce = ras->mce_priv; uint64_t status = e->status; uint32_t mca = status & 0xffff; unsigned int rank0 = -1, rank1 = -1, chan; switch (e->bank) { case 4: // Wprintf("PCU: "); decode_bitfield(e, e->status, pcu_mc4); // Wprintf("\n"); break; case 5: if (mce->cputype == CPU_IVY_BRIDGE_EPEX) { /* MCACOD already decoded */ mce_snprintf(e->bank_name, "QPI"); } break; case 9: case 10: case 11: case 12: case 13: case 14: case 15: case 16: // Wprintf("MemCtrl: "); decode_bitfield(e, e->status, memctrl_mc9); break; } /* * Memory error specific code. Returns if the error is not an MC one */ /* Check if the error is at the memory controller */ if ((mca >> 7) != 1) return; /* Ignore unless this is a corrected extended error from an iMC bank */ if (e->bank < 9 || e->bank > 16 || (status & MCI_STATUS_UC) || !test_prefix(7, status & 0xefff)) return; /* * Parse the reported channel and ranks */ chan = EXTRACT(status, 0, 3); if (chan == 0xf) return; mce_snprintf(e->mc_location, "memory_channel=%d", chan); if (EXTRACT(e->misc, 62, 62)) rank0 = EXTRACT(e->misc, 46, 50); if (EXTRACT(e->misc, 63, 63)) rank1 = EXTRACT(e->misc, 51, 55); /* * FIXME: The conversion from rank to dimm requires parsing the * DMI tables and calling failrank2dimm(). */ /* rank0 and rank1 are unsigned, so test them against the -1 sentinel (a >= 0 test is always true) */ if (rank0 != -1 && rank1 != -1) mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1); else if (rank0 != -1) mce_snprintf(e->mc_location, "rank=%d", rank0); else if (rank1 != -1) mce_snprintf(e->mc_location, "rank=%d", rank1); } /* * Ivy Bridge EP and EX processors (family 6, model 62) support additional * logging for corrected errors in the integrated memory controller (IMC) * banks.
The mode is off by default, but can be enabled by setting the * "MemError Log Enable" * bit in MSR_ERROR_CONTROL (MSR 0x17f). * The SDM leaves it as an exercise for the reader to convert the * failing rank to a DIMM slot. */ #if 0 static int failrank2dimm(unsigned int failrank, int socket, int channel) { switch (failrank) { case 0: case 1: case 2: case 3: return 0; case 4: case 5: return 1; case 6: case 7: if (get_memdimm(socket, channel, 2, 0)) return 2; else return 1; } return -1; } #endif 0707010000002E000081A400000000000000000000000165C04BE400000EA1000000000000000000000000000000000000002F00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-knl.c/* * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details.
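The disabled `failrank2dimm()` in mce-intel-ivb.c maps an Ivy Bridge failing rank to a DIMM slot: ranks 0-3 belong to slot 0, ranks 4-5 to slot 1, and ranks 6-7 to slot 2 only when a third DIMM is populated on that channel (otherwise slot 1). The original asks `get_memdimm(socket, channel, 2, 0)` for that fact; in the sketch below the answer is passed in as a hypothetical `third_dimm_present` parameter, since how rasdaemon would obtain it from the DMI tables is left open by the FIXME comments.

```c
#include <stdbool.h>

/*
 * Sketch of the #if 0 failrank2dimm() logic.
 * third_dimm_present is a hypothetical stand-in for get_memdimm(socket, channel, 2, 0).
 * Returns the DIMM slot, or -1 for a rank outside 0..7.
 */
static int failrank_to_dimm(unsigned int failrank, bool third_dimm_present)
{
	switch (failrank) {
	case 0: case 1: case 2: case 3:
		return 0;
	case 4: case 5:
		return 1;
	case 6: case 7:
		return third_dimm_present ? 2 : 1;
	default:
		return -1;
	}
}
```

Taking the population check as a parameter keeps the mapping a pure function that is easy to test, at the cost of pushing the DMI lookup onto the caller.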
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" static struct field memctrl_mc7[] = { SBITFIELD(16, "CA Parity error"), SBITFIELD(17, "Internal Parity error except WDB"), SBITFIELD(18, "Internal Parity error from WDB"), SBITFIELD(19, "Correctable Patrol Scrub"), SBITFIELD(20, "Uncorrectable Patrol Scrub"), SBITFIELD(21, "Spare Correctable Error"), SBITFIELD(22, "Spare UC Error"), SBITFIELD(23, "CORR Chip fail even MC only, 4 bit burst error EDC only"), {} }; void knl_decode_model(struct ras_events *ras, struct mce_event *e) { uint64_t status = e->status; uint32_t mca = status & 0xffff; unsigned int rank0 = -1, rank1 = -1, chan = 0; switch (e->bank) { case 5: switch (EXTRACT(status, 0, 15)) { case 0x402: mce_snprintf(e->mcastatus_msg, "PCU Internal Errors"); break; case 0x403: mce_snprintf(e->mcastatus_msg, "VCU Internal Errors"); break; case 0x407: mce_snprintf(e->mcastatus_msg, "Other UBOX Internal Errors"); break; } break; case 7: case 8: case 9: case 10: case 11: case 12: case 13: case 14: case 15: case 16: if ((EXTRACT(status, 0, 15)) == 0x5) { mce_snprintf(e->mcastatus_msg, "Internal Parity error"); } else { chan = (EXTRACT(status, 0, 3)) + 3 * (e->bank == 15); switch (EXTRACT(status, 4, 7)) { case 0x0: mce_snprintf(e->mcastatus_msg, "Undefined request on channel %d", chan); break; case 0x1: mce_snprintf(e->mcastatus_msg, "Read on channel %d", chan); break; case 0x2: mce_snprintf(e->mcastatus_msg, "Write on channel %d", chan); break; case 0x3: mce_snprintf(e->mcastatus_msg, "CA error on channel %d", chan); break; case 0x4: mce_snprintf(e->mcastatus_msg, "Scrub error on channel %d", chan); break; } } decode_bitfield(e, status, memctrl_mc7); break; default: break; } /* * Memory error specific code. 
Returns if the error is not an MC one */ /* Check if the error is at the memory controller */ if ((mca >> 7) != 1) return; /* Ignore unless this is a corrected extended error from an iMC bank */ if (e->bank < 7 || e->bank > 16 || (status & MCI_STATUS_UC) || !test_prefix(7, status & 0xefff)) return; /* * Parse the reported channel and ranks */ chan = EXTRACT(status, 0, 3); if (chan == 0xf) { mce_snprintf(e->mc_location, "memory_channel=unspecified"); } else { chan = chan + 3 * (e->bank == 15); mce_snprintf(e->mc_location, "memory_channel=%d", chan); if (EXTRACT(e->misc, 62, 62)) rank0 = EXTRACT(e->misc, 46, 50); if (EXTRACT(e->misc, 63, 63)) rank1 = EXTRACT(e->misc, 51, 55); /* * FIXME: The conversion from rank to dimm requires parsing the * DMI tables and calling failrank2dimm(). */ if (rank0 != -1 && rank1 != -1) mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1); else if (rank0 != -1) mce_snprintf(e->mc_location, "rank=%d", rank0); } } 0707010000002F000081A400000000000000000000000165C04BE400001406000000000000000000000000000000000000003300000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-nehalem.c/* * The code below came from Andi Kleen/Intel/SuSe mcelog code, * released under the GNU General Public License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details.
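`knl_decode_model()` in mce-intel-knl.c accepts corrected memory-controller errors from banks 7 through 16, takes the channel from MCACOD bits 3:0, and offsets channels reported by bank 15 by 3; a channel field of 0xf is printed as "memory_channel=unspecified". The sketch below separates those two steps into hypothetical helpers, `knl_is_corrected_imc_error()` and `knl_channel()`, that mirror only the tests visible in the decoder, with -1 standing in for "unspecified".

```c
#include <stdbool.h>
#include <stdint.h>

/* Same gate as the decoder: corrected MCACOD 0000_0000_1MMM_CCCC error from banks 7..16. */
static bool knl_is_corrected_imc_error(unsigned int bank, uint64_t status)
{
	uint32_t mcacod = (uint32_t)(status & 0xffff);

	return (mcacod >> 7) == 1 && bank >= 7 && bank <= 16 &&
	       !(status & (1ULL << 61)); /* MCi_STATUS.UC clear */
}

/* Channel as printed in mc_location; -1 where the decoder prints "unspecified". */
static int knl_channel(unsigned int bank, uint64_t status)
{
	unsigned int chan = (unsigned int)(status & 0xf); /* EXTRACT(status, 0, 3) */

	if (chan == 0xf)
		return -1;
	return (int)(chan + 3 * (bank == 15));
}
```

For instance, channel field 1 reported by bank 15 is printed as channel 4, while the same field from bank 7 stays channel 1.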
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

#include <string.h>
#include <stdio.h>
#include "ras-mce-handler.h"
#include "bitfield.h"

/* See IA32 SDM Vol3B Appendix E.3.2 ff */

/* MC1_STATUS error */
static struct field qpi_status[] = {
	SBITFIELD(16, "QPI header had bad parity"),
	SBITFIELD(17, "QPI Data packet had bad parity"),
	SBITFIELD(18, "Number of QPI retries exceeded"),
	SBITFIELD(19, "Received QPI data packet that was poisoned by sender"),
	SBITFIELD(20, "QPI reserved 20"),
	SBITFIELD(21, "QPI reserved 21"),
	SBITFIELD(22, "QPI received unsupported message encoding"),
	SBITFIELD(23, "QPI credit type is not supported"),
	SBITFIELD(24, "Sender sent too many QPI flits to the receiver"),
	SBITFIELD(25, "QPI Sender sent a failed response to receiver"),
	SBITFIELD(26, "Clock jitter detected in internal QPI clocking"),
	{}
};

static struct field qpi_misc[] = {
	SBITFIELD(14, "QPI misc reserved 14"),
	SBITFIELD(15, "QPI misc reserved 15"),
	SBITFIELD(24, "QPI Interleave/Head Indication Bit (IIB)"),
	{}
};

static struct numfield qpi_numbers[] = {
	HEXNUMBER(0, 7, "QPI class and opcode of packet with error"),
	HEXNUMBER(8, 13, "QPI Request Transaction ID"),
	NUMBERFORCE(16, 18, "QPI Requestor/Home Node ID (RHNID)"),
	HEXNUMBER(19, 23, "QPI misc reserved 19-23"),
	{},
};

static struct field nhm_memory_status[] = {
	SBITFIELD(16, "Memory read ECC error"),
	SBITFIELD(17, "Memory ECC error occurred during scrub"),
	SBITFIELD(18, "Memory write parity error"),
	SBITFIELD(19, "Memory error in half of redundant memory"),
	SBITFIELD(20, "Memory reserved 20"),
	SBITFIELD(21, "Memory access out of range"),
	SBITFIELD(22, "Memory internal RTID invalid"),
	SBITFIELD(23, "Memory address parity error"),
	SBITFIELD(24, "Memory byte enable parity error"),
	{}
};

static struct numfield nhm_memory_status_numbers[] = {
	HEXNUMBER(25, 37, "Memory MISC
reserved 25..37"), NUMBERFORCE(38, 52, "Memory corrected error count (CORE_ERR_CNT)"), HEXNUMBER(53, 56, "Memory MISC reserved 53..56"), {} }; static struct numfield nhm_memory_misc_numbers[] = { HEXNUMBERFORCE(0, 7, "Memory transaction Tracker ID (RTId)"), NUMBERFORCE(16, 17, "Memory DIMM ID of error"), NUMBERFORCE(18, 19, "Memory channel ID of error"), HEXNUMBERFORCE(32, 63, "Memory ECC syndrome"), {} }; static char *internal_errors[] = { [0x0] = "No Error", [0x3] = "Reset firmware did not complete", [0x8] = "Received an invalid CMPD", [0xa] = "Invalid Power Management Request", [0xd] = "Invalid S-state transition", [0x11] = "VID controller does not match POC controller selected", [0x1a] = "MSID from POC does not match CPU MSID", }; static struct field internal_error_status[] = { FIELD(24, internal_errors), {} }; static struct numfield internal_error_numbers[] = { HEXNUMBER(16, 23, "Internal machine check status reserved 16..23"), HEXNUMBER(32, 56, "Internal machine check status reserved 32..56"), {}, }; /* Generic architectural memory controller encoding */ void nehalem_decode_model(struct mce_event *e) { uint64_t status = e->status; uint32_t mca = status & 0xffff; uint64_t misc = e->misc; unsigned int channel, dimm; if ((mca >> 11) == 1) { /* bus and interconnect QPI */ decode_bitfield(e, status, qpi_status); if (status & MCI_STATUS_MISCV) { decode_numfield(e, misc, qpi_numbers); decode_bitfield(e, misc, qpi_misc); } } else if (mca == 0x0001) { /* internal unspecified */ decode_bitfield(e, status, internal_error_status); decode_numfield(e, status, internal_error_numbers); } else if ((mca >> 7) == 1) { /* memory controller */ decode_bitfield(e, status, nhm_memory_status); decode_numfield(e, status, nhm_memory_status_numbers); if (status & MCI_STATUS_MISCV) decode_numfield(e, misc, nhm_memory_misc_numbers); } if ((((status & 0xffff) >> 7) == 1) && (status & MCI_STATUS_MISCV)) { channel = EXTRACT(e->misc, 18, 19); dimm = EXTRACT(e->misc, 16, 17); 
mce_snprintf(e->mc_location, "channel=%d, dimm=%d", channel, dimm); } } /* Only core errors supported. Same as Nehalem */ void xeon75xx_decode_model(struct mce_event *e) { uint64_t status = e->status; uint32_t mca = status & 0xffff; if (mca == 0x0001) { /* internal unspecified */ decode_bitfield(e, status, internal_error_status); decode_numfield(e, status, internal_error_numbers); } } 07070100000030000081A400000000000000000000000165C04BE400001118000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-p4-p6.c/* * The code below came from Andi Kleen/Intel/SuSe mcelog code, * released under GNU Public General License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

#include <string.h>
#include <stdio.h>
#include "ras-mce-handler.h"
#include "bitfield.h"

/* Decode P4 and P6 family (p6old and Core2) model specific errors */

/* [19..24] */
static char *bus_queue_req_type[] = {
	[0] = "BQ_DCU_READ_TYPE",
	[2] = "BQ_IFU_DEMAND_TYPE",
	[3] = "BQ_IFU_DEMAND_NC_TYPE",
	[4] = "BQ_DCU_RFO_TYPE",
	[5] = "BQ_DCU_RFO_LOCK_TYPE",
	[6] = "BQ_DCU_ITOM_TYPE",
	[8] = "BQ_DCU_WB_TYPE",
	[10] = "BC_DCU_WCEVICT_TYPE",
	[11] = "BQ_DCU_WCLINE_TYPE",
	[12] = "BQ_DCU_BTM_TYPE",
	[13] = "BQ_DCU_INTACK_TYPE",
	[14] = "BQ_DCU_INVALL2_TYPE",
	[15] = "BQ_DCU_FLUSHL2_TYPE",
	[16] = "BQ_DCU_PART_RD_TYPE",
	[18] = "BQ_DCU_PART_WR_TYPE",
	[20] = "BQ_DCU_SPEC_CYC_TYPE",
	[24] = "BQ_DCU_IO_RD_TYPE",
	[25] = "BQ_DCU_IO_WR_TYPE",
	[28] = "BQ_DCU_LOCK_RD_TYPE",
	[29] = "BQ_DCU_LOCK_WR_TYPE",
	[30] = "BQ_DCU_SPLOCK_RD_TYPE",
};

/* [25..27] */
static char *bus_queue_error_type[] = {
	[0] = "BQ_ERR_HARD_TYPE",
	[1] = "BQ_ERR_DOUBLE_TYPE",
	[2] = "BQ_ERR_AERR2_TYPE",
	[4] = "BQ_ERR_SINGLE_TYPE",
	[5] = "BQ_ERR_AERR1_TYPE",
};

static struct field p6_shared_status[] = {
	FIELD_NULL(16),
	FIELD(19, bus_queue_req_type),
	/* Listed once: a duplicate entry would print the error type twice */
	FIELD(25, bus_queue_error_type),
	SBITFIELD(30, "internal BINIT"),
	SBITFIELD(36, "received parity error on response transaction"),
	SBITFIELD(38, "timeout BINIT (ROB timeout). No micro-instruction retired for some time"),
	FIELD_NULL(39),
	SBITFIELD(42, "bus transaction received hard error response"),
	SBITFIELD(43, "failure that caused IERR"),
	/* The following are reserved for Core in the SDM.
Let's keep them here anyways*/ SBITFIELD(44, "two failing bus transactions with address parity error (AERR)"), SBITFIELD(45, "uncorrectable ECC error"), SBITFIELD(46, "correctable ECC error"), /* [47..54]: ECC syndrome */ FIELD_NULL(55), {}, }; static struct field p6old_status[] = { SBITFIELD(28, "FRC error"), SBITFIELD(29, "BERR on this CPU"), FIELD_NULL(31), FIELD_NULL(32), SBITFIELD(35, "BINIT received from external bus"), SBITFIELD(37, "Received hard error response on split transaction (Bus BINIT)"), {} }; static struct field core2_status[] = { SBITFIELD(28, "MCE driven"), SBITFIELD(29, "MCE is observed"), SBITFIELD(31, "BINIT observed"), FIELD_NULL(32), SBITFIELD(34, "PIC or FSB data parity error"), FIELD_NULL(35), SBITFIELD(37, "FSB address parity error detected"), {} }; static struct numfield p6old_status_numbers[] = { HEXNUMBER(47, 54, "ECC syndrome"), {} }; static struct { int value; char *str; } p4_model[] = { {16, "FSB address parity"}, {17, "Response hard fail"}, {18, "Response parity"}, {19, "PIC and FSB data parity"}, {20, "Invalid PIC request(Signature=0xF04H)"}, {21, "Pad state machine"}, {22, "Pad strobe glitch"}, {23, "Pad address glitch"} }; void p4_decode_model(struct mce_event *e) { uint32_t model = e->status & 0xffff0000L; unsigned int i; for (i = 0; i < ARRAY_SIZE(p4_model); i++) { if (model & (1 << p4_model[i].value)) mce_snprintf(e->error_msg, "%s", p4_model[i].str); } } void core2_decode_model(struct mce_event *e) { uint64_t status = e->status; decode_bitfield(e, status, p6_shared_status); decode_bitfield(e, status, core2_status); /* Normally reserved, but let's parse anyways: */ decode_numfield(e, status, p6old_status_numbers); } void p6old_decode_model(struct mce_event *e) { uint64_t status = e->status; decode_bitfield(e, status, p6_shared_status); decode_bitfield(e, status, p6old_status); decode_numfield(e, status, p6old_status_numbers); } 
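`p4_decode_model()` above is table driven: it masks MCi_STATUS down to bits 16..31 and appends the text of every `p4_model[]` entry whose bit is set. The following is a minimal stand-alone sketch of that pattern for illustration only, not rasdaemon code. It assumes a caller-supplied `char` buffer in place of rasdaemon's `mce_snprintf()` and `struct mce_event`, and the names `bit_desc`, `p4_bits` and `sketch_decode_bits()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-alone copy of p4_model[]: MCi_STATUS bit -> text */
struct bit_desc {
	unsigned int bit;
	const char *str;
};

static const struct bit_desc p4_bits[] = {
	{16, "FSB address parity"},
	{17, "Response hard fail"},
	{18, "Response parity"},
	{19, "PIC and FSB data parity"},
	{20, "Invalid PIC request(Signature=0xF04H)"},
	{21, "Pad state machine"},
	{22, "Pad strobe glitch"},
	{23, "Pad address glitch"},
};

#define N_P4_BITS (sizeof(p4_bits) / sizeof(p4_bits[0]))

/*
 * Write a "; "-separated list of the descriptions whose bit is set in
 * status into buf (NUL-terminated whenever len > 0, truncated if buf is
 * too small). Returns how many table bits were set, even if truncated.
 */
static unsigned int sketch_decode_bits(uint64_t status, char *buf, size_t len)
{
	unsigned int i, hits = 0;
	size_t used = 0;

	if (len > 0)
		buf[0] = '\0';

	for (i = 0; i < N_P4_BITS; i++) {
		int n;

		if (!(status & (1ULL << p4_bits[i].bit)))
			continue;

		/* Append only while there is room for at least one more char */
		if (len > 0 && used + 1 < len) {
			n = snprintf(buf + used, len - used, "%s%s",
				     hits ? "; " : "", p4_bits[i].str);
			/* On truncation snprintf() filled the buffer up to len - 1 */
			if (n > 0)
				used += ((size_t)n < len - used) ?
					(size_t)n : len - used - 1;
		}
		hits++;
	}
	return hits;
}
```

For example, a status word with bits 16 and 22 set should produce `FSB address parity; Pad strobe glitch` and a return value of 2. Counting matches separately from appending keeps the return value meaningful even when the output buffer is too small.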
07070100000031000081A400000000000000000000000165C04BE400001420000000000000000000000000000000000000002E00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-sb.c/* * The code below came from Andi Kleen/Intel/SuSe mcelog code, * released under GNU Public General License, v.2 * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* See IA32 SDM Vol3B Table 16.4.1 */ static char *pcu_1[] = { [0] = "No error", [1] = "Non_IMem_Sel", [2] = "I_Parity_Error", [3] = "Bad_OpCode", [4] = "I_Stack_Underflow", [5] = "I_Stack_Overflow", [6] = "D_Stack_Underflow", [7] = "D_Stack_Overflow", [8] = "Non-DMem_Sel", [9] = "D_Parity_Error" }; static char *pcu_2[] = { [0x00] = "No Error", [0x0D] = "MC_IMC_FORCE_SR_S3_TIMEOUT", [0x0E] = "MC_MC_CPD_UNCPD_ST_TIMEOUT", [0x0F] = "MC_PKGS_SAFE_WP_TIMEOUT", [0x43] = "MC_PECI_MAILBOX_QUIESCE_TIMEOUT", [0x5C] = "MC_MORE_THAN_ONE_LT_AGENT", [0x60] = "MC_INVALID_PKGS_REQ_PCH", [0x61] = "MC_INVALID_PKGS_REQ_QPI", [0x62] = "MC_INVALID_PKGS_RES_QPI", [0x63] = "MC_INVALID_PKGC_RES_PCH", [0x64] = "MC_INVALID_PKG_STATE_CONFIG", [0x70] = "MC_WATCHDG_TIMEOUT_PKGC_SLAVE", [0x71] = "MC_WATCHDG_TIMEOUT_PKGC_MASTER", [0x72] = "MC_WATCHDG_TIMEOUT_PKGS_MASTER", [0x7A] = "MC_HA_FAILSTS_CHANGE_DETECTED", [0x81] = 
"MC_RECOVERABLE_DIE_THERMAL_TOO_HOT",
};

static struct field pcu_mc4[] = {
	FIELD(16, pcu_1),
	FIELD(24, pcu_2),
	{}
};

static char *memctrl_1[] = {
	[0x001] = "Address parity error",
	[0x002] = "HA Wrt buffer Data parity error",
	[0x004] = "HA Wrt byte enable parity error",
	[0x008] = "Corrected patrol scrub error",
	[0x010] = "Uncorrected patrol scrub error",
	[0x020] = "Corrected spare error",
	[0x040] = "Uncorrected spare error",
};

static struct field memctrl_mc8[] = {
	FIELD(16, memctrl_1),
	{}
};

void snb_decode_model(struct ras_events *ras, struct mce_event *e)
{
	struct mce_priv *mce = ras->mce_priv;
	uint32_t mca = e->status & 0xffff;
	unsigned int rank0 = -1, rank1 = -1, chan;

	switch (e->bank) {
	case 4:
		decode_bitfield(e, e->status, pcu_mc4);
		break;
	case 6:
	case 7:
		/* MCACOD already decoded */
		if (mce->cputype == CPU_SANDY_BRIDGE_EP)
			mce_snprintf(e->bank_name, "QPI");
		break;
	case 8:
	case 9:
	case 10:
	case 11:
		/* Wprintf("MemCtrl: "); */
		decode_bitfield(e, e->status, memctrl_mc8);
		break;
	}

	/*
	 * Memory error specific code. Returns if the error is not an MC one
	 */

	/* Check if the error is at the memory controller */
	if ((mca >> 7) != 1)
		return;

	/* Ignore unless this is a corrected extended error from an iMC bank */
	if (e->bank < 8 || e->bank > 11 || (e->status & MCI_STATUS_UC) ||
	    !test_prefix(7, e->status & 0xefff))
		return;

	/*
	 * Parse the reported channel and ranks
	 */

	chan = EXTRACT(e->status, 0, 3);
	if (chan == 0xf)
		return;

	mce_snprintf(e->mc_location, "memory_channel=%d", chan);

	if (EXTRACT(e->misc, 62, 62))
		rank0 = EXTRACT(e->misc, 46, 50);

	if (EXTRACT(e->misc, 63, 63))
		rank1 = EXTRACT(e->misc, 51, 55);

	/*
	 * FIXME: The conversion from rank to dimm requires parsing the
	 * DMI tables and calling failrank2dimm().
	 */
	/* rank0/rank1 are unsigned, so -1 (UINT_MAX) marks "not reported" */
	if (rank0 != -1 && rank1 != -1)
		mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1);
	else if (rank0 != -1)
		mce_snprintf(e->mc_location, "rank=%d", rank0);
	else if (rank1 != -1)
		mce_snprintf(e->mc_location, "rank=%d", rank1);
}

#if 0
/*
 * Sandy Bridge EP and EP4S processors (family 6, model 45) support additional
 * logging for corrected errors in the integrated memory controller (IMC)
 * banks. The mode is off by default, but can be enabled by setting the
 * "MemError Log Enable" bit in MSR_ERROR_CONTROL (MSR 0x17f).
 * The documentation in the August 2012 edition of Intel's Software developer
 * manual has some minor errors because the wrong version of table 16-16
 * "Intel IMC MC Error Codes for IA32_MCi_MISC (i= 8, 11)" was included.
 * Corrections are:
 *  Bit 62 is the "VALID" bit for the "first-device" bits in MISC and STATUS
 *  Bit 63 is the "VALID" bit for the "second-device" bits in MISC
 *  Bits 58:56 and 61:59 should be marked as "reserved".
 * There should also be a footnote explaining how the "failing rank" fields
 * can be converted to a DIMM number within a channel for systems with either
 * two or three DIMMs per channel.
 */
static int failrank2dimm(unsigned int failrank, int socket, int channel)
{
	switch (failrank) {
	case 0:
	case 1:
	case 2:
	case 3:
		return 0;
	case 4:
	case 5:
		return 1;
	case 6:
	case 7:
		if (get_memdimm(socket, channel, 2, 0))
			return 2;
		else
			return 1;
	}
	return -1;
}
#endif
07070100000032000081A400000000000000000000000165C04BE400001FB5000000000000000000000000000000000000003800000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-skylake-xeon.c/*
 * The code below came from Tony Luck's mcelog code,
 * released under GNU Public General License, v.2
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <string.h> #include <stdio.h> #include "ras-mce-handler.h" #include "bitfield.h" /* See IA32 SDM Vol3B Table 16-27 */ static char *pcu_1[] = { [0x00] = "No Error", [0x0d] = "MCA_DMI_TRAINING_TIMEOUT", [0x0f] = "MCA_DMI_CPU_RESET_ACK_TIMEOUT", [0x10] = "MCA_MORE_THAN_ONE_LT_AGENT", [0x1e] = "MCA_BIOS_RST_CPL_INVALID_SEQ", [0x1f] = "MCA_BIOS_INVALID_PKG_STATE_CONFIG", [0x25] = "MCA_MESSAGE_CHANNEL_TIMEOUT", [0x27] = "MCA_MSGCH_PMREQ_CMP_TIMEOUT", [0x30] = "MCA_PKGC_DIRECT_WAKE_RING_TIMEOUT", [0x31] = "MCA_PKGC_INVALID_RSP_PCH", [0x33] = "MCA_PKGC_WATCHDOG_HANG_CBZ_DOWN", [0x34] = "MCA_PKGC_WATCHDOG_HANG_CBZ_UP", [0x38] = "MCA_PKGC_WATCHDOG_HANG_C3_UP_SF", [0x40] = "MCA_SVID_VCCIN_VR_ICC_MAX_FAILURE", [0x41] = "MCA_SVID_COMMAND_TIMEOUT", [0x42] = "MCA_SVID_VCCIN_VR_VOUT_FAILURE", [0x43] = "MCA_SVID_CPU_VR_CAPABILITY_ERROR", [0x44] = "MCA_SVID_CRITICAL_VR_FAILED", [0x45] = "MCA_SVID_SA_ITD_ERROR", [0x46] = "MCA_SVID_READ_REG_FAILED", [0x47] = "MCA_SVID_WRITE_REG_FAILED", [0x48] = "MCA_SVID_PKGC_INIT_FAILED", [0x49] = "MCA_SVID_PKGC_CONFIG_FAILED", [0x4a] = "MCA_SVID_PKGC_REQUEST_FAILED", [0x4b] = "MCA_SVID_IMON_REQUEST_FAILED", [0x4c] = "MCA_SVID_ALERT_REQUEST_FAILED", [0x4d] = "MCA_SVID_MCP_VR_ABSENT_OR_RAMP_ERROR", [0x4e] = "MCA_SVID_UNEXPECTED_MCP_VR_DETECTED", [0x51] = "MCA_FIVR_CATAS_OVERVOL_FAULT", [0x52] = "MCA_FIVR_CATAS_OVERCUR_FAULT", [0x58] = "MCA_WATCHDOG_TIMEOUT_PKGC_SLAVE", [0x59] = "MCA_WATCHDOG_TIMEOUT_PKGC_MASTER", [0x5a] = "MCA_WATCHDOG_TIMEOUT_PKGS_MASTER", [0x61] = 
"MCA_PKGS_CPD_UNCPD_TIMEOUT", [0x63] = "MCA_PKGS_INVALID_REQ_PCH", [0x64] = "MCA_PKGS_INVALID_REQ_INTERNAL", [0x65] = "MCA_PKGS_INVALID_RSP_INTERNAL", [0x6b] = "MCA_PKGS_SMBUS_VPP_PAUSE_TIMEOUT", [0x81] = "MCA_RECOVERABLE_DIE_THERMAL_TOO_HOT", }; static struct field pcu_mc4[] = { FIELD(24, pcu_1), {} }; /* See IA32 SDM Vol3B Table 16-28 */ static char *upi[] = { [0x00] = "UC Phy Initialization Failure", [0x01] = "UC Phy detected drift buffer alarm", [0x02] = "UC Phy detected latency buffer rollover", [0x10] = "UC LL Rx detected CRC error: unsuccessful LLR: entered abort state", [0x11] = "UC LL Rx unsupported or undefined packet", [0x12] = "UC LL or Phy control error", [0x13] = "UC LL Rx parameter exchange exception", [0x1F] = "UC LL detected control error from the link-mesh interface", [0x20] = "COR Phy initialization abort", [0x21] = "COR Phy reset", [0x22] = "COR Phy lane failure, recovery in x8 width", [0x23] = "COR Phy L0c error corrected without Phy reset", [0x24] = "COR Phy L0c error triggering Phy Reset", [0x25] = "COR Phy L0p exit error corrected with Phy reset", [0x30] = "COR LL Rx detected CRC error - successful LLR without Phy Reinit", [0x31] = "COR LL Rx detected CRC error - successful LLR with Phy Reinit", }; static struct field upi_mc[] = { FIELD(16, upi), {} }; /* These apply to MSCOD 0x12 "UC LL or Phy control error" */ static struct field upi_0x12[] = { SBITFIELD(22, "Phy Control Error"), SBITFIELD(23, "Unexpected Retry.Ack flit"), SBITFIELD(24, "Unexpected Retry.Req flit"), SBITFIELD(25, "RF parity error"), SBITFIELD(26, "Routeback Table error"), SBITFIELD(27, "unexpected Tx Protocol flit (EOP, Header or Data)"), SBITFIELD(28, "Rx Header-or-Credit BGF credit overflow/underflow"), SBITFIELD(29, "Link Layer Reset still in progress when Phy enters L0"), SBITFIELD(30, "Link Layer reset initiated while protocol traffic not idle"), SBITFIELD(31, "Link Layer Tx Parity Error"), {} }; /* See IA32 SDM Vol3B Table 16-29 */ static struct field mc_bits[] = { 
SBITFIELD(16, "Address parity error"), SBITFIELD(17, "HA write data parity error"), SBITFIELD(18, "HA write byte enable parity error"), SBITFIELD(19, "Corrected patrol scrub error"), SBITFIELD(20, "Uncorrected patrol scrub error"), SBITFIELD(21, "Corrected spare error"), SBITFIELD(22, "Uncorrected spare error"), SBITFIELD(23, "Any HA read error"), SBITFIELD(24, "WDB read parity error"), SBITFIELD(25, "DDR4 command address parity error"), SBITFIELD(26, "Uncorrected address parity error"), {} }; static char *mc_0x8xx[] = { [0x0] = "Unrecognized request type", [0x1] = "Read response to an invalid scoreboard entry", [0x2] = "Unexpected read response", [0x3] = "DDR4 completion to an invalid scoreboard entry", [0x4] = "Completion to an invalid scoreboard entry", [0x5] = "Completion FIFO overflow", [0x6] = "Correctable parity error", [0x7] = "Uncorrectable error", [0x8] = "Interrupt received while outstanding interrupt was not ACKed", [0x9] = "ERID FIFO overflow", [0xa] = "Error on Write credits", [0xb] = "Error on Read credits", [0xc] = "Scheduler error", [0xd] = "Error event", }; static struct field memctrl_mc13[] = { FIELD(16, mc_0x8xx), {} }; /* See IA32 SDM Vol3B Table 16-30 */ static struct field m2m[] = { SBITFIELD(16, "MscodDataRdErr"), SBITFIELD(17, "Reserved"), SBITFIELD(18, "MscodPtlWrErr"), SBITFIELD(19, "MscodFullWrErr"), SBITFIELD(20, "MscodBgfErr"), SBITFIELD(21, "MscodTimeout"), SBITFIELD(22, "MscodParErr"), SBITFIELD(23, "MscodBucket1Err"), {} }; void skylake_s_decode_model(struct ras_events *ras, struct mce_event *e) { uint64_t status = e->status; uint32_t mca = status & 0xffff; unsigned int rank0 = -1, rank1 = -1, chan; switch (e->bank) { case 4: switch (EXTRACT(status, 0, 15) & ~(1ull << 12)) { case 0x402: case 0x403: mce_snprintf(e->mcastatus_msg, "Internal errors "); break; case 0x406: mce_snprintf(e->mcastatus_msg, "Intel TXT errors "); break; case 0x407: mce_snprintf(e->mcastatus_msg, "Other UBOX Internal errors "); break; } if (EXTRACT(status, 16, 
19))
			mce_snprintf(e->mcastatus_msg, "PCU internal error ");
		decode_bitfield(e, status, pcu_mc4);
		break;
	case 5:
	case 12:
	case 19:
		mce_snprintf(e->mcastatus_msg, "UPI: ");
		decode_bitfield(e, status, upi_mc);
		if (EXTRACT(status, 16, 21) == 0x12)
			decode_bitfield(e, status, upi_0x12);
		break;
	case 7:
	case 8:
		mce_snprintf(e->mcastatus_msg, "M2M: ");
		decode_bitfield(e, status, m2m);
		break;
	case 13:
	case 14:
	case 15:
	case 16:
	case 17:
	case 18:
		mce_snprintf(e->mcastatus_msg, "MemCtrl: ");
		if (EXTRACT(status, 27, 27))
			decode_bitfield(e, status, memctrl_mc13);
		else
			decode_bitfield(e, status, mc_bits);
		break;
	}

	/*
	 * Memory error specific code. Returns if the error is not an MC one
	 */

	/* Check if the error is at the memory controller */
	if ((mca >> 7) != 1)
		return;

	/* Ignore unless this is a corrected extended error from an iMC bank */
	if (e->bank < 13 || e->bank > 18 || (status & MCI_STATUS_UC) ||
	    !test_prefix(7, status & 0xefff))
		return;

	/*
	 * Parse the reported channel and ranks
	 */

	chan = EXTRACT(status, 0, 3);
	if (chan == 0xf)
		return;

	mce_snprintf(e->mc_location, "memory_channel=%d", chan);

	if (EXTRACT(e->misc, 62, 62)) {
		rank0 = EXTRACT(e->misc, 46, 50);
		if (EXTRACT(e->misc, 63, 63))
			rank1 = EXTRACT(e->misc, 51, 55);
	}

	/*
	 * FIXME: The conversion from rank to dimm requires parsing the
	 * DMI tables and calling failrank2dimm().
	 */
	if (rank0 != -1 && rank1 != -1)
		mce_snprintf(e->mc_location, "ranks=%d and %d", rank0, rank1);
	else if (rank0 != -1)
		mce_snprintf(e->mc_location, "rank=%d", rank0);
}
07070100000033000081A400000000000000000000000165C04BE400001164000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel-tulsa.c/*
 * The code below came from Andi Kleen/Intel/SuSe mcelog code,
 * released under GNU Public General License, v.2
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

#include <string.h>
#include <stdio.h>
#include "ras-mce-handler.h"
#include "bitfield.h"

/* See IA32 SDM Vol3B Appendix E.4.1 ff */

static struct numfield corr_numbers[] = {
	NUMBER(32, 39, "Corrected events"),
	{}
};

static struct numfield ecc_numbers[] = {
	HEXNUMBER(44, 51, "ECC syndrome"),
	{},
};

static struct field tls_bus_status[] = {
	SBITFIELD(16, "Parity error detected during FSB request phase"),
	SBITFIELD(17, "Parity error detected on Core 0 request's address field"),
	SBITFIELD(18, "Parity error detected on Core 1 request's address field"),
	FIELD_NULL(19),
	SBITFIELD(20, "Parity error on FSB response field detected"),
	SBITFIELD(21, "FSB data parity error on inbound data detected"),
	SBITFIELD(22, "Data parity error on data received from Core 0 detected"),
	SBITFIELD(23, "Data parity error on data received from Core 1 detected"),
SBITFIELD(24, "Detected an Enhanced Defer parity error phase A or phase B"), SBITFIELD(25, "Data ECC event to error on inbound data correctable or uncorrectable"), SBITFIELD(26, "Pad logic detected a data strobe glitch or sequencing error"), SBITFIELD(27, "Pad logic detected a request strobe glitch or sequencing error"), FIELD_NULL(28), FIELD_NULL(31), {} }; static char *tls_front_error[0xf] = { [0x1] = "Inclusion error from core 0", [0x2] = "Inclusion error from core 1", [0x3] = "Write Exclusive error from core 0", [0x4] = "Write Exclusive error from core 1", [0x5] = "Inclusion error from FSB", [0x6] = "SNP stall error from FSB", [0x7] = "Write stall error from FSB", [0x8] = "FSB Arbiter Timeout error", [0x9] = "CBC OOD Queue Underflow/overflow", }; static char *tls_int_error[0xf] = { [0x1] = "Enhanced Intel SpeedStep Technology TM1-TM2 Error", [0x2] = "Internal timeout error", [0x3] = "Internal timeout error", [0x4] = "Intel Cache Safe Technology Queue full error\n" "or disabled ways in a set overflow", }; struct field tls_int_status[] = { FIELD(8, tls_int_error), {} }; struct field tls_front_status[] = { FIELD(0, tls_front_error), {} }; struct field tls_cecc[] = { SBITFIELD(0, "Correctable ECC event on outgoing FSB data"), SBITFIELD(1, "Correctable ECC event on outgoing core 0 data"), SBITFIELD(2, "Correctable ECC event on outgoing core 1 data"), {} }; struct field tls_uecc[] = { SBITFIELD(0, "Uncorrectable ECC event on outgoing FSB data"), SBITFIELD(1, "Uncorrectable ECC event on outgoing core 0 data"), SBITFIELD(2, "Uncorrectable ECC event on outgoing core 1 data"), {} }; static void tulsa_decode_bus(struct mce_event *e, uint64_t status) { decode_bitfield(e, status, tls_bus_status); } static void tulsa_decode_internal(struct mce_event *e, uint64_t status) { uint32_t mca = (status >> 16) & 0xffff; if ((mca & 0xfff0) == 0) decode_bitfield(e, mca, tls_front_status); else if ((mca & 0xf0ff) == 0) decode_bitfield(e, mca, tls_int_status); else if ((mca & 0xfff0) == 
0xc000)
		decode_bitfield(e, mca, tls_cecc);
	else if ((mca & 0xfff0) == 0xe000)
		decode_bitfield(e, mca, tls_uecc);
}

void tulsa_decode_model(struct mce_event *e)
{
	decode_numfield(e, e->status, corr_numbers);
	if (e->status & (1ULL << 52))
		decode_numfield(e, e->status, ecc_numbers);

	/* MISC register not documented in the SDM. Let's just dump hex for now. */
	if (e->status & MCI_STATUS_MISCV)
		mce_snprintf(e->mcistatus_msg, "MISC format %llx value %llx\n",
			     (unsigned long long)((e->status >> 40) & 3),
			     (unsigned long long)e->misc);

	if ((e->status & 0xffff) == 0xe0f)
		tulsa_decode_bus(e, e->status);
	else if ((e->status & 0xffff) == (1 << 10))
		tulsa_decode_internal(e, e->status);
}
07070100000034000081A400000000000000000000000165C04BE400002F5E000000000000000000000000000000000000002B00000000rasdaemon-0.8.0.49.git+f9cb13b/mce-intel.c/*
 * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
 *
 * The code below was adapted from Andi Kleen/Intel/SuSe mcelog code,
 * released under GNU Public General License, v.2
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#include "ras-logger.h"
#include "ras-mce-handler.h"
#include "bitfield.h"

#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
#define MCE_TIMEOUT_BANK	(MCE_EXTENDED_BANK + 90)

#define TLB_LL_MASK		0x3	/* bit 0, bit 1 */
#define TLB_LL_SHIFT		0x0
#define TLB_TT_MASK		0xc	/* bit 2, bit 3 */
#define TLB_TT_SHIFT		0x2

#define CACHE_LL_MASK		0x3	/* bit 0, bit 1 */
#define CACHE_LL_SHIFT		0x0
#define CACHE_TT_MASK		0xc	/* bit 2, bit 3 */
#define CACHE_TT_SHIFT		0x2
#define CACHE_RRRR_MASK		0xF0	/* bit 4, bit 5, bit 6, bit 7 */
#define CACHE_RRRR_SHIFT	0x4

#define BUS_LL_MASK		0x3	/* bit 0, bit 1 */
#define BUS_LL_SHIFT		0x0
#define BUS_II_MASK		0xc	/* bit 2, bit 3 */
#define BUS_II_SHIFT		0x2
#define BUS_RRRR_MASK		0xF0	/* bit 4, bit 5, bit 6, bit 7 */
#define BUS_RRRR_SHIFT		0x4
#define BUS_T_MASK		0x100	/* bit 8 */
#define BUS_T_SHIFT		0x8
#define BUS_PP_MASK		0x600	/* bit 9, bit 10 */
#define BUS_PP_SHIFT		0x9

#define MCG_TES_P		BIT_ULL(11)	/* Yellow bit cache threshold supported */

static char *TT[] = {
	"Instruction",
	"Data",
	"Generic",
	"Unknown"
};

static char *LL[] = {
	"Level-0",
	"Level-1",
	"Level-2",
	"Level-3"
};

static struct {
	uint8_t value;
	char *str;
} RRRR[] = {
	{0, "Generic"},
	{1, "Read"},
	{2, "Write"},
	{3, "Data-Read"},
	{4, "Data-Write"},
	{5, "Instruction-Fetch"},
	{6, "Prefetch"},
	{7, "Eviction"},
	{8, "Snoop"}
};

static char *PP[] = {
	"Local-CPU-originated-request",
	"Responded-to-request",
	"Observed-error-as-third-party",
	"Generic"
};

static char *T[] = {
	"Request-did-not-timeout",
	"Request-timed-out"
};

static char *II[] = {
	"Memory-access",
	"Reserved",
	"IO",
	"Other-transaction"
};

static char *mca_msg[] = {
	[0] = "No Error",
	[1] =
"Unclassified", [2] = "Microcode ROM parity error", [3] = "External error", [4] = "FRC error", [5] = "Internal parity error", [6] = "SMM Handler Code Access Violation", }; static char *tracking_msg[] = { [1] = "green", [2] = "yellow", [3] = "res3" }; static const char *arstate[4] = { [0] = "UCNA", [1] = "AR", [2] = "SRAO", [3] = "SRAR" }; static char *mmm_mnemonic[] = { "GEN", "RD", "WR", "AC", "MS", "RES5", "RES6", "RES7" }; static char *mmm_desc[] = { "Generic undefined request", "Memory read error", "Memory write error", "Address/Command error", "Memory scrubbing error", "Reserved 5", "Reserved 6", "Reserved 7" }; static void decode_memory_controller(struct mce_event *e, uint32_t status) { char channel[30]; if ((status & 0xf) == 0xf) sprintf(channel, "unspecified"); else sprintf(channel, "%u", status & 0xf); mce_snprintf(e->error_msg, "MEMORY CONTROLLER %s_CHANNEL%s_ERR", mmm_mnemonic[(status >> 4) & 7], channel); mce_snprintf(e->error_msg, "Transaction: %s", mmm_desc[(status >> 4) & 7]); } static void decode_termal_bank(struct mce_event *e) { if (e->status & 1) { mce_snprintf(e->mcgstatus_msg, "Processor %d heated above trip temperature. Throttling enabled.", e->cpu); mce_snprintf(e->user_action, "Please check your system cooling. Performance will be impacted"); } else { mce_snprintf(e->error_msg, "Processor %d below trip temperature. 
Throttling disabled", e->cpu); } } static void decode_mcg(struct mce_event *e) { uint64_t mcgstatus = e->mcgstatus; mce_snprintf(e->mcgstatus_msg, "mcgstatus=%lld", (long long)e->mcgstatus); if (mcgstatus & MCG_STATUS_RIPV) mce_snprintf(e->mcgstatus_msg, "RIPV"); if (mcgstatus & MCG_STATUS_EIPV) mce_snprintf(e->mcgstatus_msg, "EIPV"); if (mcgstatus & MCG_STATUS_MCIP) mce_snprintf(e->mcgstatus_msg, "MCIP"); if (mcgstatus & MCG_STATUS_LMCE) mce_snprintf(e->mcgstatus_msg, "LMCE"); } static void bank_name(struct mce_event *e) { char *buf = e->bank_name; switch (e->bank) { case MCE_THERMAL_BANK: strcpy(buf, "THERMAL EVENT"); break; case MCE_TIMEOUT_BANK: strcpy(buf, "Timeout waiting for exception on other CPUs"); break; default: break; } } static char *get_RRRR_str(uint8_t rrrr) { unsigned int i; for (i = 0; i < ARRAY_SIZE(RRRR); i++) { if (RRRR[i].value == rrrr) { return RRRR[i].str; } } return "UNKNOWN"; } #define decode_attr(arr, val) ({ \ char *__str; \ if ((unsigned int)(val) >= ARRAY_SIZE(arr)) \ __str = "UNKNOWN"; \ else \ __str = (arr)[val]; \ __str; \ }) static void decode_mca(struct mce_event *e, uint64_t track, int *ismemerr) { uint32_t mca = e->status & 0xffffL; if (mca & (1UL << 12)) { mce_snprintf(e->mcastatus_msg, "corrected filtering (some unreported errors in same region)"); mca &= ~(1UL << 12); } if (mca < ARRAY_SIZE(mca_msg)) { mce_snprintf(e->mcastatus_msg, "%s", mca_msg[mca]); return; } if ((mca >> 2) == 3) { mce_snprintf(e->mcastatus_msg, "%s Generic memory hierarchy error", decode_attr(LL, mca & 3)); } else if (test_prefix(4, mca)) { mce_snprintf(e->mcastatus_msg, "%s TLB %s Error", decode_attr(TT, (mca & TLB_TT_MASK) >> TLB_TT_SHIFT), decode_attr(LL, (mca & TLB_LL_MASK) >> TLB_LL_SHIFT)); } else if (test_prefix(8, mca)) { unsigned int typenum = (mca & CACHE_TT_MASK) >> CACHE_TT_SHIFT; unsigned int levelnum = (mca & CACHE_LL_MASK) >> CACHE_LL_SHIFT; char *type = decode_attr(TT, typenum); char *level = decode_attr(LL, levelnum); 
mce_snprintf(e->mcastatus_msg, "%s CACHE %s %s Error", type, level, get_RRRR_str((mca & CACHE_RRRR_MASK) >> CACHE_RRRR_SHIFT)); #if 0 /* FIXME: We shouldn't mix parsing with actions */ if (track == 2) run_yellow_trigger(e->cpu, typenum, levelnum, type, level, e->socket); #endif } else if (test_prefix(10, mca)) { if (mca == 0x400) mce_snprintf(e->mcastatus_msg, "Internal Timer error"); else mce_snprintf(e->mcastatus_msg, "Internal unclassified error: %x", mca); } else if (test_prefix(11, mca)) { mce_snprintf(e->mcastatus_msg, "BUS %s %s %s %s %s Error", decode_attr(LL, (mca & BUS_LL_MASK) >> BUS_LL_SHIFT), decode_attr(PP, (mca & BUS_PP_MASK) >> BUS_PP_SHIFT), get_RRRR_str((mca & BUS_RRRR_MASK) >> BUS_RRRR_SHIFT), decode_attr(II, (mca & BUS_II_MASK) >> BUS_II_SHIFT), decode_attr(T, (mca & BUS_T_MASK) >> BUS_T_SHIFT)); } else if (test_prefix(7, mca)) { decode_memory_controller(e, mca); *ismemerr = 1; } else mce_snprintf(e->mcastatus_msg, "Unknown Error %x", mca); } static void decode_tracking(struct mce_event *e, uint64_t track) { if (track == 1) mce_snprintf(e->user_action, "Large number of corrected cache errors. 
System operating, but might lead to uncorrected errors soon"); if (track) mce_snprintf(e->mcistatus_msg, "Threshold-based error status: %s", tracking_msg[track]); } static void decode_mci(struct mce_event *e, int *ismemerr) { uint64_t track = 0; if (!(e->status & MCI_STATUS_VAL)) mce_snprintf(e->mcistatus_msg, "MCE_INVALID"); if (e->status & MCI_STATUS_OVER) mce_snprintf(e->mcistatus_msg, "Error_overflow"); /* FIXME: convert into severity */ if (e->status & MCI_STATUS_UC) mce_snprintf(e->mcistatus_msg, "Uncorrected_error"); else mce_snprintf(e->mcistatus_msg, "Corrected_error"); if (e->status & MCI_STATUS_EN) mce_snprintf(e->mcistatus_msg, "Error_enabled"); if (e->status & MCI_STATUS_PCC) mce_snprintf(e->mcistatus_msg, "Processor_context_corrupt"); if (e->status & (MCI_STATUS_S | MCI_STATUS_AR)) mce_snprintf(e->mcistatus_msg, "%s", arstate[(e->status >> 55) & 3]); if ((e->mcgcap == 0 || (e->mcgcap & MCG_TES_P)) && !(e->status & MCI_STATUS_UC)) { track = (e->status >> 53) & 3; decode_tracking(e, track); } decode_mca(e, track, ismemerr); } int parse_intel_event(struct ras_events *ras, struct mce_event *e) { struct mce_priv *mce = ras->mce_priv; int ismemerr = 0; /* decode_mca() only sets this for memory controller errors */ bank_name(e); if (e->bank == MCE_THERMAL_BANK) { decode_termal_bank(e); return 0; } decode_mcg(e); decode_mci(e, &ismemerr); /* Check if the error is at the memory controller */ if (((e->status & 0xffff) >> 7) == 1) { unsigned int corr_err_cnt; corr_err_cnt = EXTRACT(e->status, 38, 52); mce_snprintf(e->mc_location, "n_errors=%u", corr_err_cnt); } if (test_prefix(11, (e->status & 0xffffL))) { switch (mce->cputype) { case CPU_P6OLD: p6old_decode_model(e); break; case CPU_DUNNINGTON: case CPU_CORE2: case CPU_NEHALEM: case CPU_XEON75XX: core2_decode_model(e); break; case CPU_TULSA: case CPU_P4: p4_decode_model(e); break; default: break; } } switch (mce->cputype) { case CPU_NEHALEM: nehalem_decode_model(e); break; case CPU_XEON75XX: xeon75xx_decode_model(e); break; case CPU_DUNNINGTON: dunnington_decode_model(e); 
break; case CPU_TULSA: tulsa_decode_model(e); break; case CPU_SANDY_BRIDGE: case CPU_SANDY_BRIDGE_EP: snb_decode_model(ras, e); break; case CPU_IVY_BRIDGE_EPEX: ivb_decode_model(ras, e); break; case CPU_HASWELL_EPEX: hsw_decode_model(ras, e); break; case CPU_KNIGHTS_LANDING: case CPU_KNIGHTS_MILL: knl_decode_model(ras, e); break; case CPU_BROADWELL_DE: broadwell_de_decode_model(ras, e); break; case CPU_BROADWELL_EPEX: broadwell_epex_decode_model(ras, e); break; case CPU_SKYLAKE_XEON: skylake_s_decode_model(ras, e); break; case CPU_ICELAKE_XEON: case CPU_ICELAKE_DE: case CPU_TREMONT_D: case CPU_SAPPHIRERAPIDS: case CPU_EMERALDRAPIDS: i10nm_decode_model(mce->cputype, ras, e); break; default: break; } return 0; } /* * Code to enable iMC logs */ static int domsr(int cpu, int msr, int bit) { char fpath[32]; unsigned long long data; int fd; sprintf(fpath, "/dev/cpu/%d/msr", cpu); fd = open(fpath, O_RDWR); if (fd == -1) { switch (errno) { case ENOENT: log(ALL, LOG_ERR, "Warning: cpu %d offline? imc_log not set\n", cpu); return -EINVAL; default: log(ALL, LOG_ERR, "Cannot open %s to set imc_log\n", fpath); return -EINVAL; } } if (pread(fd, &data, sizeof(data), msr) != sizeof(data)) { log(ALL, LOG_ERR, "Cannot read MSR_ERROR_CONTROL from %s\n", fpath); goto err; } data |= bit; if (pwrite(fd, &data, sizeof(data), msr) != sizeof(data)) { log(ALL, LOG_ERR, "Cannot write MSR_ERROR_CONTROL to %s\n", fpath); goto err; } /* Read the MSR back to verify that the bit was actually set */ if (pread(fd, &data, sizeof(data), msr) != sizeof(data)) { log(ALL, LOG_ERR, "Cannot re-read MSR_ERROR_CONTROL from %s\n", fpath); goto err; } if ((data & bit) == 0) { log(ALL, LOG_ERR, "Failed to set imc_log on cpu %d\n", cpu); goto err; } close(fd); return 0; err: /* Release the MSR device on every failure path */ close(fd); return -EINVAL; } int set_intel_imc_log(enum cputype cputype, unsigned int ncpus) { int cpu, msr, bit, rc; switch (cputype) { case CPU_SANDY_BRIDGE_EP: case CPU_IVY_BRIDGE_EPEX: case CPU_HASWELL_EPEX: case CPU_KNIGHTS_LANDING: case CPU_KNIGHTS_MILL: msr = 0x17f; /* MSR_ERROR_CONTROL */ bit = 0x2; /* 
MemError Log Enable */ break; default: return 0; } for (cpu = 0; cpu < ncpus; cpu++) { rc = domsr(cpu, msr, bit); if (rc) return rc; } return 0; } 07070100000035000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000002400000000rasdaemon-0.8.0.49.git+f9cb13b/misc07070100000036000081A400000000000000000000000165C04BE4000000CA000000000000000000000000000000000000003A00000000rasdaemon-0.8.0.49.git+f9cb13b/misc/ras-mc-ctl.service.in[Unit] Description=Initialize EDAC v3.0.0 Drivers For Machine Hardware [Service] Type=oneshot ExecStart=@sbindir@/ras-mc-ctl --register-labels RemainAfterExit=yes [Install] WantedBy=multi-user.target 07070100000037000081A400000000000000000000000165C04BE400000654000000000000000000000000000000000000003200000000rasdaemon-0.8.0.49.git+f9cb13b/misc/rasdaemon.env# Page Isolation # Note: run-time reconfiguration is unsupported; restart the service to apply changes. # Note: this file should be installed at /etc/sysconfig/rasdaemon # Specify the threshold for isolating faulty pages. # # Format: # [0-9]+[unit] # Note: make sure each value matches this format; for invalid input, rasdaemon falls back to the default value. # # Supported units: # PAGE_CE_REFRESH_CYCLE: D|d (day), H|h (hour), M|m (min); the default unit is hours # PAGE_CE_THRESHOLD: K|k (x1000), M|m (x1000k); no multiplier by default # # These two settings have no effect when PAGE_CE_ACTION is "off". PAGE_CE_REFRESH_CYCLE="24h" PAGE_CE_THRESHOLD="50" # Specify the action rasdaemon takes when a page exceeds the error threshold. # # off no action # account only account errors # soft try to soft-offline the page without killing any processes # This requires an up-to-date kernel and might not succeed. # hard try to hard-offline the page by killing processes # Requires an up-to-date kernel and might not succeed. # soft-then-hard First try to soft-offline, then try hard-offlining. # Note: the default offline choice is "soft". 
PAGE_CE_ACTION="soft" # CPU Online Fault Isolation # Whether to enable CPU online fault isolation (yes|no). CPU_ISOLATION_ENABLE="no" # Specify the threshold for the number of corrected errors (CEs). # # Format: # [0-9]+[unit] # # Supported units: # CPU_CE_THRESHOLD: no unit # CPU_ISOLATION_CYCLE: D|d (day), H|h (hour), M|m (minute), S|s (second); the default unit is seconds CPU_CE_THRESHOLD="18" CPU_ISOLATION_CYCLE="24h" # Prevent excessive isolation from causing an avalanche effect CPU_ISOLATION_LIMIT="10"07070100000038000081A400000000000000000000000165C04BE40000016C000000000000000000000000000000000000003900000000rasdaemon-0.8.0.49.git+f9cb13b/misc/rasdaemon.service.in[Unit] Description=RAS daemon to log the RAS events # only needed when not running in foreground (--foreground | -f) #After=syslog.target [Service] EnvironmentFile=@SYSCONFDEFDIR@/rasdaemon ExecStart=@sbindir@/rasdaemon -f -r ExecStartPost=@sbindir@/rasdaemon --enable ExecStop=@sbindir@/rasdaemon --disable Restart=on-abort [Install] WantedBy=multi-user.target 07070100000039000081A400000000000000000000000165C04BE400001518000000000000000000000000000000000000003600000000rasdaemon-0.8.0.49.git+f9cb13b/misc/rasdaemon.spec.inName: @PACKAGE@ Version: @PACKAGE_VERSION@ Release: 1%{?dist} Summary: Utility to receive RAS error tracings Group: Applications/System License: GPLv2 URL: http://git.infradead.org/users/mchehab/rasdaemon.git Source0: http://www.infradead.org/~mchehab/rasdaemon/%{name}-%{version}.tar.bz2 ExcludeArch: s390 s390x BuildRequires: make BuildRequires: gcc BuildRequires: gettext-devel BuildRequires: perl-generators BuildRequires: sqlite-devel BuildRequires: systemd BuildRequires: libtraceevent-devel Provides: bundled(kernel-event-lib) Requires: hwdata Requires: perl-DBD-SQLite Requires: libtraceevent %ifarch %{ix86} x86_64 Requires: dmidecode %endif Requires(post): systemd Requires(preun): systemd Requires(postun): systemd %description %{name} is a RAS (Reliability, Availability and Serviceability) logging tool. 
It currently records memory errors, using the EDAC tracing events. EDAC is a set of Linux kernel drivers that detect ECC errors reported by the memory controllers of most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as Arm, also exist. This userspace component consists of an init script, which makes sure EDAC drivers and DIMM labels are loaded at system startup, as well as a utility for reporting current error counts from the EDAC sysfs files. %prep %setup -q %build %configure --enable-all --with-sysconfdefdir=%{_sysconfdir}/sysconfig make %{?_smp_mflags} %install make install DESTDIR=%{buildroot} install -D -p -m 0644 misc/rasdaemon.service %{buildroot}%{_unitdir}/rasdaemon.service install -D -p -m 0644 misc/ras-mc-ctl.service %{buildroot}%{_unitdir}/ras-mc-ctl.service install -D -p -m 0644 misc/rasdaemon.env %{buildroot}%{_sysconfdir}/sysconfig/%{name} rm INSTALL %{buildroot}/usr/include/*.h %files %doc AUTHORS ChangeLog COPYING README.md TODO %{_sbindir}/rasdaemon %{_sbindir}/ras-mc-ctl %{_mandir}/*/* %{_unitdir}/*.service %{_sysconfdir}/ras/dimm_labels.d %config(noreplace) %{_sysconfdir}/sysconfig/%{name} %changelog * Fri Apr 01 2022 Mauro Carvalho Chehab <mchehab+huawei@kernel.org> 0.6.7-3 - Fix sysconfdir issues * Wed May 26 2021 Mauro Carvalho Chehab <mchehab+huawei@kernel.org> 0.6.7-1 - Bump to version 0.6.7 with several fixes and additions * Tue Jul 21 2020 Mauro Carvalho Chehab <mchehab+huawei@kernel.org> 0.6.6-1 - Bump to version 0.6.6 with several fixes, new hip08 events and memory prediction analysis * Wed Nov 20 2019 Mauro Carvalho Chehab <mchehab+huawei@kernel.org> 0.6.5-1 - Bump to version 0.6.5 with several fixes and improved PCIe event recording * Thu Oct 10 2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.6.4-1 - Bump to version 0.6.4 with some DB changes for hip08 and some fixes * Fri Aug 23 2019 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.6.3-1 - Bump to version 0.6.3 with new ARM events, 
plus disk I/O and netlink support * Tue Aug 14 2018 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.6.2-1 - Bump to version 0.6.2 with improvements for PCIe AER parsing and in the ras-mc-ctl tool * Wed Apr 25 2018 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.6.1-1 - Bump to version 0.6.1 adding support for Skylake Xeon MSCOD, a bug fix and some new DELL labels * Sat Oct 14 2017 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.6.0-1 - Bump to version 0.6.0 adding support for Arm and Hisilicon events and updating Dell Skylake labels * Thu Jun 08 2017 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.5.9-1 - Bump to version 0.5.9 adding support for Knights Mill and updating DELL labels * Fri Apr 15 2016 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.5.8-1 - Bump to version 0.5.8 adding support for Broadwell EP/EX MSCOD and Broadwell DE MSCOD * Fri Feb 05 2016 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.5.7-1 - Bump to version 0.5.7 adding support for Broadwell-EP/EX and -DE and Knights Landing processors * Fri Jul 03 2015 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.5.6-1 - Bump to version 0.5.6 with support for LMCE and some fixes * Wed Jun 03 2015 Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 0.5.5-1 - Bump to version 0.5.5 with support for newer Intel platforms & some fixes * Fri Aug 15 2014 Mauro Carvalho Chehab <m.chehab@samsung.com> 0.5.4-1 - Bump to version 0.5.4 with some fixes, mainly for amd64 * Sun Aug 10 2014 Mauro Carvalho Chehab <m.chehab@samsung.com> 0.5.3-1 - Bump to version 0.5.3 and enable ABRT and ExtLog * Thu Apr 03 2014 Mauro Carvalho Chehab <m.chehab@samsung.com> 0.5.2-1 - fix and enable ABRT report support * Fri Mar 28 2014 Mauro Carvalho Chehab <m.chehab@samsung.com> 0.5.1-1 - Fix some issues in the service files and add some documentation for --record * Sun Feb 16 2014 Mauro Carvalho Chehab <m.chehab@samsung.com> 0.5.0-1 - Add experimental ABRT support * Tue Sep 10 2013 Mauro Carvalho Chehab 
<m.chehab@samsung.com> 0.4.2-1 - Fix ras-mc-ctl layout filling * Wed May 29 2013 Mauro Carvalho Chehab <mchehab@redhat.com> 0.4.1-2 - Fix the name of the perl-DBD-SQLite package * Wed May 29 2013 Mauro Carvalho Chehab <mchehab@redhat.com> 0.4.1-1 - Updated to version 0.4.1, which contains some bug fixes * Tue May 28 2013 Mauro Carvalho Chehab <mchehab@redhat.com> 0.4.0-1 - Updated to version 0.4.0 and added support for mce, aer and sqlite3 storage * Mon May 20 2013 Mauro Carvalho Chehab <mchehab@redhat.com> 0.3.0-1 - Package created 0707010000003A000081ED00000000000000000000000165C04BE4000001DF000000000000000000000000000000000000002A00000000rasdaemon-0.8.0.49.git+f9cb13b/new_ver.sh#!/bin/bash autoreconf && ./configure --enable-all VER="`perl -ne 'print "$1\n" if (/Version:\s*(.*)/);' misc/rasdaemon.spec`" if [ "x$VER" == "x" ]; then echo "Can't parse rasdaemon version" exit 1 fi echo echo "************************************************************************" echo "Building RPM files for version: $VER" echo "************************************************************************" echo git tag v$VER -f && make mock && make upload && git push 0707010000003B000081A400000000000000000000000165C04BE40000825F000000000000000000000000000000000000003500000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-ampere.c/* * Copyright (c) 2020, Ampere Computing LLC. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <stdbool.h> #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" #include "ras-non-standard-handler.h" #include "non-standard-ampere.h" /* Armv8 RAS-compliant Error Record (APEI and BMC Reporting), Payload Type 0 */ static const char * const disp_payload0_err_reg_name[] = { "Error Type:", "Error Subtype:", "Error Instance:", "Processor Socket:", "Status:", "Address:", "MISC0:", "MISC1:", "MISC2:", "MISC3:", }; /* PCIe AER Error, Payload Type 1 */ static const char * const disp_payload1_err_reg_name[] = { "Error Type:", "Error Subtype:", "Error Instance:", "Processor Socket:", "AER_UNCORR_ERR_STATUS:", "AER_UNCORR_ERR_MASK:", "AER_UNCORR_ERR_SEV:", "AER_CORR_ERR_STATUS:", "AER_CORR_ERR_MASK:", "AER_ROOT_ERR_CMD:", "AER_ROOT_ERR_STATUS:", "AER_ERR_SRC_ID:", "Reserved:", "Reserved:", }; /* PCIe RAS Data Path (RASDP), Payload Type 2 */ static const char * const disp_payload2_err_reg_name[] = { "Error Type:", "Error Subtype:", "Error Instance:", "Processor Socket:", "CE Report Register:", "CE Location Register:", "CE Address:", "UE Report Register:", "UE Location Register:", "UE Address:", "Reserved:", "Reserved:", "Reserved:", }; /* Firmware-Specific Data (ATF, SMPro, PMpro, and BERT), Payload Type 3 */ static const char * const disp_payload3_err_reg_name[] = { "Error Type:", "Error Subtype:", "Error Instance:", "Processor Socket:", "Firmware-Specific Data 0:", "Firmware-Specific Data 1:", "Firmware-Specific Data 2:", "Firmware-Specific Data 3:", "Firmware-Specific Data 4:", "Firmware-Specific Data 5:", }; static const char * const err_cpm_sub_type[] = { "Snoop-Logic", "ARMv8 Core 0", "ARMv8 Core 1", }; static const char * const err_mcu_sub_type[] = { "ERR0", "ERR1", "ERR2", "ERR3", "ERR4", "ERR5", "ERR6", "Link Error", }; static const char * const err_mesh_sub_type[] = { "Cross Point", "Home Node(IO)", "Home Node(Memory)", "CCIX Node", }; static const char * const err_2p_link_ms_sub_type[] 
= { "ERR0", "ERR1", "ERR2", "ERR3", }; static const char * const err_gic_sub_type[] = { "ERR0", "ERR1", "ERR2", "ERR3", "ERR4", "ERR5", "ERR6", "ERR7", "ERR8", "ERR9", "ERR10", "ERR11", "ERR12", "ERR13(GIC ITS 0)", "ERR14(GIC ITS 1)", "ERR15(GIC ITS 2)", "ERR16(GIC ITS 3)", "ERR17(GIC ITS 4)", "ERR18(GIC ITS 5)", "ERR19(GIC ITS 6)", "ERR20(GIC ITS 7)", }; /* SMMU subtype values are not contiguous (0x00-0x09, 0x64), so payload type 0 decodes them with a switch instead of a lookup table */ static char *err_smmu_sub_type(int etype) { switch (etype) { case 0x00: return "TBU0"; case 0x01: return "TBU1"; case 0x02: return "TBU2"; case 0x03: return "TBU3"; case 0x04: return "TBU4"; case 0x05: return "TBU5"; case 0x06: return "TBU6"; case 0x07: return "TBU7"; case 0x08: return "TBU8"; case 0x09: return "TBU9"; case 0x64: return "TCU"; } return "unknown error"; } static const char * const err_pcie_aer_sub_type[] = { "Root Port", "Device", }; /* PCIe RASDP subtype values are not contiguous, so payload types 0 and 2 decode them with a switch instead of a lookup table */ static char *err_peci_rasdp_sub_type(int etype) { switch (etype) { case 0x00: return "RCA HB Error"; case 0x01: return "RCB HB Error"; case 0x08: return "RASDP Error"; } return "unknown error"; } static const char * const err_ocm_sub_type[] = { "ERR0", "ERR1", "ERR2", }; static const char * const err_smpro_sub_type[] = { "ERR0", "ERR1", "MPA_ERR", }; static const char * const err_pmpro_sub_type[] = { "ERR0", "ERR1", "MPA_ERR", }; static const char * const err_atf_fw_sub_type[] = { "EL3", "SPM", "Secure Partition(SEL0/SEL1)", }; static const char * const err_smpro_fw_sub_type[] = { "RAS_MSG_ERR", "", }; static const char * const err_pmpro_fw_sub_type[] = { "RAS_MSG_ERR", "", }; static const char * const err_bert_sub_type[] = { "Default", "Watchdog", "ATF Fatal", "SMPRO Fatal", "PMPRO Fatal", }; static char *sqlite3_table_list[] = { "amp_payload0_event_tab", "amp_payload1_event_tab", "amp_payload2_event_tab", "amp_payload3_event_tab", }; struct amp_ras_type_info { int id; const char *name; const char * const *sub; int 
sub_num; }; static const struct amp_ras_type_info amp_payload_error_type[] = { { .id = AMP_RAS_TYPE_CPU, .name = "CPM", .sub = err_cpm_sub_type, .sub_num = ARRAY_SIZE(err_cpm_sub_type), }, { .id = AMP_RAS_TYPE_MCU, .name = "MCU", .sub = err_mcu_sub_type, .sub_num = ARRAY_SIZE(err_mcu_sub_type), }, { .id = AMP_RAS_TYPE_MESH, .name = "MESH", .sub = err_mesh_sub_type, .sub_num = ARRAY_SIZE(err_mesh_sub_type), }, { .id = AMP_RAS_TYPE_2P_LINK_QS, .name = "2P Link(Altra)", }, { .id = AMP_RAS_TYPE_2P_LINK_MQ, .name = "2P Link(Altra Max)", .sub = err_2p_link_ms_sub_type, .sub_num = ARRAY_SIZE(err_2p_link_ms_sub_type), }, { .id = AMP_RAS_TYPE_GIC, .name = "GIC", .sub = err_gic_sub_type, .sub_num = ARRAY_SIZE(err_gic_sub_type), }, { .id = AMP_RAS_TYPE_SMMU, .name = "SMMU", }, { .id = AMP_RAS_TYPE_PCIE_AER, .name = "PCIe AER", .sub = err_pcie_aer_sub_type, .sub_num = ARRAY_SIZE(err_pcie_aer_sub_type), }, { .id = AMP_RAS_TYPE_PCIE_RASDP, .name = "PCIe RASDP", }, { .id = AMP_RAS_TYPE_OCM, .name = "OCM", .sub = err_ocm_sub_type, .sub_num = ARRAY_SIZE(err_ocm_sub_type), }, { .id = AMP_RAS_TYPE_SMPRO, .name = "SMPRO", .sub = err_smpro_sub_type, .sub_num = ARRAY_SIZE(err_smpro_sub_type), }, { .id = AMP_RAS_TYPE_PMPRO, .name = "PMPRO", .sub = err_pmpro_sub_type, .sub_num = ARRAY_SIZE(err_pmpro_sub_type), }, { .id = AMP_RAS_TYPE_ATF_FW, .name = "ATF FW", .sub = err_atf_fw_sub_type, .sub_num = ARRAY_SIZE(err_atf_fw_sub_type), }, { .id = AMP_RAS_TYPE_SMPRO_FW, .name = "SMPRO FW", .sub = err_smpro_fw_sub_type, .sub_num = ARRAY_SIZE(err_smpro_fw_sub_type), }, { .id = AMP_RAS_TYPE_PMPRO_FW, .name = "PMPRO FW", .sub = err_pmpro_fw_sub_type, .sub_num = ARRAY_SIZE(err_pmpro_fw_sub_type), }, { .id = AMP_RAS_TYPE_BERT, .name = "BERT", .sub = err_bert_sub_type, .sub_num = ARRAY_SIZE(err_bert_sub_type), }, { } }; /*get the error type name*/ static const char *oem_type_name(const struct amp_ras_type_info *info, uint8_t type_id) { const struct amp_ras_type_info *type = &info[0]; for (; type->name; 
type++) { if (type->id != type_id) continue; return type->name; } return "unknown"; } /*get the error subtype*/ static const char *oem_subtype_name(const struct amp_ras_type_info *info, uint8_t type_id, uint8_t sub_type_id) { const struct amp_ras_type_info *type = &info[0]; for (; type->name; type++) { const char * const *submodule = type->sub; if (type->id != type_id) continue; if (!type->sub) return type->name; if (sub_type_id >= type->sub_num) return "unknown"; return submodule[sub_type_id]; } return "unknown"; } #ifdef HAVE_SQLITE3 /*key pair definition for ampere specific error payload type 0*/ static const struct db_fields amp_payload0_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "type", .type = "TEXT" }, { .name = "subtype", .type = "TEXT" }, { .name = "instance", .type = "INTEGER" }, { .name = "socket_num", .type = "INTEGER" }, { .name = "status_reg", .type = "INTEGER" }, { .name = "addr_reg", .type = "INTEGER" }, { .name = "misc0", .type = "INTEGER" }, { .name = "misc1", .type = "INTEGER" }, { .name = "misc2", .type = "INTEGER" }, { .name = "misc3", .type = "INTEGER" }, }; static const struct db_table_descriptor amp_payload0_event_tab = { .name = "amp_payload0_event", .fields = amp_payload0_event_fields, .num_fields = ARRAY_SIZE(amp_payload0_event_fields), }; /*key pair definition for ampere specific error payload type 1*/ static const struct db_fields amp_payload1_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "type", .type = "TEXT" }, { .name = "subtype", .type = "TEXT" }, { .name = "instance", .type = "INTEGER" }, { .name = "socket_num", .type = "INTEGER" }, { .name = "uncore_err_status", .type = "INTEGER" }, { .name = "uncore_err_mask", .type = "INTEGER" }, { .name = "uncore_err_sev", .type = "INTEGER" }, { .name = "core_err_status", .type = "INTEGER" }, { .name = "core_err_mask", .type = "INTEGER" }, { .name 
= "root_err_cmd", .type = "INTEGER" }, { .name = "root_err_status", .type = "INTEGER" }, { .name = "src_id", .type = "INTEGER" }, { .name = "reserved1", .type = "INTEGER" }, { .name = "reserverd2", .type = "INTEGER" }, }; static const struct db_table_descriptor amp_payload1_event_tab = { .name = "amp_payload1_event", .fields = amp_payload1_event_fields, .num_fields = ARRAY_SIZE(amp_payload1_event_fields), }; /*key pair definition for ampere specific error payload type 2*/ static const struct db_fields amp_payload2_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "type", .type = "TEXT" }, { .name = "subtype", .type = "TEXT" }, { .name = "instance", .type = "INTEGER" }, { .name = "socket_num", .type = "INTEGER" }, { .name = "ce_report_reg", .type = "INTEGER" }, { .name = "ce_location", .type = "INTEGER" }, { .name = "ce_addr", .type = "INTEGER" }, { .name = "ue_report_reg", .type = "INTEGER" }, { .name = "ue_location", .type = "INTEGER" }, { .name = "ue_addr", .type = "INTEGER" }, { .name = "reserved1", .type = "INTEGER" }, { .name = "reserved2", .type = "INTEGER" }, { .name = "reserved3", .type = "INTEGER" }, }; static const struct db_table_descriptor amp_payload2_event_tab = { .name = "amp_payload2_event", .fields = amp_payload2_event_fields, .num_fields = ARRAY_SIZE(amp_payload2_event_fields), }; /*key pair definition for ampere specific error payload type 3*/ static const struct db_fields amp_payload3_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "type", .type = "TEXT" }, { .name = "subtype", .type = "TEXT" }, { .name = "instance", .type = "INTEGER" }, { .name = "socket_num", .type = "INTEGER" }, { .name = "fw_spec_data0", .type = "INTEGER" }, { .name = "fw_spec_data1", .type = "INTEGER" }, { .name = "fw_spec_data2", .type = "INTEGER" }, { .name = "fw_spec_data3", .type = "INTEGER" }, { .name = "fw_spec_data4", .type = 
"INTEGER" }, { .name = "fw_spec_data5", .type = "INTEGER" }, }; static const struct db_table_descriptor amp_payload3_event_tab = { .name = "amp_payload3_event", .fields = amp_payload3_event_fields, .num_fields = ARRAY_SIZE(amp_payload3_event_fields), }; /*Save data with different type into sqlite3 db*/ static void record_amp_data(struct ras_ns_ev_decoder *ev_decoder, enum amp_oem_data_type data_type, int id, int64_t data, const char *text) { switch (data_type) { case AMP_OEM_DATA_TYPE_INT: sqlite3_bind_int(ev_decoder->stmt_dec_record, id, data); break; case AMP_OEM_DATA_TYPE_INT64: sqlite3_bind_int64(ev_decoder->stmt_dec_record, id, data); break; case AMP_OEM_DATA_TYPE_TEXT: sqlite3_bind_text(ev_decoder->stmt_dec_record, id, text, -1, NULL); break; default: break; } } static int store_amp_err_data(struct ras_ns_ev_decoder *ev_decoder, const char *name) { int rc; rc = sqlite3_step(ev_decoder->stmt_dec_record); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to do %s step on sqlite: error = %d\n", name, rc); rc = sqlite3_reset(ev_decoder->stmt_dec_record); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to reset %s on sqlite: error = %d\n", name, rc); rc = sqlite3_clear_bindings(ev_decoder->stmt_dec_record); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to clear bindings %s on sqlite: error = %d\n", name, rc); return rc; } /*save all Ampere Specific Error Payload type 0 to sqlite3 database*/ static void record_amp_payload0_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload0_type_sec *err) { if (ev_decoder) { record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD0_FIELD_TYPE, 0, type_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD0_FIELD_SUB_TYPE, 0, subtype_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD0_FIELD_INS, INSTANCE(err->instance), NULL); record_amp_data(ev_decoder, 
AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD0_FIELD_SOCKET_NUM, SOCKET_NUM(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD0_FIELD_STATUS_REG, err->err_status, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD0_FIELD_ADDR_REG, err->err_addr, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD0_FIELD_MISC0, err->err_misc_0, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD0_FIELD_MISC1, err->err_misc_1, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD0_FIELD_MISC2, err->err_misc_2, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD0_FIELD_MISC3, err->err_misc_3, NULL); store_amp_err_data(ev_decoder, "amp_payload0_event_tab"); } } /*save all Ampere Specific Error Payload type 1 to sqlite3 database*/ static void record_amp_payload1_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload1_type_sec *err) { if (ev_decoder) { record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD1_FIELD_TYPE, 0, type_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD1_FIELD_SUB_TYPE, 0, subtype_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_INS, INSTANCE(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_SOCKET_NUM, SOCKET_NUM(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_UNCORE_ERR_STATUS, err->uncore_status, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_UNCORE_ERR_MASK, err->uncore_mask, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_UNCORE_ERR_SEV, err->uncore_sev, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_CORE_ERR_STATUS, err->core_status, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_CORE_ERR_MASK, err->core_mask, NULL); 
record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_ROOT_ERR_CMD, err->root_err_cmd, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_ROOT_ERR_STATUS, err->root_status, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_SRC_ID, err->src_id, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD1_FIELD_RESERVED1, err->reserved1, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD1_FIELD_RESERVED2, err->reserved2, NULL); store_amp_err_data(ev_decoder, "amp_payload1_event_tab"); } } /*save all Ampere Specific Error Payload type 2 to sqlite3 database*/ static void record_amp_payload2_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload2_type_sec *err) { if (ev_decoder) { record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD2_FIELD_TYPE, 0, type_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD2_FIELD_SUB_TYPE, 0, subtype_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_INS, INSTANCE(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_SOCKET_NUM, SOCKET_NUM(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_CE_REPORT_REG, err->ce_register, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_CE_LOACATION, err->ce_location, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_CE_ADDR, err->ce_addr, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_UE_REPORT_REG, err->ue_register, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_UE_LOCATION, err->ue_location, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_UE_ADDR, err->ue_addr, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD2_FIELD_RESERVED1, err->reserved1, NULL); 
record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD2_FIELD_RESERVED2, err->reserved2, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD2_FIELD_RESERVED3, err->reserved3, NULL); store_amp_err_data(ev_decoder, "amp_payload2_event_tab"); } } /*save all Ampere Specific Error Payload type 3 to sqlite3 database*/ static void record_amp_payload3_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload3_type_sec *err) { if (ev_decoder) { record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD3_FIELD_TYPE, 0, type_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, AMP_PAYLOAD3_FIELD_SUB_TYPE, 0, subtype_str); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD3_FIELD_INS, INSTANCE(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD3_FIELD_SOCKET_NUM, SOCKET_NUM(err->instance), NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA0, err->fw_speci_data0, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA1, err->fw_speci_data1, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA2, err->fw_speci_data2, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA3, err->fw_speci_data3, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA4, err->fw_speci_data4, NULL); record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_INT64, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA5, err->fw_speci_data5, NULL); store_amp_err_data(ev_decoder, "amp_payload3_event_tab"); } } #else static void record_amp_data(struct ras_ns_ev_decoder *ev_decoder, enum amp_oem_data_type data_type, int id, int64_t data, const char *text) { } static void record_amp_payload0_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload0_type_sec *err) 
{ } static void record_amp_payload1_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload1_type_sec *err) { } static void record_amp_payload2_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload2_type_sec *err) { } static void record_amp_payload3_err(struct ras_ns_ev_decoder *ev_decoder, const char *type_str, const char *subtype_str, const struct amp_payload3_type_sec *err) { } static int store_amp_err_data(struct ras_ns_ev_decoder *ev_decoder, char *name) { return 0; } #endif /*decode ampere specific error payload type 0; the CPU's data is saved*/ /*to sqlite by ras-arm-handler, others are saved by this function.*/ void decode_amp_payload0_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct amp_payload0_type_sec *err) { char buf[AMP_PAYLOAD0_BUF_LEN]; char *p = buf; char *end = buf + AMP_PAYLOAD0_BUF_LEN; int i = 0, core_num = 0; const char *subtype_str; const char *type_str = oem_type_name(amp_payload_error_type, TYPE(err->type)); if (TYPE(err->type) == AMP_RAS_TYPE_SMMU) subtype_str = err_smmu_sub_type(err->subtype); else subtype_str = oem_subtype_name(amp_payload_error_type, TYPE(err->type), err->subtype); //display error type p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", type_str); //display error subtype p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", subtype_str); //display error instance p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", INSTANCE(err->instance)); //display socket number if ((TYPE(err->type) == AMP_RAS_TYPE_CPU) && ((err->subtype == 0x01) || (err->subtype == 0x02))) { core_num = INSTANCE(err->instance) * 2 + err->subtype - 1; p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %d, Core Number is:%d\n",
SOCKET_NUM(err->instance), core_num); } else { p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %d\n", SOCKET_NUM(err->instance)); } //display status register p += snprintf(p, end - p, " %s", disp_payload0_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->err_status); //display address register p += snprintf(p, end - p, " %s", disp_payload0_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->err_addr); //display MISC0 p += snprintf(p, end - p, " %s", disp_payload0_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->err_misc_0); //display MISC1 p += snprintf(p, end - p, " %s", disp_payload0_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->err_misc_1); //display MISC2 p += snprintf(p, end - p, " %s", disp_payload0_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->err_misc_2); //display MISC3 p += snprintf(p, end - p, " %s", disp_payload0_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->err_misc_3); if (p > buf && p < end) { p--; *p = '\0'; } record_amp_payload0_err(ev_decoder, type_str, subtype_str, err); i = 0; p = NULL; end = NULL; trace_seq_printf(s, "%s\n", buf); } /*decode ampere specific error payload type 1 and save to sqlite db*/ static void decode_amp_payload1_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct amp_payload1_type_sec *err) { char buf[AMP_PAYLOAD0_BUF_LEN]; char *p = buf; char *end = buf + AMP_PAYLOAD0_BUF_LEN; int i = 0; const char *type_str = oem_type_name(amp_payload_error_type, TYPE(err->type)); const char *subtype_str = oem_subtype_name(amp_payload_error_type, TYPE(err->type), err->subtype); //display error type p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", type_str); //display error subtype p += snprintf(p, end - p, " %s", 
disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", subtype_str); //display error instance p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", INSTANCE(err->instance)); //display socket number p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " %d\n", SOCKET_NUM(err->instance)); //display AER_UNCORR_ERR_STATUS p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->uncore_status); //display AER_UNCORR_ERR_MASK p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->uncore_mask); //display AER_UNCORR_ERR_SEV p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->uncore_sev); //display AER_CORR_ERR_STATUS p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->core_status); //display AER_CORR_ERR_MASK p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->core_mask); //display AER_ROOT_ERR_CMD p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->root_err_cmd); //display AER_ROOT_ERR_STATUS p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->root_status); //display AER_ERR_SRC_ID p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->src_id); //display Reserved p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->reserved1); //display Reserved p += snprintf(p, end - p, " %s", disp_payload1_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->reserved2); if (p > buf && p < end) { p--; *p = '\0'; } record_amp_payload1_err(ev_decoder, type_str,
subtype_str, err); i = 0; p = NULL; end = NULL; trace_seq_printf(s, "%s\n", buf); } /*decode ampere specific error payload type 2 and save to sqlite db*/ static void decode_amp_payload2_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct amp_payload2_type_sec *err) { char buf[AMP_PAYLOAD0_BUF_LEN]; char *p = buf; char *end = buf + AMP_PAYLOAD0_BUF_LEN; int i = 0; const char *subtype_str; const char *type_str = oem_type_name(amp_payload_error_type, TYPE(err->type)); if (TYPE(err->type) == AMP_RAS_TYPE_PCIE_RASDP) subtype_str = err_peci_rasdp_sub_type(err->subtype); else subtype_str = oem_subtype_name(amp_payload_error_type, TYPE(err->type), err->subtype); //display error type p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", type_str); //display error subtype p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", subtype_str); //display error instance p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", INSTANCE(err->instance)); //display socket number p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " %d\n", SOCKET_NUM(err->instance)); //display CE Report Register p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->ce_register); //display CE Location Register p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->ce_location); //display CE Address p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->ce_addr); //display UE Report Register p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->ue_register); //display UE Location Register p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, "
0x%x\n", err->ue_location); //display UE Address p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->ue_addr); //display Reserved p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", err->reserved1); //display Reserved p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->reserved2); //display Reserved p += snprintf(p, end - p, " %s", disp_payload2_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->reserved3); if (p > buf && p < end) { p--; *p = '\0'; } record_amp_payload2_err(ev_decoder, type_str, subtype_str, err); i = 0; p = NULL; end = NULL; trace_seq_printf(s, "%s\n", buf); } /*decode ampere specific error payload type 3 and save to sqlite db*/ static void decode_amp_payload3_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct amp_payload3_type_sec *err) { char buf[AMP_PAYLOAD0_BUF_LEN]; char *p = buf; char *end = buf + AMP_PAYLOAD0_BUF_LEN; int i = 0; const char *type_str = oem_type_name(amp_payload_error_type, TYPE(err->type)); const char *subtype_str = oem_subtype_name(amp_payload_error_type, TYPE(err->type), err->subtype); //display error type p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", type_str); //display error subtype p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " %s\n", subtype_str); //display error instance p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x\n", INSTANCE(err->instance)); //display socket number p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " %d\n", SOCKET_NUM(err->instance)); //display Firmware-Specific Data 0 p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, 
end - p, " 0x%x\n", err->fw_speci_data0); //display Firmware-Specific Data 1 p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->fw_speci_data1); //display Firmware-Specific Data 2 p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->fw_speci_data2); //display Firmware-Specific Data 3 p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->fw_speci_data3); //display Firmware-Specific Data 4 p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->fw_speci_data4); //display Firmware-Specific Data 5 p += snprintf(p, end - p, " %s", disp_payload3_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%llx\n", (unsigned long long)err->fw_speci_data5); if (p > buf && p < end) { p--; *p = '\0'; } record_amp_payload3_err(ev_decoder, type_str, subtype_str, err); i = 0; p = NULL; end = NULL; trace_seq_printf(s, "%s\n", buf); } /* error data decoding functions */ static int decode_amp_oem_type_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { int payload_type = PAYLOAD_TYPE(event->error[0]); #ifdef HAVE_SQLITE3 struct db_table_descriptor db_tab; int id = 0; if (payload_type == PAYLOAD_TYPE_0) { db_tab = amp_payload0_event_tab; id = AMP_PAYLOAD0_FIELD_TIMESTAMP; } else if (payload_type == PAYLOAD_TYPE_1) { db_tab = amp_payload1_event_tab; id = AMP_PAYLOAD1_FIELD_TIMESTAMP; } else if (payload_type == PAYLOAD_TYPE_2) { db_tab = amp_payload2_event_tab; id = AMP_PAYLOAD2_FIELD_TIMESTAMP; } else if (payload_type == PAYLOAD_TYPE_3) { db_tab = amp_payload3_event_tab; id = AMP_PAYLOAD3_FIELD_TIMESTAMP; } else return -1; if (!ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, 
&ev_decoder->stmt_dec_record, &db_tab) != SQLITE_OK) { trace_seq_printf(s, "create sql %s fail\n", sqlite3_table_list[payload_type]); return -1; } } record_amp_data(ev_decoder, AMP_OEM_DATA_TYPE_TEXT, id, 0, event->timestamp); #endif if (payload_type == PAYLOAD_TYPE_0) { const struct amp_payload0_type_sec *err = (struct amp_payload0_type_sec *)event->error; decode_amp_payload0_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_1) { const struct amp_payload1_type_sec *err = (struct amp_payload1_type_sec *)event->error; decode_amp_payload1_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_2) { const struct amp_payload2_type_sec *err = (struct amp_payload2_type_sec *)event->error; decode_amp_payload2_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_3) { const struct amp_payload3_type_sec *err = (struct amp_payload3_type_sec *)event->error; decode_amp_payload3_err_regs(ev_decoder, s, err); } else { trace_seq_printf(s, "%s: wrong payload type\n", __func__); return -1; } return 0; } struct ras_ns_ev_decoder amp_ns_oem_decoder[] = { { .sec_type = "e8ed898ddf1643cc8ecc54f060ef157f", .decode = decode_amp_oem_type_error, }, }; static void __attribute__((constructor)) amp_init(void) { register_ns_ev_decoder(amp_ns_oem_decoder); } 0707010000003C000081A400000000000000000000000165C04BE4000011F8000000000000000000000000000000000000003500000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-ampere.h/* * Copyright (c) 2020, Ampere Computing LLC. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* */ #ifndef __NON_STANDARD_AMPERE_H #define __NON_STANDARD_AMPERE_H #include "ras-events.h" #include <traceevent/event-parse.h> #define SOCKET_NUM(x) (((x) >> 14) & 0x3) #define PAYLOAD_TYPE(x) (((x) >> 6) & 0x3) #define TYPE(x) ((x) & 0x3f) #define INSTANCE(x) ((x) & 0x3fff) #define AMP_PAYLOAD0_BUF_LEN 1024 #define PAYLOAD_TYPE_0 0x00 #define PAYLOAD_TYPE_1 0x01 #define PAYLOAD_TYPE_2 0x02 #define PAYLOAD_TYPE_3 0x03 /* Ampere RAS Error type definitions */ #define AMP_RAS_TYPE_CPU 0 #define AMP_RAS_TYPE_MCU 1 #define AMP_RAS_TYPE_MESH 2 #define AMP_RAS_TYPE_2P_LINK_QS 3 #define AMP_RAS_TYPE_2P_LINK_MQ 4 #define AMP_RAS_TYPE_GIC 5 #define AMP_RAS_TYPE_SMMU 6 #define AMP_RAS_TYPE_PCIE_AER 7 #define AMP_RAS_TYPE_PCIE_RASDP 8 #define AMP_RAS_TYPE_OCM 9 #define AMP_RAS_TYPE_SMPRO 10 #define AMP_RAS_TYPE_PMPRO 11 #define AMP_RAS_TYPE_ATF_FW 12 #define AMP_RAS_TYPE_SMPRO_FW 13 #define AMP_RAS_TYPE_PMPRO_FW 14 #define AMP_RAS_TYPE_BERT 63 /* ARMv8 RAS Compliant Error Record (APEI and BMC Reporting) */ struct amp_payload0_type_sec { uint8_t type; uint8_t subtype; uint16_t instance; uint32_t err_status; uint64_t err_addr; uint64_t err_misc_0; uint64_t err_misc_1; uint64_t err_misc_2; uint64_t err_misc_3; }; /* PCIe AER format */ struct amp_payload1_type_sec { uint8_t type; uint8_t subtype; uint16_t instance; uint32_t uncore_status; uint32_t uncore_mask; uint32_t uncore_sev; uint32_t core_status; uint32_t core_mask; uint32_t root_err_cmd; uint32_t root_status; uint32_t src_id; uint32_t reserved1; uint64_t reserved2; }; /* PCIe RAS Data Path (RASDP) format */ struct amp_payload2_type_sec { uint8_t type; uint8_t subtype; uint16_t instance; uint32_t ce_register; uint32_t ce_location; uint32_t ce_addr; uint32_t ue_register; uint32_t ue_location; uint32_t ue_addr; uint32_t reserved1; uint64_t reserved2; uint64_t reserved3; }; /* Firmware-Specific Data (ATF, SMPro, and BERT) */ struct amp_payload3_type_sec { uint8_t type; uint8_t subtype; uint16_t instance; uint32_t fw_speci_data0; uint64_t
fw_speci_data1; uint64_t fw_speci_data2; uint64_t fw_speci_data3; uint64_t fw_speci_data4; uint64_t fw_speci_data5; }; enum amp_oem_data_type { AMP_OEM_DATA_TYPE_INT, AMP_OEM_DATA_TYPE_INT64, AMP_OEM_DATA_TYPE_TEXT, }; enum { AMP_PAYLOAD0_FIELD_ID, AMP_PAYLOAD0_FIELD_TIMESTAMP, AMP_PAYLOAD0_FIELD_TYPE, AMP_PAYLOAD0_FIELD_SUB_TYPE, AMP_PAYLOAD0_FIELD_INS, AMP_PAYLOAD0_FIELD_SOCKET_NUM, AMP_PAYLOAD0_FIELD_STATUS_REG, AMP_PAYLOAD0_FIELD_ADDR_REG, AMP_PAYLOAD0_FIELD_MISC0, AMP_PAYLOAD0_FIELD_MISC1, AMP_PAYLOAD0_FIELD_MISC2, AMP_PAYLOAD0_FIELD_MISC3, }; enum { AMP_PAYLOAD1_FIELD_ID, AMP_PAYLOAD1_FIELD_TIMESTAMP, AMP_PAYLOAD1_FIELD_TYPE, AMP_PAYLOAD1_FIELD_SUB_TYPE, AMP_PAYLOAD1_FIELD_INS, AMP_PAYLOAD1_FIELD_SOCKET_NUM, AMP_PAYLOAD1_FIELD_UNCORE_ERR_STATUS, AMP_PAYLOAD1_FIELD_UNCORE_ERR_MASK, AMP_PAYLOAD1_FIELD_UNCORE_ERR_SEV, AMP_PAYLOAD1_FIELD_CORE_ERR_STATUS, AMP_PAYLOAD1_FIELD_CORE_ERR_MASK, AMP_PAYLOAD1_FIELD_ROOT_ERR_CMD, AMP_PAYLOAD1_FIELD_ROOT_ERR_STATUS, AMP_PAYLOAD1_FIELD_SRC_ID, AMP_PAYLOAD1_FIELD_RESERVED1, AMP_PAYLOAD1_FIELD_RESERVED2, }; enum { AMP_PAYLOAD2_FIELD_ID, AMP_PAYLOAD2_FIELD_TIMESTAMP, AMP_PAYLOAD2_FIELD_TYPE, AMP_PAYLOAD2_FIELD_SUB_TYPE, AMP_PAYLOAD2_FIELD_INS, AMP_PAYLOAD2_FIELD_SOCKET_NUM, AMP_PAYLOAD2_FIELD_CE_REPORT_REG, AMP_PAYLOAD2_FIELD_CE_LOACATION, AMP_PAYLOAD2_FIELD_CE_ADDR, AMP_PAYLOAD2_FIELD_UE_REPORT_REG, AMP_PAYLOAD2_FIELD_UE_LOCATION, AMP_PAYLOAD2_FIELD_UE_ADDR, AMP_PAYLOAD2_FIELD_RESERVED1, AMP_PAYLOAD2_FIELD_RESERVED2, AMP_PAYLOAD2_FIELD_RESERVED3, }; enum { AMP_PAYLOAD3_FIELD_ID, AMP_PAYLOAD3_FIELD_TIMESTAMP, AMP_PAYLOAD3_FIELD_TYPE, AMP_PAYLOAD3_FIELD_SUB_TYPE, AMP_PAYLOAD3_FIELD_INS, AMP_PAYLOAD3_FIELD_SOCKET_NUM, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA0, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA1, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA2, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA3, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA4, AMP_PAYLOAD3_FIELD_FW_SPEC_DATA5 }; void decode_amp_payload0_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct 
amp_payload0_type_sec *err); #endif 0707010000003D000081A400000000000000000000000165C04BE400007116000000000000000000000000000000000000003900000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-hisi_hip08.c/* * Copyright (c) 2019 Hisilicon Limited. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" #include "ras-non-standard-handler.h" #include "non-standard-hisilicon.h" /* HISI OEM error definitions */ /* HISI OEM format1 error definitions */ #define HISI_OEM_MODULE_ID_MN 0 #define HISI_OEM_MODULE_ID_PLL 1 #define HISI_OEM_MODULE_ID_SLLC 2 #define HISI_OEM_MODULE_ID_AA 3 #define HISI_OEM_MODULE_ID_SIOE 4 #define HISI_OEM_MODULE_ID_POE 5 #define HISI_OEM_MODULE_ID_DISP 8 #define HISI_OEM_MODULE_ID_LPC 9 #define HISI_OEM_MODULE_ID_GIC 13 #define HISI_OEM_MODULE_ID_RDE 14 #define HISI_OEM_MODULE_ID_SAS 15 #define HISI_OEM_MODULE_ID_SATA 16 #define HISI_OEM_MODULE_ID_USB 17 #define HISI_OEM_VALID_SOC_ID BIT(0) #define HISI_OEM_VALID_SOCKET_ID BIT(1) #define HISI_OEM_VALID_NIMBUS_ID BIT(2) #define HISI_OEM_VALID_MODULE_ID BIT(3) #define HISI_OEM_VALID_SUB_MODULE_ID BIT(4) #define HISI_OEM_VALID_ERR_SEVERITY BIT(5) #define HISI_OEM_TYPE1_VALID_ERR_MISC_0 BIT(6) #define HISI_OEM_TYPE1_VALID_ERR_MISC_1 BIT(7) #define HISI_OEM_TYPE1_VALID_ERR_MISC_2 BIT(8) #define HISI_OEM_TYPE1_VALID_ERR_MISC_3 BIT(9) #define HISI_OEM_TYPE1_VALID_ERR_MISC_4 BIT(10) #define HISI_OEM_TYPE1_VALID_ERR_ADDR BIT(11) /* HISI OEM format2 error definitions */ #define HISI_OEM_MODULE_ID_SMMU 0 #define HISI_OEM_MODULE_ID_HHA 1 #define HISI_OEM_MODULE_ID_PA 2 #define HISI_OEM_MODULE_ID_HLLC 3 #define HISI_OEM_MODULE_ID_DDRC 4 #define HISI_OEM_MODULE_ID_L3T 5 #define 
HISI_OEM_MODULE_ID_L3D 6 #define HISI_OEM_TYPE2_VALID_ERR_FR BIT(6) #define HISI_OEM_TYPE2_VALID_ERR_CTRL BIT(7) #define HISI_OEM_TYPE2_VALID_ERR_STATUS BIT(8) #define HISI_OEM_TYPE2_VALID_ERR_ADDR BIT(9) #define HISI_OEM_TYPE2_VALID_ERR_MISC_0 BIT(10) #define HISI_OEM_TYPE2_VALID_ERR_MISC_1 BIT(11) /* HISI PCIe Local error definitions */ #define HISI_PCIE_SUB_MODULE_ID_AP 0 #define HISI_PCIE_SUB_MODULE_ID_TL 1 #define HISI_PCIE_SUB_MODULE_ID_MAC 2 #define HISI_PCIE_SUB_MODULE_ID_DL 3 #define HISI_PCIE_SUB_MODULE_ID_SDI 4 #define HISI_PCIE_LOCAL_VALID_VERSION BIT(0) #define HISI_PCIE_LOCAL_VALID_SOC_ID BIT(1) #define HISI_PCIE_LOCAL_VALID_SOCKET_ID BIT(2) #define HISI_PCIE_LOCAL_VALID_NIMBUS_ID BIT(3) #define HISI_PCIE_LOCAL_VALID_SUB_MODULE_ID BIT(4) #define HISI_PCIE_LOCAL_VALID_CORE_ID BIT(5) #define HISI_PCIE_LOCAL_VALID_PORT_ID BIT(6) #define HISI_PCIE_LOCAL_VALID_ERR_TYPE BIT(7) #define HISI_PCIE_LOCAL_VALID_ERR_SEVERITY BIT(8) #define HISI_PCIE_LOCAL_VALID_ERR_MISC 9 #define HISI_PCIE_LOCAL_ERR_MISC_MAX 33 #define HISI_BUF_LEN 1024 struct hisi_oem_type1_err_sec { uint32_t val_bits; uint8_t version; uint8_t soc_id; uint8_t socket_id; uint8_t nimbus_id; uint8_t module_id; uint8_t sub_module_id; uint8_t err_severity; uint8_t reserv; uint32_t err_misc_0; uint32_t err_misc_1; uint32_t err_misc_2; uint32_t err_misc_3; uint32_t err_misc_4; uint64_t err_addr; }; struct hisi_oem_type2_err_sec { uint32_t val_bits; uint8_t version; uint8_t soc_id; uint8_t socket_id; uint8_t nimbus_id; uint8_t module_id; uint8_t sub_module_id; uint8_t err_severity; uint8_t reserv; uint32_t err_fr_0; uint32_t err_fr_1; uint32_t err_ctrl_0; uint32_t err_ctrl_1; uint32_t err_status_0; uint32_t err_status_1; uint32_t err_addr_0; uint32_t err_addr_1; uint32_t err_misc0_0; uint32_t err_misc0_1; uint32_t err_misc1_0; uint32_t err_misc1_1; }; struct hisi_pcie_local_err_sec { uint64_t val_bits; uint8_t version; uint8_t soc_id; uint8_t socket_id; uint8_t nimbus_id; uint8_t sub_module_id; uint8_t 
core_id; uint8_t port_id; uint8_t err_severity; uint16_t err_type; uint8_t reserv[2]; uint32_t err_misc[HISI_PCIE_LOCAL_ERR_MISC_MAX]; }; enum { HIP08_OEM_TYPE1_FIELD_ID, HIP08_OEM_TYPE1_FIELD_TIMESTAMP, HIP08_OEM_TYPE1_FIELD_VERSION, HIP08_OEM_TYPE1_FIELD_SOC_ID, HIP08_OEM_TYPE1_FIELD_SOCKET_ID, HIP08_OEM_TYPE1_FIELD_NIMBUS_ID, HIP08_OEM_TYPE1_FIELD_MODULE_ID, HIP08_OEM_TYPE1_FIELD_SUB_MODULE_ID, HIP08_OEM_TYPE1_FIELD_ERR_SEV, HIP08_OEM_TYPE1_FIELD_REGS_DUMP, }; enum { HIP08_OEM_TYPE2_FIELD_ID, HIP08_OEM_TYPE2_FIELD_TIMESTAMP, HIP08_OEM_TYPE2_FIELD_VERSION, HIP08_OEM_TYPE2_FIELD_SOC_ID, HIP08_OEM_TYPE2_FIELD_SOCKET_ID, HIP08_OEM_TYPE2_FIELD_NIMBUS_ID, HIP08_OEM_TYPE2_FIELD_MODULE_ID, HIP08_OEM_TYPE2_FIELD_SUB_MODULE_ID, HIP08_OEM_TYPE2_FIELD_ERR_SEV, HIP08_OEM_TYPE2_FIELD_REGS_DUMP, }; enum { HIP08_PCIE_LOCAL_FIELD_ID, HIP08_PCIE_LOCAL_FIELD_TIMESTAMP, HIP08_PCIE_LOCAL_FIELD_VERSION, HIP08_PCIE_LOCAL_FIELD_SOC_ID, HIP08_PCIE_LOCAL_FIELD_SOCKET_ID, HIP08_PCIE_LOCAL_FIELD_NIMBUS_ID, HIP08_PCIE_LOCAL_FIELD_SUB_MODULE_ID, HIP08_PCIE_LOCAL_FIELD_CORE_ID, HIP08_PCIE_LOCAL_FIELD_PORT_ID, HIP08_PCIE_LOCAL_FIELD_ERR_SEV, HIP08_PCIE_LOCAL_FIELD_ERR_TYPE, HIP08_PCIE_LOCAL_FIELD_REGS_DUMP, }; struct hisi_module_info { int id; const char *name; const char **sub; int sub_num; }; static const char *pll_submodule_name[] = { "TB_PLL0", "TB_PLL1", "TB_PLL2", "TB_PLL3", "TA_PLL0", "TA_PLL1", "TA_PLL2", "TA_PLL3", "NIMBUS_PLL0", "NIMBUS_PLL1", "NIMBUS_PLL2", "NIMBUS_PLL3", "NIMBUS_PLL4", }; static const char *sllc_submodule_name[] = { "TB_SLLC0", "TB_SLLC1", "TB_SLLC2", "TA_SLLC0", "TA_SLLC1", "TA_SLLC2", "NIMBUS_SLLC0", "NIMBUS_SLLC1", }; static const char *sioe_submodule_name[] = { "TB_SIOE0", "TB_SIOE1", "TB_SIOE2", "TB_SIOE3", "TA_SIOE0", "TA_SIOE1", "TA_SIOE2", "TA_SIOE3", "NIMBUS_SIOE0", "NIMBUS_SIOE1", }; static const char *poe_submodule_name[] = { "TB_POE", "TA_POE", }; static const char *disp_submodule_name[] = { "TB_PERI_DISP", "TB_POE_DISP", "TB_GIC_DISP", "TA_PERI_DISP", 
"TA_POE_DISP", "TA_GIC_DISP", "HAC_DISP", "PCIE_DISP", "IO_MGMT_DISP", "NETWORK_DISP", }; static const char *sas_submodule_name[] = { "SAS0", "SAS1", }; static const struct hisi_module_info hisi_oem_type1_module[] = { { .id = HISI_OEM_MODULE_ID_PLL, .name = "PLL", .sub = pll_submodule_name, .sub_num = ARRAY_SIZE(pll_submodule_name), }, { .id = HISI_OEM_MODULE_ID_SAS, .name = "SAS", .sub = sas_submodule_name, .sub_num = ARRAY_SIZE(sas_submodule_name), }, { .id = HISI_OEM_MODULE_ID_POE, .name = "POE", .sub = poe_submodule_name, .sub_num = ARRAY_SIZE(poe_submodule_name), }, { .id = HISI_OEM_MODULE_ID_SLLC, .name = "SLLC", .sub = sllc_submodule_name, .sub_num = ARRAY_SIZE(sllc_submodule_name), }, { .id = HISI_OEM_MODULE_ID_SIOE, .name = "SIOE", .sub = sioe_submodule_name, .sub_num = ARRAY_SIZE(sioe_submodule_name), }, { .id = HISI_OEM_MODULE_ID_DISP, .name = "DISP", .sub = disp_submodule_name, .sub_num = ARRAY_SIZE(disp_submodule_name), }, { .id = HISI_OEM_MODULE_ID_MN, .name = "MN", }, { .id = HISI_OEM_MODULE_ID_AA, .name = "AA", }, { .id = HISI_OEM_MODULE_ID_LPC, .name = "LPC", }, { .id = HISI_OEM_MODULE_ID_GIC, .name = "GIC", }, { .id = HISI_OEM_MODULE_ID_RDE, .name = "RDE", }, { .id = HISI_OEM_MODULE_ID_SATA, .name = "SATA", }, { .id = HISI_OEM_MODULE_ID_USB, .name = "USB", }, { } }; static const char *smmu_submodule_name[] = { "HAC_SMMU", "PCIE_SMMU", "MGMT_SMMU", "NIC_SMMU", }; static const char *hllc_submodule_name[] = { "HLLC0", "HLLC1", "HLLC2", }; static const char *hha_submodule_name[] = { "TB_HHA0", "TB_HHA1", "TA_HHA0", "TA_HHA1" }; static const char *ddrc_submodule_name[] = { "TB_DDRC0", "TB_DDRC1", "TB_DDRC2", "TB_DDRC3", "TA_DDRC0", "TA_DDRC1", "TA_DDRC2", "TA_DDRC3", }; static const char *l3tag_submodule_name[] = { "TB_PARTITION0", "TB_PARTITION1", "TB_PARTITION2", "TB_PARTITION3", "TB_PARTITION4", "TB_PARTITION5", "TB_PARTITION6", "TB_PARTITION7", "TA_PARTITION0", "TA_PARTITION1", "TA_PARTITION2", "TA_PARTITION3", "TA_PARTITION4", "TA_PARTITION5", 
"TA_PARTITION6", "TA_PARTITION7", }; static const char *l3data_submodule_name[] = { "TB_BANK0", "TB_BANK1", "TB_BANK2", "TB_BANK3", "TA_BANK0", "TA_BANK1", "TA_BANK2", "TA_BANK3", }; static const struct hisi_module_info hisi_oem_type2_module[] = { { .id = HISI_OEM_MODULE_ID_SMMU, .name = "SMMU", .sub = smmu_submodule_name, .sub_num = ARRAY_SIZE(smmu_submodule_name), }, { .id = HISI_OEM_MODULE_ID_HHA, .name = "HHA", .sub = hha_submodule_name, .sub_num = ARRAY_SIZE(hha_submodule_name), }, { .id = HISI_OEM_MODULE_ID_PA, .name = "PA", }, { .id = HISI_OEM_MODULE_ID_HLLC, .name = "HLLC", .sub = hllc_submodule_name, .sub_num = ARRAY_SIZE(hllc_submodule_name), }, { .id = HISI_OEM_MODULE_ID_DDRC, .name = "DDRC", .sub = ddrc_submodule_name, .sub_num = ARRAY_SIZE(ddrc_submodule_name), }, { .id = HISI_OEM_MODULE_ID_L3T, .name = "L3TAG", .sub = l3tag_submodule_name, .sub_num = ARRAY_SIZE(l3tag_submodule_name), }, { .id = HISI_OEM_MODULE_ID_L3D, .name = "L3DATA", .sub = l3data_submodule_name, .sub_num = ARRAY_SIZE(l3data_submodule_name), }, { } }; static const char *oem_module_name(const struct hisi_module_info *info, uint8_t module_id) { const struct hisi_module_info *module = &info[0]; for (; module->name; module++) { if (module->id != module_id) continue; return module->name; } return "unknown"; } static const char *oem_submodule_name(const struct hisi_module_info *info, uint8_t module_id, uint8_t sub_module_id) { const struct hisi_module_info *module = &info[0]; for (; module->name; module++) { const char **submodule = module->sub; if (module->id != module_id) continue; if (!module->sub) return module->name; if (sub_module_id >= module->sub_num) return "unknown"; return submodule[sub_module_id]; } return "unknown"; } static char *pcie_local_sub_module_name(uint8_t id) { switch (id) { case HISI_PCIE_SUB_MODULE_ID_AP: return "AP_Layer"; case HISI_PCIE_SUB_MODULE_ID_TL: return "TL_Layer"; case HISI_PCIE_SUB_MODULE_ID_MAC: return "MAC_Layer"; case HISI_PCIE_SUB_MODULE_ID_DL: 
return "DL_Layer"; case HISI_PCIE_SUB_MODULE_ID_SDI: return "SDI_Layer"; default: break; } return "unknown"; } #ifdef HAVE_SQLITE3 static const struct db_fields hip08_oem_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "version", .type = "INTEGER" }, { .name = "soc_id", .type = "INTEGER" }, { .name = "socket_id", .type = "INTEGER" }, { .name = "nimbus_id", .type = "INTEGER" }, { .name = "module_id", .type = "TEXT" }, { .name = "sub_module_id", .type = "TEXT" }, { .name = "err_severity", .type = "TEXT" }, { .name = "regs_dump", .type = "TEXT" }, }; static const struct db_table_descriptor hip08_oem_type1_event_tab = { .name = "hip08_oem_type1_event_v2", .fields = hip08_oem_event_fields, .num_fields = ARRAY_SIZE(hip08_oem_event_fields), }; static const struct db_table_descriptor hip08_oem_type2_event_tab = { .name = "hip08_oem_type2_event_v2", .fields = hip08_oem_event_fields, .num_fields = ARRAY_SIZE(hip08_oem_event_fields), }; static const struct db_fields hip08_pcie_local_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "version", .type = "INTEGER" }, { .name = "soc_id", .type = "INTEGER" }, { .name = "socket_id", .type = "INTEGER" }, { .name = "nimbus_id", .type = "INTEGER" }, { .name = "sub_module_id", .type = "TEXT" }, { .name = "core_id", .type = "INTEGER" }, { .name = "port_id", .type = "INTEGER" }, { .name = "err_severity", .type = "TEXT" }, { .name = "err_type", .type = "INTEGER" }, { .name = "regs_dump", .type = "TEXT" }, }; static const struct db_table_descriptor hip08_pcie_local_event_tab = { .name = "hip08_pcie_local_event_v2", .fields = hip08_pcie_local_event_fields, .num_fields = ARRAY_SIZE(hip08_pcie_local_event_fields), }; #endif #define IN_RANGE(p, start, end) ((p) >= (start) && (p) < (end)) static void decode_oem_type1_err_hdr(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct 
hisi_oem_type1_err_sec *err) { char buf[HISI_BUF_LEN]; char *p = buf; char *end = buf + HISI_BUF_LEN; p += snprintf(p, end - p, "[ table_version=%d ", err->version); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE1_FIELD_VERSION, err->version, NULL); if (err->val_bits & HISI_OEM_VALID_SOC_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "SOC_ID=%d ", err->soc_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE1_FIELD_SOC_ID, err->soc_id, NULL); } if (err->val_bits & HISI_OEM_VALID_SOCKET_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "socket_ID=%d ", err->socket_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE1_FIELD_SOCKET_ID, err->socket_id, NULL); } if (err->val_bits & HISI_OEM_VALID_NIMBUS_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "nimbus_ID=%d ", err->nimbus_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE1_FIELD_NIMBUS_ID, err->nimbus_id, NULL); } if (err->val_bits & HISI_OEM_VALID_MODULE_ID && IN_RANGE(p, buf, end)) { const char *str = oem_module_name(hisi_oem_type1_module, err->module_id); p += snprintf(p, end - p, "module=%s ", str); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE1_FIELD_MODULE_ID, 0, str); } if (err->val_bits & HISI_OEM_VALID_SUB_MODULE_ID && IN_RANGE(p, buf, end)) { const char *str = oem_submodule_name(hisi_oem_type1_module, err->module_id, err->sub_module_id); p += snprintf(p, end - p, "submodule=%s ", str); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE1_FIELD_SUB_MODULE_ID, 0, str); } if (err->val_bits & HISI_OEM_VALID_ERR_SEVERITY && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "error_severity=%s ", err_severity(err->err_severity)); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE1_FIELD_ERR_SEV, 0, err_severity(err->err_severity)); } if (IN_RANGE(p, buf, end)) p += snprintf(p, end - p, "]"); trace_seq_printf(s, "%s\n", buf); } 
static void decode_oem_type1_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct hisi_oem_type1_err_sec *err) { char buf[HISI_BUF_LEN]; char *p = buf; char *end = buf + HISI_BUF_LEN; trace_seq_printf(s, "Reg Dump:\n"); if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_0) { trace_seq_printf(s, "ERR_MISC0=0x%x\n", err->err_misc_0); p += snprintf(p, end - p, "ERR_MISC0=0x%x ", err->err_misc_0); } if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_1 && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC1=0x%x\n", err->err_misc_1); p += snprintf(p, end - p, "ERR_MISC1=0x%x ", err->err_misc_1); } if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_2 && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC2=0x%x\n", err->err_misc_2); p += snprintf(p, end - p, "ERR_MISC2=0x%x ", err->err_misc_2); } if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_3 && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC3=0x%x\n", err->err_misc_3); p += snprintf(p, end - p, "ERR_MISC3=0x%x ", err->err_misc_3); } if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_4 && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC4=0x%x\n", err->err_misc_4); p += snprintf(p, end - p, "ERR_MISC4=0x%x ", err->err_misc_4); } if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_ADDR && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_ADDR=0x%llx\n", (unsigned long long)err->err_addr); p += snprintf(p, end - p, "ERR_ADDR=0x%llx ", (unsigned long long)err->err_addr); } if (p > buf && p < end) { p--; *p = '\0'; } record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE1_FIELD_REGS_DUMP, 0, buf); step_vendor_data_tab(ev_decoder, "hip08_oem_type1_event_tab"); } static int add_hip08_oem_type1_table(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder) { #ifdef HAVE_SQLITE3 if (ras->record_events && !ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, &ev_decoder->stmt_dec_record, &hip08_oem_type1_event_tab) != SQLITE_OK) { log(TERM, LOG_WARNING, 
"Failed to create sql hip08_oem_type1_event_tab\n"); return -1; } } #endif return 0; } /* error data decoding functions */ static int decode_hip08_oem_type1_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { const struct hisi_oem_type1_err_sec *err = (struct hisi_oem_type1_err_sec *)event->error; if (err->val_bits == 0) { trace_seq_printf(s, "%s: no valid error information\n", __func__); return -1; } record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE1_FIELD_TIMESTAMP, 0, event->timestamp); trace_seq_printf(s, "\nHISI HIP08: OEM Type-1 Error\n"); decode_oem_type1_err_hdr(ev_decoder, s, err); decode_oem_type1_err_regs(ev_decoder, s, err); return 0; } static void decode_oem_type2_err_hdr(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct hisi_oem_type2_err_sec *err) { char buf[HISI_BUF_LEN]; char *p = buf; char *end = buf + HISI_BUF_LEN; p += snprintf(p, end - p, "[ table_version=%d ", err->version); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE2_FIELD_VERSION, err->version, NULL); if (err->val_bits & HISI_OEM_VALID_SOC_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "SOC_ID=%d ", err->soc_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE2_FIELD_SOC_ID, err->soc_id, NULL); } if (err->val_bits & HISI_OEM_VALID_SOCKET_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "socket_ID=%d ", err->socket_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE2_FIELD_SOCKET_ID, err->socket_id, NULL); } if (err->val_bits & HISI_OEM_VALID_NIMBUS_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "nimbus_ID=%d ", err->nimbus_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_OEM_TYPE2_FIELD_NIMBUS_ID, err->nimbus_id, NULL); } if (err->val_bits & HISI_OEM_VALID_MODULE_ID && IN_RANGE(p, buf, end)) { const char *str = oem_module_name(hisi_oem_type2_module, 
err->module_id); p += snprintf(p, end - p, "module=%s ", str); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE2_FIELD_MODULE_ID, 0, str); } if (err->val_bits & HISI_OEM_VALID_SUB_MODULE_ID && IN_RANGE(p, buf, end)) { const char *str = oem_submodule_name(hisi_oem_type2_module, err->module_id, err->sub_module_id); p += snprintf(p, end - p, "submodule=%s ", str); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE2_FIELD_SUB_MODULE_ID, 0, str); } if (err->val_bits & HISI_OEM_VALID_ERR_SEVERITY && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "error_severity=%s ", err_severity(err->err_severity)); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE2_FIELD_ERR_SEV, 0, err_severity(err->err_severity)); } if (IN_RANGE(p, buf, end)) p += snprintf(p, end - p, "]"); trace_seq_printf(s, "%s\n", buf); } static void decode_oem_type2_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct hisi_oem_type2_err_sec *err) { char buf[HISI_BUF_LEN]; char *p = buf; char *end = buf + HISI_BUF_LEN; trace_seq_printf(s, "Reg Dump:\n"); if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_FR) { trace_seq_printf(s, "ERR_FR_0=0x%x\n", err->err_fr_0); trace_seq_printf(s, "ERR_FR_1=0x%x\n", err->err_fr_1); p += snprintf(p, end - p, "ERR_FR_0=0x%x ERR_FR_1=0x%x ", err->err_fr_0, err->err_fr_1); } if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_CTRL && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_CTRL_0=0x%x\n", err->err_ctrl_0); trace_seq_printf(s, "ERR_CTRL_1=0x%x\n", err->err_ctrl_1); p += snprintf(p, end - p, "ERR_CTRL_0=0x%x ERR_CTRL_1=0x%x ", err->err_ctrl_0, err->err_ctrl_1); } if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_STATUS && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_STATUS_0=0x%x\n", err->err_status_0); trace_seq_printf(s, "ERR_STATUS_1=0x%x\n", err->err_status_1); p += snprintf(p, end - p, "ERR_STATUS_0=0x%x ERR_STATUS_1=0x%x ", err->err_status_0, err->err_status_1); } if (err->val_bits & 
HISI_OEM_TYPE2_VALID_ERR_ADDR && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_ADDR_0=0x%x\n", err->err_addr_0); trace_seq_printf(s, "ERR_ADDR_1=0x%x\n", err->err_addr_1); p += snprintf(p, end - p, "ERR_ADDR_0=0x%x ERR_ADDR_1=0x%x ", err->err_addr_0, err->err_addr_1); } if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_MISC_0 && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC0_0=0x%x\n", err->err_misc0_0); trace_seq_printf(s, "ERR_MISC0_1=0x%x\n", err->err_misc0_1); p += snprintf(p, end - p, "ERR_MISC0_0=0x%x ERR_MISC0_1=0x%x ", err->err_misc0_0, err->err_misc0_1); } if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_MISC_1 && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC1_0=0x%x\n", err->err_misc1_0); trace_seq_printf(s, "ERR_MISC1_1=0x%x\n", err->err_misc1_1); p += snprintf(p, end - p, "ERR_MISC1_0=0x%x ERR_MISC1_1=0x%x ", err->err_misc1_0, err->err_misc1_1); } if (p > buf && p < end) { p--; *p = '\0'; } record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE2_FIELD_REGS_DUMP, 0, buf); step_vendor_data_tab(ev_decoder, "hip08_oem_type2_event_tab"); } static int add_hip08_oem_type2_table(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder) { #ifdef HAVE_SQLITE3 if (ras->record_events && !ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, &ev_decoder->stmt_dec_record, &hip08_oem_type2_event_tab) != SQLITE_OK) { log(TERM, LOG_WARNING, "Failed to create sql hip08_oem_type2_event_tab\n"); return -1; } } #endif return 0; } static int decode_hip08_oem_type2_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { const struct hisi_oem_type2_err_sec *err = (struct hisi_oem_type2_err_sec *)event->error; if (err->val_bits == 0) { trace_seq_printf(s, "%s: no valid error information\n", __func__); return -1; } record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_OEM_TYPE2_FIELD_TIMESTAMP, 0, event->timestamp); trace_seq_printf(s, "\nHISI HIP08: OEM 
Type-2 Error\n"); decode_oem_type2_err_hdr(ev_decoder, s, err); decode_oem_type2_err_regs(ev_decoder, s, err); return 0; } static void decode_pcie_local_err_hdr(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct hisi_pcie_local_err_sec *err) { char buf[HISI_BUF_LEN]; char *p = buf; char *end = buf + HISI_BUF_LEN; p += snprintf(p, end - p, "[ table_version=%d ", err->version); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_VERSION, err->version, NULL); if (err->val_bits & HISI_PCIE_LOCAL_VALID_SOC_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "SOC_ID=%d ", err->soc_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_SOC_ID, err->soc_id, NULL); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_SOCKET_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "socket_ID=%d ", err->socket_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_SOCKET_ID, err->socket_id, NULL); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_NIMBUS_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "nimbus_ID=%d ", err->nimbus_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_NIMBUS_ID, err->nimbus_id, NULL); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_SUB_MODULE_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "submodule=%s ", pcie_local_sub_module_name(err->sub_module_id)); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_PCIE_LOCAL_FIELD_SUB_MODULE_ID, 0, pcie_local_sub_module_name(err->sub_module_id)); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_CORE_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "core_ID=core%d ", err->core_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_CORE_ID, err->core_id, NULL); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_PORT_ID && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "port_ID=port%d ", err->port_id); record_vendor_data(ev_decoder, 
HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_PORT_ID, err->port_id, NULL); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_ERR_SEVERITY && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "error_severity=%s ", err_severity(err->err_severity)); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_PCIE_LOCAL_FIELD_ERR_SEV, 0, err_severity(err->err_severity)); } if (err->val_bits & HISI_PCIE_LOCAL_VALID_ERR_TYPE && IN_RANGE(p, buf, end)) { p += snprintf(p, end - p, "error_type=0x%x ", err->err_type); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HIP08_PCIE_LOCAL_FIELD_ERR_TYPE, err->err_type, NULL); } if (IN_RANGE(p, buf, end)) p += snprintf(p, end - p, "]"); trace_seq_printf(s, "%s\n", buf); } static void decode_pcie_local_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct hisi_pcie_local_err_sec *err) { char buf[HISI_BUF_LEN]; char *p = buf; char *end = buf + HISI_BUF_LEN; uint32_t i; trace_seq_printf(s, "Reg Dump:\n"); for (i = 0; i < HISI_PCIE_LOCAL_ERR_MISC_MAX; i++) { if (err->val_bits & BIT(HISI_PCIE_LOCAL_VALID_ERR_MISC + i) && IN_RANGE(p, buf, end)) { trace_seq_printf(s, "ERR_MISC_%d=0x%x\n", i, err->err_misc[i]); p += snprintf(p, end - p, "ERR_MISC_%d=0x%x ", i, err->err_misc[i]); } } if (p > buf && p < end) { p--; *p = '\0'; } record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HIP08_PCIE_LOCAL_FIELD_REGS_DUMP, 0, buf); step_vendor_data_tab(ev_decoder, "hip08_pcie_local_event_tab"); } static int add_hip08_pcie_local_table(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder) { #ifdef HAVE_SQLITE3 if (ras->record_events && !ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, &ev_decoder->stmt_dec_record, &hip08_pcie_local_event_tab) != SQLITE_OK) { log(TERM, LOG_WARNING, "Failed to create sql hip08_pcie_local_event_tab\n"); return -1; } } #endif return 0; } static int decode_hip08_pcie_local_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct 
ras_non_standard_event *event)
{
	const struct hisi_pcie_local_err_sec *err =
		(struct hisi_pcie_local_err_sec *)event->error;

	if (err->val_bits == 0) {
		trace_seq_printf(s, "%s: no valid error information\n", __func__);
		return -1;
	}

	record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT,
			   HIP08_PCIE_LOCAL_FIELD_TIMESTAMP, 0, event->timestamp);

	trace_seq_printf(s, "\nHISI HIP08: PCIe local error\n");
	decode_pcie_local_err_hdr(ev_decoder, s, err);
	decode_pcie_local_err_regs(ev_decoder, s, err);

	return 0;
}

static struct ras_ns_ev_decoder hip08_ns_ev_decoder[] = {
	{
		.sec_type = "1f8161e1-55d6-41e6-bd10-7afd1dc5f7c5",
		.add_table = add_hip08_oem_type1_table,
		.decode = decode_hip08_oem_type1_error,
	},
	{
		.sec_type = "45534ea6-ce23-4115-8535-e07ab3aef91d",
		.add_table = add_hip08_oem_type2_table,
		.decode = decode_hip08_oem_type2_error,
	},
	{
		.sec_type = "b2889fc9-e7d7-4f9d-a867-af42e98be772",
		.add_table = add_hip08_pcie_local_table,
		.decode = decode_hip08_pcie_local_error,
	},
};

static void __attribute__((constructor)) hip08_init(void)
{
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(hip08_ns_ev_decoder); i++)
		register_ns_ev_decoder(&hip08_ns_ev_decoder[i]);
}
0707010000003E000081A400000000000000000000000165C04BE400002D24000000000000000000000000000000000000003800000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-hisilicon.c
/*
 * Copyright (c) 2020 Hisilicon Limited.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ras-record.h"
#include "ras-logger.h"
#include "ras-report.h"
#include "non-standard-hisilicon.h"

#define HISI_BUF_LEN		2048
#define HISI_PCIE_INFO_BUF_LEN	256

struct hisi_common_error_section {
	uint32_t val_bits;
	uint8_t version;
	uint8_t soc_id;
	uint8_t socket_id;
	uint8_t totem_id;
	uint8_t nimbus_id;
	uint8_t subsystem_id;
	uint8_t module_id;
	uint8_t submodule_id;
	uint8_t core_id;
	uint8_t port_id;
	uint16_t err_type;
	struct {
		uint8_t function;
		uint8_t device;
		uint16_t segment;
		uint8_t bus;
		uint8_t reserved[3];
	} pcie_info;
	uint8_t err_severity;
	uint8_t reserved[3];
	uint32_t reg_array_size;	/* treated as a byte count when dumped */
	uint32_t reg_array[];
};

/* Bit positions in val_bits, tested as BIT(HISI_COMMON_VALID_*) */
enum {
	HISI_COMMON_VALID_SOC_ID,
	HISI_COMMON_VALID_SOCKET_ID,
	HISI_COMMON_VALID_TOTEM_ID,
	HISI_COMMON_VALID_NIMBUS_ID,
	HISI_COMMON_VALID_SUBSYSTEM_ID,
	HISI_COMMON_VALID_MODULE_ID,
	HISI_COMMON_VALID_SUBMODULE_ID,
	HISI_COMMON_VALID_CORE_ID,
	HISI_COMMON_VALID_PORT_ID,
	HISI_COMMON_VALID_ERR_TYPE,
	HISI_COMMON_VALID_PCIE_INFO,
	HISI_COMMON_VALID_ERR_SEVERITY,
	HISI_COMMON_VALID_REG_ARRAY_SIZE,
};

/*
 * Column indices of hisi_common_section_tab. FIELD_ID (0) is the
 * INTEGER PRIMARY KEY; the others are passed to record_vendor_data(),
 * which uses them directly as 1-based SQLite bind-parameter indices.
 */
enum {
	HISI_COMMON_FIELD_ID,
	HISI_COMMON_FIELD_TIMESTAMP,
	HISI_COMMON_FIELD_VERSION,
	HISI_COMMON_FIELD_SOC_ID,
	HISI_COMMON_FIELD_SOCKET_ID,
	HISI_COMMON_FIELD_TOTEM_ID,
	HISI_COMMON_FIELD_NIMBUS_ID,
	HISI_COMMON_FIELD_SUB_SYSTEM_ID,
	HISI_COMMON_FIELD_MODULE_ID,
	HISI_COMMON_FIELD_SUB_MODULE_ID,
	HISI_COMMON_FIELD_CORE_ID,
	HISI_COMMON_FIELD_PORT_ID,
	HISI_COMMON_FIELD_ERR_TYPE,
	HISI_COMMON_FIELD_PCIE_INFO,
	HISI_COMMON_FIELD_ERR_SEVERITY,
	HISI_COMMON_FIELD_REGS_DUMP,
};

struct hisi_event {
	char error_msg[HISI_BUF_LEN];
	char pcie_info[HISI_PCIE_INFO_BUF_LEN];
	char reg_msg[HISI_BUF_LEN];
};

#ifdef HAVE_SQLITE3
void record_vendor_data(struct ras_ns_ev_decoder *ev_decoder,
			enum hisi_oem_data_type data_type,
			int id, int64_t data, const char *text)
{
	if (!ev_decoder->stmt_dec_record)
		return;

	switch (data_type) {
	case HISI_OEM_DATA_TYPE_INT:
		sqlite3_bind_int(ev_decoder->stmt_dec_record, id, data);
		break;
	case HISI_OEM_DATA_TYPE_INT64:
		sqlite3_bind_int64(ev_decoder->stmt_dec_record, id, data);
		break;
	case HISI_OEM_DATA_TYPE_TEXT:
		/*
		 * A NULL destructor means SQLITE_STATIC: SQLite does not copy
		 * the text, so it must stay valid until the statement is
		 * stepped in step_vendor_data_tab().
		 */
		sqlite3_bind_text(ev_decoder->stmt_dec_record, id, text, -1, NULL);
		break;
	}
}

/* Runs the prepared statement with the values bound so far, then resets it. */
int step_vendor_data_tab(struct ras_ns_ev_decoder *ev_decoder, const char *name)
{
	int rc;

	if (!ev_decoder->stmt_dec_record)
		return 0;

	rc = sqlite3_step(ev_decoder->stmt_dec_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR,
		    "Failed to do %s step on sqlite: error = %d\n", name, rc);

	rc = sqlite3_reset(ev_decoder->stmt_dec_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR,
		    "Failed to reset %s on sqlite: error = %d\n", name, rc);

	rc = sqlite3_clear_bindings(ev_decoder->stmt_dec_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR,
		    "Failed to clear bindings %s on sqlite: error = %d\n", name, rc);

	return rc;
}
#else
/* Without SQLite support, recording is a no-op. */
void record_vendor_data(struct ras_ns_ev_decoder *ev_decoder,
			enum hisi_oem_data_type data_type,
			int id, int64_t data, const char *text)
{
}

int step_vendor_data_tab(struct ras_ns_ev_decoder *ev_decoder, const char *name)
{
	return 0;
}
#endif

#ifdef HAVE_SQLITE3
static const struct db_fields hisi_common_section_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "version", .type = "INTEGER" },
	{ .name = "soc_id", .type = "INTEGER" },
	{ .name = "socket_id", .type = "INTEGER" },
	{ .name = "totem_id", .type = "INTEGER" },
	{ .name = "nimbus_id", .type = "INTEGER" },
	{ .name = "sub_system_id", .type = "INTEGER" },
	{ .name = "module_id", .type = "TEXT" },
	{ .name = "sub_module_id", .type = "INTEGER" },
	{ .name = "core_id", .type = "INTEGER" },
	{ .name = "port_id", .type = "INTEGER" },
	{ .name = "err_type", .type = "INTEGER" },
	{ .name = "pcie_info", .type = "TEXT" },
	{ .name = "err_severity", .type = "TEXT" },
	{ .name = "regs_dump", .type = "TEXT" },
};

static const struct db_table_descriptor hisi_common_section_tab = {
	.name = "hisi_common_section_v2",
.fields = hisi_common_section_fields, .num_fields = ARRAY_SIZE(hisi_common_section_fields), }; #endif static const char *soc_desc[] = { "Kunpeng916", "Kunpeng920", "Kunpeng930", }; static const char *module_name[] = { "MN", "PLL", "SLLC", "AA", "SIOE", "POE", "CPA", "DISP", "GIC", "ITS", "AVSBUS", "CS", "PPU", "SMMU", "PA", "HLLC", "DDRC", "L3TAG", "L3DATA", "PCS", "MATA", "PCIe Local", "SAS", "SATA", "NIC", "RoCE", "USB", "ZIP", "HPRE", "SEC", "RDE", "MEE", "L4D", "Tsensor", "ROH", "BTC", "HILINK", "STARS", "SDMA", "UC", "HBMC", }; static const char *get_soc_desc(uint8_t soc_id) { if (soc_id >= sizeof(soc_desc) / sizeof(char *)) return "unknown"; return soc_desc[soc_id]; } static void decode_module(struct ras_ns_ev_decoder *ev_decoder, struct hisi_event *event, uint8_t module_id) { if (module_id >= sizeof(module_name) / sizeof(char *)) { HISI_SNPRINTF(event->error_msg, "module=unknown(id=%hhu) ", module_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HISI_COMMON_FIELD_MODULE_ID, 0, "unknown"); } else { HISI_SNPRINTF(event->error_msg, "module=%s ", module_name[module_id]); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HISI_COMMON_FIELD_MODULE_ID, 0, module_name[module_id]); } } static void decode_hisi_common_section_hdr(struct ras_ns_ev_decoder *ev_decoder, const struct hisi_common_error_section *err, struct hisi_event *event) { HISI_SNPRINTF(event->error_msg, "[ table_version=%hhu", err->version); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_VERSION, err->version, NULL); if (err->val_bits & BIT(HISI_COMMON_VALID_SOC_ID)) { HISI_SNPRINTF(event->error_msg, "soc=%s", get_soc_desc(err->soc_id)); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_SOC_ID, err->soc_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_SOCKET_ID)) { HISI_SNPRINTF(event->error_msg, "socket_id=%hhu", err->socket_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_SOCKET_ID, err->socket_id, 
NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_TOTEM_ID)) { HISI_SNPRINTF(event->error_msg, "totem_id=%hhu", err->totem_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_TOTEM_ID, err->totem_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_NIMBUS_ID)) { HISI_SNPRINTF(event->error_msg, "nimbus_id=%hhu", err->nimbus_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_NIMBUS_ID, err->nimbus_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_SUBSYSTEM_ID)) { HISI_SNPRINTF(event->error_msg, "subsystem_id=%hhu", err->subsystem_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_SUB_SYSTEM_ID, err->subsystem_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_MODULE_ID)) decode_module(ev_decoder, event, err->module_id); if (err->val_bits & BIT(HISI_COMMON_VALID_SUBMODULE_ID)) { HISI_SNPRINTF(event->error_msg, "submodule_id=%hhu", err->submodule_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_SUB_MODULE_ID, err->submodule_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_CORE_ID)) { HISI_SNPRINTF(event->error_msg, "core_id=%hhu", err->core_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_CORE_ID, err->core_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_PORT_ID)) { HISI_SNPRINTF(event->error_msg, "port_id=%hhu", err->port_id); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_PORT_ID, err->port_id, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_ERR_TYPE)) { HISI_SNPRINTF(event->error_msg, "err_type=%hu", err->err_type); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_INT, HISI_COMMON_FIELD_ERR_TYPE, err->err_type, NULL); } if (err->val_bits & BIT(HISI_COMMON_VALID_PCIE_INFO)) { HISI_SNPRINTF(event->error_msg, "pcie_device_id=%04x:%02x:%02x.%x", err->pcie_info.segment, err->pcie_info.bus, err->pcie_info.device, err->pcie_info.function); HISI_SNPRINTF(event->pcie_info, 
"%04x:%02x:%02x.%x", err->pcie_info.segment, err->pcie_info.bus, err->pcie_info.device, err->pcie_info.function); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HISI_COMMON_FIELD_PCIE_INFO, 0, event->pcie_info); } if (err->val_bits & BIT(HISI_COMMON_VALID_ERR_SEVERITY)) { HISI_SNPRINTF(event->error_msg, "err_severity=%s", err_severity(err->err_severity)); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HISI_COMMON_FIELD_ERR_SEVERITY, 0, err_severity(err->err_severity)); } HISI_SNPRINTF(event->error_msg, "]"); } static int add_hisi_common_table(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder) { #ifdef HAVE_SQLITE3 if (ras->record_events && !ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, &ev_decoder->stmt_dec_record, &hisi_common_section_tab) != SQLITE_OK) { log(TERM, LOG_WARNING, "Failed to create sql hisi_common_section_tab\n"); return -1; } } #endif return 0; } static int decode_hisi_common_section(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { const struct hisi_common_error_section *err = (struct hisi_common_error_section *)event->error; struct hisi_event hevent; memset(&hevent, 0, sizeof(struct hisi_event)); trace_seq_printf(s, "\nHisilicon Common Error Section:\n"); decode_hisi_common_section_hdr(ev_decoder, err, &hevent); trace_seq_printf(s, "%s\n", hevent.error_msg); if (err->val_bits & BIT(HISI_COMMON_VALID_REG_ARRAY_SIZE) && err->reg_array_size > 0) { unsigned int i; trace_seq_printf(s, "Register Dump:\n"); for (i = 0; i < err->reg_array_size / sizeof(uint32_t); i++) { trace_seq_printf(s, "reg%02u=0x%08x\n", i, err->reg_array[i]); HISI_SNPRINTF(hevent.reg_msg, "reg%02u=0x%08x", i, err->reg_array[i]); } } if (ras->record_events) { record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HISI_COMMON_FIELD_TIMESTAMP, 0, event->timestamp); record_vendor_data(ev_decoder, HISI_OEM_DATA_TYPE_TEXT, HISI_COMMON_FIELD_REGS_DUMP, 0, 
				   hevent.reg_msg);
		step_vendor_data_tab(ev_decoder, "hisi_common_section_tab");
	}

	return 0;
}

static struct ras_ns_ev_decoder hisi_section_ns_ev_decoder[] = {
	{
		.sec_type = "c8b328a8-9917-4af6-9a13-2e08ab2e7586",
		.add_table = add_hisi_common_table,
		.decode = decode_hisi_common_section,
	},
};

static void __attribute__((constructor)) hisi_ns_init(void)
{
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(hisi_section_ns_ev_decoder); i++)
		register_ns_ev_decoder(&hisi_section_ns_ev_decoder[i]);
}
0707010000003F000081A400000000000000000000000165C04BE4000004FE000000000000000000000000000000000000003800000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-hisilicon.h
/*
 * Copyright (c) 2020 Hisilicon Limited.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 */

#ifndef __NON_STANDARD_HISILICON_H
#define __NON_STANDARD_HISILICON_H

#include "ras-non-standard-handler.h"
#include "ras-mc-handler.h"

#define HISI_SNPRINTF	mce_snprintf

#define HISI_ERR_SEVERITY_NFE	0
#define HISI_ERR_SEVERITY_FE	1
#define HISI_ERR_SEVERITY_CE	2
#define HISI_ERR_SEVERITY_NONE	3

enum hisi_oem_data_type {
	HISI_OEM_DATA_TYPE_INT,
	HISI_OEM_DATA_TYPE_INT64,
	HISI_OEM_DATA_TYPE_TEXT,
};

/* helper functions */
static inline char *err_severity(uint8_t err_sev)
{
	switch (err_sev) {
	case HISI_ERR_SEVERITY_NFE:
		return "recoverable";
	case HISI_ERR_SEVERITY_FE:
		return "fatal";
	case HISI_ERR_SEVERITY_CE:
		return "corrected";
	case HISI_ERR_SEVERITY_NONE:
		return "none";
	default:
		break;
	}
	return "unknown";
}

void record_vendor_data(struct ras_ns_ev_decoder *ev_decoder,
			enum hisi_oem_data_type data_type,
			int id, int64_t data, const char *text);
int step_vendor_data_tab(struct ras_ns_ev_decoder *ev_decoder, const char *name);

#endif
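Both `hip08_init()` and `hisi_ns_init()` above use a GCC/Clang `constructor` function to register a static table of decoders, each keyed by a CPER section-type GUID string (`.sec_type`); the lookup that later dispatches an event to the right `.decode` lives elsewhere in rasdaemon and is not shown here. The following is a minimal sketch of that register-then-look-up pattern, assuming a fixed-size array and an exact, case-sensitive string comparison of GUIDs. `struct ns_decoder`, `register_decoder`, `find_decoder` and `MAX_DECODERS` are hypothetical simplifications, not the real `register_ns_ev_decoder()` API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified analogue of struct ras_ns_ev_decoder. */
struct ns_decoder {
	const char *sec_type;			/* section-type GUID string */
	int (*decode)(const void *err);		/* returns 0 on success */
};

#define MAX_DECODERS 16

static const struct ns_decoder *registry[MAX_DECODERS];
static size_t n_registered;

/* Adds a decoder; fails once the fixed-size table is full. */
static int register_decoder(const struct ns_decoder *d)
{
	if (n_registered >= MAX_DECODERS)
		return -1;
	registry[n_registered++] = d;
	return 0;
}

/* Linear search by GUID string (assumes an exact, case-sensitive match). */
static const struct ns_decoder *find_decoder(const char *sec_type)
{
	size_t i;

	for (i = 0; i < n_registered; i++)
		if (strcmp(registry[i]->sec_type, sec_type) == 0)
			return registry[i];
	return NULL;
}

static int decode_common(const void *err)
{
	(void)err;	/* a real decoder would parse the error section here */
	return 0;
}

static const struct ns_decoder example_decoders[] = {
	{ .sec_type = "c8b328a8-9917-4af6-9a13-2e08ab2e7586",
	  .decode = decode_common },
};

/* GCC/Clang extension: runs before main(), so no explicit init call is needed. */
static void __attribute__((constructor)) example_init(void)
{
	size_t i;

	for (i = 0; i < sizeof(example_decoders) / sizeof(example_decoders[0]); i++)
		register_decoder(&example_decoders[i]);
}
```

Because registration happens in a constructor, adding a vendor decoder only requires linking its object file; nothing in `main()` has to know about it. The trade-off is that registration order across translation units is unspecified, which is harmless here since lookup is by key.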
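`decode_hisi_common_section()` in non-standard-hisilicon.c walks `reg_array`, a flexible array member, and treats `reg_array_size` as a size in bytes: it loops to `reg_array_size / sizeof(uint32_t)` and prints each entry as `reg%02u=0x%08x`. The sketch below mirrors that interpretation on a simplified struct. `struct reg_dump`, `reg_dump_new` and `reg_dump_format` are hypothetical names, and, as a design choice of this sketch only, an entry that would not fit in the output buffer is dropped whole rather than kept partially written.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the tail of struct hisi_common_error_section. */
struct reg_dump {
	uint32_t reg_array_size;	/* size of reg_array in BYTES */
	uint32_t reg_array[];		/* flexible array member */
};

/* Allocates a reg_dump holding `count` registers copied from `vals`. */
static struct reg_dump *reg_dump_new(const uint32_t *vals, uint32_t count)
{
	struct reg_dump *d = malloc(sizeof(*d) + count * sizeof(uint32_t));

	if (!d)
		return NULL;
	d->reg_array_size = count * (uint32_t)sizeof(uint32_t);
	memcpy(d->reg_array, vals, count * sizeof(uint32_t));
	return d;
}

/*
 * Formats "reg00=0x... reg01=0x..." into out (len > 0 assumed).
 * Returns how many registers were written; stops before overflowing.
 */
static unsigned int reg_dump_format(const struct reg_dump *d, char *out, size_t len)
{
	size_t off = 0;
	unsigned int i;
	unsigned int n = (unsigned int)(d->reg_array_size / sizeof(uint32_t));

	out[0] = '\0';
	for (i = 0; i < n; i++) {
		int w = snprintf(out + off, len - off, "%sreg%02u=0x%08x",
				 i ? " " : "", i, (unsigned int)d->reg_array[i]);

		if (w < 0 || (size_t)w >= len - off) {
			out[off] = '\0';	/* discard the partial entry */
			break;
		}
		off += (size_t)w;
	}
	return i;
}
```

For three registers `{ 0x1, 0xdeadbeef, 0x20 }`, `reg_array_size` is 12 and a large buffer receives `"reg00=0x00000001 reg01=0xdeadbeef reg02=0x00000020"`; a 20-byte buffer holds only the first entry.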
07070100000040000081A400000000000000000000000165C04BE400008175000000000000000000000000000000000000003A00000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-jaguarmicro.c
/*
 * Copyright (c) 2023, JaguarMicro
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include "ras-record.h"
#include "ras-logger.h"
#include "ras-report.h"
#include "ras-non-standard-handler.h"
#include "non-standard-jaguarmicro.h"
#include "ras-mce-handler.h"

#define JM_BUF_LEN	256
#define JM_REG_BUF_LEN	2048
#define JM_SNPRINTF	mce_snprintf

static void record_jm_data(struct ras_ns_ev_decoder *ev_decoder,
			   enum jm_oem_data_type data_type,
			   int id, int64_t data, const char *text);

struct jm_event {
	char error_msg[JM_BUF_LEN];
	char reg_msg[JM_REG_BUF_LEN];
};

/* ras_csr_por Payload Type 0 */
static const char * const disp_payload0_err_reg_name[] = {
	"LOCK_CONTROL:",
	"LOCK_FUNCTION:",
	"CFG_RAM_ID:",
	"ERR_FR_LOW32:",
	"ERR_FR_HIGH32:",
	"ERR_CTLR_LOW32:",
	"ECC_STATUS_LOW32:",
	"ECC_ADDR_LOW32:",
	"ECC_ADDR_HIGH32:",
	"ECC_MISC0_LOW32:",
	"ECC_MISC0_HIGH32:",
	"ECC_MISC1_LOW32:",
	"ECC_MISC1_HIGH32:",
	"ECC_MISC2_LOW32:",
	"ECC_MISC2_HIGH32:",
};

/* SMMU IP Payload Type 1 */
static const char * const disp_payload1_err_reg_name[] = {
	"CSR_INT_STATUS:",
	"ERR_FR:",
	"ERR_CTLR:",
	"ERR_STATUS:",
	"ERR_GEN:",
};

/* HAC SRAM, Payload Type 2 */
static const char * const disp_payload2_err_reg_name[] = {
	"ECC_1BIT_INFO_LOW32:",
	"ECC_1BIT_INFO_HIGH32:",
	"ECC_2BIT_INFO_LOW32:",
	"ECC_2BIT_INFO_HIGH32:",
};

/* CMN IP, Payload Type 5 */
static const char * const disp_payload5_err_reg_name[] = {
	"CFGM_MXP_0:",
	"CFGM_HNF_0:",
	"CFGM_HNI_0:",
	"CFGM_SBSX_0:",
	"ERR_FR_NS:",
	"ERR_CTLRR_NS:",
	"ERR_STATUSR_NS:",
	"ERR_ADDRR_NS:",
	"ERR_MISCR_NS:",
	"ERR_FR:",
"ERR_CTLR:", "ERR_STATUS:", "ERR_ADDR:", "ERR_MISC:", }; /*GIC IP, Payload Type 6 */ static const char * const disp_payload6_err_reg_name[] = { "RECORD_ID:", "GICT_ERR_FR:", "GICT_ERR_CTLR:", "GICT_ERR_STATUS:", "GICT_ERR_ADDR:", "GICT_ERR_MISC0:", "GICT_ERR_MISC1:", "GICT_ERRGSR:", }; static const char * const soc_desc[] = { "Corsica1.0", }; /* JaguarMicro sub system definitions */ #define JM_SUB_SYS_CSUB 0 #define JM_SUB_SYS_CMN 1 #define JM_SUB_SYS_DDRH 2 #define JM_SUB_SYS_DDRV 3 #define JM_SUB_SYS_GIC 4 #define JM_SUB_SYS_IOSUB 5 #define JM_SUB_SYS_SCP 6 #define JM_SUB_SYS_MCP 7 #define JM_SUB_SYS_IMU0 8 #define JM_SUB_SYS_DPE 9 #define JM_SUB_SYS_RPE 10 #define JM_SUB_SYS_PSUB 11 #define JM_SUB_SYS_HAC 12 #define JM_SUB_SYS_TCM 13 #define JM_SUB_SYS_IMU1 14 static const char * const subsystem_desc[] = { "N2", "CMN", "DDRH", "DDRV", "GIC", "IOSUB", "SCP", "MCP", "IMU0", "DPE", "RPE", "PSUB", "HAC", "TCM", "IMU1", }; static const char * const cmn_module_desc[] = { "MXP", "HNI", "HNF", "SBSX", "CCG", "HND", }; static const char * const ddr_module_desc[] = { "DDRCtrl", "DDRPHY", "SRAM", }; static const char * const gic_module_desc[] = { "GICIP", "GICSRAM", }; /* JaguarMicro IOSUB sub system module definitions */ #define JM_SUBSYS_IOSUB_MOD_SMMU 0 #define JM_SUBSYS_IOSUB_MOD_NIC450 1 #define JM_SUBSYS_IOSUB_MOD_OTHER 2 static const char * const iosub_module_desc[] = { "SMMU", "NIC450", "OTHER", }; static const char * const scp_module_desc[] = { "SRAM", "WDT", "PLL", }; static const char * const mcp_module_desc[] = { "SRAM", "WDT", }; static const char * const imu_module_desc[] = { "SRAM", "WDT", }; /* JaguarMicro DPE sub system module definitions */ #define JM_SUBSYS_DPE_MOD_EPG 0 #define JM_SUBSYS_DPE_MOD_PIPE 1 #define JM_SUBSYS_DPE_MOD_EMEP 2 #define JM_SUBSYS_DPE_MOD_IMEP 3 #define JM_SUBSYS_DPE_MOD_EPAE 4 #define JM_SUBSYS_DPE_MOD_IPAE 5 #define JM_SUBSYS_DPE_MOD_ETH 6 #define JM_SUBSYS_DPE_MOD_TPG 7 #define JM_SUBSYS_DPE_MOD_MIG 8 #define 
JM_SUBSYS_DPE_MOD_HIG 9 #define JM_SUBSYS_DPE_MOD_DPETOP 10 #define JM_SUBSYS_DPE_MOD_SMMU 11 static const char * const dpe_module_desc[] = { "EPG", "PIPE", "EMEP", "IMEP", "EPAE", "IPAE", "ETH", "TPG", "MIG", "HIG", "DPETOP", "SMMU", }; /* JaguarMicro RPE sub system module definitions */ #define JM_SUBSYS_RPE_MOD_TOP 0 #define JM_SUBSYS_RPE_MOD_TXP_RXP 1 #define JM_SUBSYS_RPE_MOD_SMMU 2 static const char * const rpe_module_desc[] = { "TOP", "TXP_RXP", "SMMU", }; /* JaguarMicro PSUB sub system module definitions */ #define JM_SUBSYS_PSUB_MOD_PCIE0 0 #define JM_SUBSYS_PSUB_MOD_UP_MIX 1 #define JM_SUBSYS_PSUB_MOD_PCIE1 2 #define JM_SUBSYS_PSUB_MOD_PTOP 3 #define JM_SUBSYS_PSUB_MOD_N2IF 4 #define JM_SUBSYS_PSUB_MOD_VPE0_RAS 5 #define JM_SUBSYS_PSUB_MOD_VPE1_RAS 6 #define JM_SUBSYS_PSUB_MOD_X2RC_SMMU 7 #define JM_SUBSYS_PSUB_MOD_X16RC_SMMU 8 #define JM_SUBSYS_PSUB_MOD_SDMA_SMMU 9 static const char * const psub_module_desc[] = { "PCIE0", "UP_MIX", "PCIE1", "PTOP", "N2IF", "VPE0_RAS", "VPE1_RAS", "X2RC_SMMU", "X16RC_SMMU", "SDMA_SMMU", }; static const char * const hac_module_desc[] = { "SRAM", "SMMU", }; #define JM_SUBSYS_TCM_MOD_SRAM 0 #define JM_SUBSYS_TCM_MOD_SMMU 1 #define JM_SUBSYS_TCM_MOD_IP 2 static const char * const tcm_module_desc[] = { "SRAM", "SMMU", "IP", }; static const char * const iosub_smmu_sub_desc[] = { "TBU", "TCU", }; static const char * const iosub_other_sub_desc[] = { "RAM", }; static const char * const smmu_sub_desc[] = { "TCU", "TBU", }; static const char * const psub_pcie0_sub_desc[] = { "RAS0", "RAS1", }; static const char * const csub_dev_desc[] = { "CORE", }; static const char * const cmn_dev_desc[] = { "NID", }; static const char * const ddr_dev_desc[] = { "CHNL", }; static const char * const default_dev_desc[] = { "DEV", }; static const char *get_jm_soc_desc(uint8_t soc_id) { if (soc_id >= sizeof(soc_desc) / sizeof(char *)) return "unknown"; return soc_desc[soc_id]; } static const char *get_jm_subsystem_desc(uint8_t subsys_id) { if 
(subsys_id >= sizeof(subsystem_desc) / sizeof(char *)) return "unknown"; return subsystem_desc[subsys_id]; } static const char *get_jm_module_desc(uint8_t subsys_id, uint8_t mod_id) { const char * const*module; int tbl_size; switch (subsys_id) { case JM_SUB_SYS_CMN: module = cmn_module_desc; tbl_size = sizeof(cmn_module_desc) / sizeof(char *); break; case JM_SUB_SYS_DDRH: case JM_SUB_SYS_DDRV: module = ddr_module_desc; tbl_size = sizeof(ddr_module_desc) / sizeof(char *); break; case JM_SUB_SYS_GIC: module = gic_module_desc; tbl_size = sizeof(gic_module_desc) / sizeof(char *); break; case JM_SUB_SYS_IOSUB: module = iosub_module_desc; tbl_size = sizeof(iosub_module_desc) / sizeof(char *); break; case JM_SUB_SYS_SCP: module = scp_module_desc; tbl_size = sizeof(scp_module_desc) / sizeof(char *); break; case JM_SUB_SYS_MCP: module = mcp_module_desc; tbl_size = sizeof(mcp_module_desc) / sizeof(char *); break; case JM_SUB_SYS_IMU0: case JM_SUB_SYS_IMU1: module = imu_module_desc; tbl_size = sizeof(imu_module_desc) / sizeof(char *); break; case JM_SUB_SYS_DPE: module = dpe_module_desc; tbl_size = sizeof(dpe_module_desc) / sizeof(char *); break; case JM_SUB_SYS_RPE: module = rpe_module_desc; tbl_size = sizeof(rpe_module_desc) / sizeof(char *); break; case JM_SUB_SYS_PSUB: module = psub_module_desc; tbl_size = sizeof(psub_module_desc) / sizeof(char *); break; case JM_SUB_SYS_HAC: module = hac_module_desc; tbl_size = sizeof(hac_module_desc) / sizeof(char *); break; case JM_SUB_SYS_TCM: module = tcm_module_desc; tbl_size = sizeof(tcm_module_desc) / sizeof(char *); break; default: module = NULL; break; } if ((!module) || (mod_id >= tbl_size)) return "unknown"; return module[mod_id]; } static const char *get_jm_submod_desc(uint8_t subsys_id, uint8_t mod_id, uint8_t sub_id) { const char * const*sub_module; int tbl_size; if (subsys_id == JM_SUB_SYS_IOSUB && mod_id == JM_SUBSYS_IOSUB_MOD_SMMU) { sub_module = iosub_smmu_sub_desc; tbl_size = sizeof(iosub_smmu_sub_desc) / sizeof(char 
*); } else if (subsys_id == JM_SUB_SYS_IOSUB && mod_id == JM_SUBSYS_IOSUB_MOD_OTHER) { sub_module = iosub_other_sub_desc; tbl_size = sizeof(iosub_other_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_DPE && mod_id == JM_SUBSYS_DPE_MOD_SMMU) { sub_module = smmu_sub_desc; tbl_size = sizeof(smmu_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_RPE && mod_id == JM_SUBSYS_RPE_MOD_SMMU) { sub_module = smmu_sub_desc; tbl_size = sizeof(smmu_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_PSUB && mod_id == JM_SUBSYS_PSUB_MOD_PCIE0) { sub_module = psub_pcie0_sub_desc; tbl_size = sizeof(psub_pcie0_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_PSUB && mod_id == JM_SUBSYS_PSUB_MOD_X2RC_SMMU) { sub_module = smmu_sub_desc; tbl_size = sizeof(smmu_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_PSUB && mod_id == JM_SUBSYS_PSUB_MOD_X16RC_SMMU) { sub_module = smmu_sub_desc; tbl_size = sizeof(smmu_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_PSUB && mod_id == JM_SUBSYS_PSUB_MOD_SDMA_SMMU) { sub_module = smmu_sub_desc; tbl_size = sizeof(smmu_sub_desc) / sizeof(char *); } else if (subsys_id == JM_SUB_SYS_TCM && mod_id == JM_SUBSYS_TCM_MOD_SMMU) { sub_module = smmu_sub_desc; tbl_size = sizeof(smmu_sub_desc) / sizeof(char *); } else { sub_module = NULL; tbl_size = 0; } if ((!sub_module) || (sub_id >= tbl_size)) return "unknown"; return sub_module[sub_id]; } static const char *get_jm_dev_desc(uint8_t subsys_id, uint8_t mod_id, uint8_t sub_id) { if (subsys_id == JM_SUB_SYS_CSUB) return csub_dev_desc[0]; else if (subsys_id == JM_SUB_SYS_DDRH || subsys_id == JM_SUB_SYS_DDRV) return ddr_dev_desc[0]; else if (subsys_id == JM_SUB_SYS_CMN) return cmn_dev_desc[0]; else return default_dev_desc[0]; } #define JM_ERR_SEVERITY_NFE 0 #define JM_ERR_SEVERITY_FE 1 #define JM_ERR_SEVERITY_CE 2 #define JM_ERR_SEVERITY_NONE 3 /* helper functions */ static inline char *jm_err_severity(uint8_t err_sev) { switch 
(err_sev) { case JM_ERR_SEVERITY_NFE: return "recoverable"; case JM_ERR_SEVERITY_FE: return "fatal"; case JM_ERR_SEVERITY_CE: return "corrected"; case JM_ERR_SEVERITY_NONE: return "none"; default: break; } return "unknown"; } static void decode_jm_common_sec_head(struct ras_ns_ev_decoder *ev_decoder, const struct jm_common_sec_head *err, struct jm_event *event) { /* open the bracket unconditionally so it always pairs with the closing "]" */ JM_SNPRINTF(event->error_msg, "["); if (err->val_bits & BIT(JM_COMMON_VALID_VERSION)) { JM_SNPRINTF(event->error_msg, " table_version=%hhu decode_version:%hhu", err->version, PAYLOAD_VERSION); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_INT, JM_PAYLOAD_FIELD_VERSION, err->version, NULL); } if (err->val_bits & BIT(JM_COMMON_VALID_SOC_ID)) { JM_SNPRINTF(event->error_msg, " soc=%s", get_jm_soc_desc(err->soc_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_INT, JM_PAYLOAD_FIELD_SOC_ID, err->soc_id, NULL); } if (err->val_bits & BIT(JM_COMMON_VALID_SUBSYSTEM_ID)) { JM_SNPRINTF(event->error_msg, " sub system=%s", get_jm_subsystem_desc(err->subsystem_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, JM_PAYLOAD_FIELD_SUB_SYS, 0, get_jm_subsystem_desc(err->subsystem_id)); } if (err->val_bits & BIT(JM_COMMON_VALID_MODULE_ID)) { JM_SNPRINTF(event->error_msg, " module=%s", get_jm_module_desc(err->subsystem_id, err->module_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, JM_PAYLOAD_FIELD_MODULE, 0, get_jm_module_desc(err->subsystem_id, err->module_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_INT, JM_PAYLOAD_FIELD_MODULE_ID, err->module_id, NULL); } if (err->val_bits & BIT(JM_COMMON_VALID_SUBMODULE_ID)) { JM_SNPRINTF(event->error_msg, " sub module=%s", get_jm_submod_desc(err->subsystem_id, err->module_id, err->submodule_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, JM_PAYLOAD_FIELD_SUB_MODULE, 0, get_jm_submod_desc(err->subsystem_id, err->module_id, err->submodule_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_INT, JM_PAYLOAD_FIELD_SUBMODULE_ID, err->submodule_id, NULL); } if (err->val_bits & BIT(JM_COMMON_VALID_DEV_ID)) { 
JM_SNPRINTF(event->error_msg, " dev=%s", get_jm_dev_desc(err->subsystem_id, err->module_id, err->submodule_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, JM_PAYLOAD_FIELD_DEV, 0, get_jm_dev_desc(err->subsystem_id, err->module_id, err->submodule_id)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_INT, JM_PAYLOAD_FIELD_DEV_ID, err->dev_id, NULL); } if (err->val_bits & BIT(JM_COMMON_VALID_ERR_TYPE)) { JM_SNPRINTF(event->error_msg, " err_type=%hu", err->err_type); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_INT, JM_PAYLOAD_FIELD_ERR_TYPE, err->err_type, NULL); } if (err->val_bits & BIT(JM_COMMON_VALID_ERR_SEVERITY)) { JM_SNPRINTF(event->error_msg, " err_severity=%s", jm_err_severity(err->err_severity)); record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, JM_PAYLOAD_FIELD_ERR_SEVERITY, 0, jm_err_severity(err->err_severity)); } JM_SNPRINTF(event->error_msg, "]"); } static void decode_jm_common_sec_tail(struct ras_ns_ev_decoder *ev_decoder, const struct jm_common_sec_tail *err, struct jm_event *event, uint32_t val_bits) { if (val_bits & BIT(JM_COMMON_VALID_REG_ARRAY_SIZE) && err->reg_array_size > 0) { uint32_t i; JM_SNPRINTF(event->reg_msg, "Extended Register Dump:"); /* separate the entries so they do not run together in the log and the DB */ for (i = 0; i < err->reg_array_size; i++) { JM_SNPRINTF(event->reg_msg, " reg%02u=0x%08x", i, err->reg_array[i]); } } } #ifdef HAVE_SQLITE3 /*column definitions for the JaguarMicro specific error payload table (shared by all payload types)*/ static const struct db_fields jm_payload0_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "version", .type = "INTEGER" }, { .name = "soc_id", .type = "INTEGER" }, { .name = "subsystem", .type = "TEXT" }, { .name = "module", .type = "TEXT" }, { .name = "module_id", .type = "INTEGER" }, { .name = "sub_module", .type = "TEXT" }, { .name = "submodule_id", .type = "INTEGER" }, { .name = "dev", .type = "TEXT" }, { .name = "dev_id", .type = "INTEGER" }, { .name = "err_type", .type = "INTEGER" }, { .name = "err_severity", .type = "TEXT" }, 
{ .name = "regs_dump", .type = "TEXT" }, }; static const struct db_table_descriptor jm_payload0_event_tab = { .name = "jm_payload0_event", .fields = jm_payload0_event_fields, .num_fields = ARRAY_SIZE(jm_payload0_event_fields), }; /*bind a value of the given data type to parameter index "id" of the pending insert statement*/ static void record_jm_data(struct ras_ns_ev_decoder *ev_decoder, enum jm_oem_data_type data_type, int id, int64_t data, const char *text) { /* nothing to bind when event recording is disabled */ if (!ev_decoder || !ev_decoder->stmt_dec_record) return; switch (data_type) { case JM_OEM_DATA_TYPE_INT: sqlite3_bind_int(ev_decoder->stmt_dec_record, id, data); break; case JM_OEM_DATA_TYPE_INT64: sqlite3_bind_int64(ev_decoder->stmt_dec_record, id, data); break; case JM_OEM_DATA_TYPE_TEXT: sqlite3_bind_text(ev_decoder->stmt_dec_record, id, text, -1, NULL); break; default: break; } } static int store_jm_err_data(struct ras_ns_ev_decoder *ev_decoder, const char *tab_name) { int rc; rc = sqlite3_step(ev_decoder->stmt_dec_record); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to do step on sqlite. Table = %s error = %d\n", tab_name, rc); rc = sqlite3_reset(ev_decoder->stmt_dec_record); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to reset on sqlite. Table = %s error = %d\n", tab_name, rc); rc = sqlite3_clear_bindings(ev_decoder->stmt_dec_record); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to clear bindings on sqlite. 
Table = %s error = %d\n", tab_name, rc); return rc; } /*store one JaguarMicro specific error payload record (any payload type) in the jm_payload0_event sqlite3 table*/ static void record_jm_payload_err(struct ras_ns_ev_decoder *ev_decoder, const char *reg_str) { /* the insert statement only exists when event recording is enabled; stepping or clearing a NULL statement must be avoided */ if (ev_decoder && ev_decoder->stmt_dec_record) { record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, JM_PAYLOAD_FIELD_REGS_DUMP, 0, reg_str); store_jm_err_data(ev_decoder, jm_payload0_event_tab.name); } } #else static void record_jm_data(struct ras_ns_ev_decoder *ev_decoder, enum jm_oem_data_type data_type, int id, int64_t data, const char *text) { } static void record_jm_payload_err(struct ras_ns_ev_decoder *ev_decoder, const char *reg_str) { } #endif /*decode JaguarMicro specific error payload type 0: the CPU's data is saved to sqlite by ras-arm-handler, the others are saved by this function.*/ static void decode_jm_payload0_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct jm_payload0_type_sec *err) { int i = 0; const struct jm_common_sec_head *common_head = &err->common_head; const struct jm_common_sec_tail *common_tail = &err->common_tail; struct jm_event jmevent; memset(&jmevent, 0, sizeof(struct jm_event)); trace_seq_printf(s, "\nJaguarMicro Common Error Section:\n"); decode_jm_common_sec_head(ev_decoder, common_head, &jmevent); trace_seq_printf(s, "%s\n", jmevent.error_msg); //display lock_control JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->lock_control); //display lock_function JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->lock_function); //display cfg_ram_id JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->cfg_ram_id); //display err_fr_low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->err_fr_low32); //display err_fr_high32 JM_SNPRINTF(jmevent.reg_msg, " %s", 
disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->err_fr_high32); //display err_ctlr_low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->err_ctlr_low32); //display ecc_status_low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_status_low32); //display ecc_addr_low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_addr_low32); //display ecc_addr_high32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_addr_high32); //display ecc_misc0_low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_misc0_low32); //display ecc_misc0_high32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_misc0_high32); //display ecc_misc1_low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_misc1_low32); //display ecc_misc1_high32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_misc1_high32); //display ecc_misc2_Low32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_misc2_Low32); //display ecc_misc2_high32 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload0_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x\n", err->ecc_misc2_high32); trace_seq_printf(s, "Register Dump:\n"); decode_jm_common_sec_tail(ev_decoder, common_tail, &jmevent, common_head->val_bits); record_jm_payload_err(ev_decoder, jmevent.reg_msg); trace_seq_printf(s, "%s\n", jmevent.reg_msg); } /*decode JaguarMicro specific error payload type 1 and save to sqlite db*/ static 
void decode_jm_payload1_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct jm_payload1_type_sec *err) { int i = 0; const struct jm_common_sec_head *common_head = &err->common_head; const struct jm_common_sec_tail *common_tail = &err->common_tail; struct jm_event jmevent; memset(&jmevent, 0, sizeof(struct jm_event)); trace_seq_printf(s, "\nJaguarMicro Common Error Section:\n"); decode_jm_common_sec_head(ev_decoder, common_head, &jmevent); trace_seq_printf(s, "%s\n", jmevent.error_msg); //display smmu csr (interrupt status) JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload1_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->smmu_csr); //display ERRFR JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload1_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->errfr); //display ERRCTLR JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload1_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->errctlr); //display ERRSTATUS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload1_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->errstatus); //display ERRGEN JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload1_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x\n", err->errgen); trace_seq_printf(s, "Register Dump:\n"); decode_jm_common_sec_tail(ev_decoder, common_tail, &jmevent, common_head->val_bits); record_jm_payload_err(ev_decoder, jmevent.reg_msg); trace_seq_printf(s, "%s\n", jmevent.reg_msg); } /*decode JaguarMicro specific error payload type 2 and save to sqlite db*/ static void decode_jm_payload2_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct jm_payload2_type_sec *err) { int i = 0; const struct jm_common_sec_head *common_head = &err->common_head; const struct jm_common_sec_tail *common_tail = &err->common_tail; struct jm_event jmevent; memset(&jmevent, 0, sizeof(struct jm_event)); trace_seq_printf(s, "\nJaguarMicro Common Error Section:\n"); 
decode_jm_common_sec_head(ev_decoder, common_head, &jmevent); trace_seq_printf(s, "%s\n", jmevent.error_msg); //display ecc_1bit_error_interrupt_low JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload2_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_1bit_int_low); //display ecc_1bit_error_interrupt_high JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload2_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_1bit_int_high); //display ecc_2bit_error_interrupt_low JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload2_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x; ", err->ecc_2bit_int_low); //display ecc_2bit_error_interrupt_high JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload2_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%x\n", err->ecc_2bit_int_high); trace_seq_printf(s, "Register Dump:\n"); decode_jm_common_sec_tail(ev_decoder, common_tail, &jmevent, common_head->val_bits); record_jm_payload_err(ev_decoder, jmevent.reg_msg); trace_seq_printf(s, "%s\n", jmevent.reg_msg); } /*decode JaguarMicro specific error payload type 5 and save to sqlite db*/ static void decode_jm_payload5_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct jm_payload5_type_sec *err) { int i = 0; const struct jm_common_sec_head *common_head = &err->common_head; const struct jm_common_sec_tail *common_tail = &err->common_tail; struct jm_event jmevent; memset(&jmevent, 0, sizeof(struct jm_event)); trace_seq_printf(s, "\nJaguarMicro Common Error Section:\n"); decode_jm_common_sec_head(ev_decoder, common_head, &jmevent); trace_seq_printf(s, "%s\n", jmevent.error_msg); //display cfgm_mxp_0 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->cfgm_mxp_0); //display cfgm_hnf_0 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->cfgm_hnf_0); //display cfgm_hni_0 
JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->cfgm_hni_0); //display cfgm_sbsx_0 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->cfgm_sbsx_0); //display errfr_NS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errfr_NS); //display errctlrr_NS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errctlrr_NS); //display errstatusr_NS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errstatusr_NS); //display erraddrr_NS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->erraddrr_NS); //display errmiscr_NS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errmiscr_NS); //display errfr JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errfr); //display errctlr JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errctlr); //display errstatus JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->errstatus); //display erraddr JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->erraddr); //display errmisc JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload5_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx\n", (unsigned long 
long)err->errmisc); trace_seq_printf(s, "Register Dump:\n"); decode_jm_common_sec_tail(ev_decoder, common_tail, &jmevent, common_head->val_bits); record_jm_payload_err(ev_decoder, jmevent.reg_msg); trace_seq_printf(s, "%s\n", jmevent.reg_msg); } /*decode JaguarMicro specific error payload type 6 and save to sqlite db*/ static void decode_jm_payload6_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct jm_payload6_type_sec *err) { int i = 0; const struct jm_common_sec_head *common_head = &err->common_head; const struct jm_common_sec_tail *common_tail = &err->common_tail; struct jm_event jmevent; memset(&jmevent, 0, sizeof(struct jm_event)); trace_seq_printf(s, "\nJaguarMicro Common Error Section:\n"); decode_jm_common_sec_head(ev_decoder, common_head, &jmevent); trace_seq_printf(s, "%s\n", jmevent.error_msg); //display RECORD_ID JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->record_id); //display GICT_ERR_FR JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->gict_err_fr); //display GICT_ERR_CTLR JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->gict_err_ctlr); //display GICT_ERR_STATUS JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->gict_err_status); //display GICT_ERR_ADDR JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->gict_err_addr); //display GICT_ERR_MISC0 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->gict_err_misc0); //display GICT_ERR_MISC1 JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); 
JM_SNPRINTF(jmevent.reg_msg, " 0x%llx; ", (unsigned long long)err->gict_err_misc1); //display GICT_ERRGSR JM_SNPRINTF(jmevent.reg_msg, " %s", disp_payload6_err_reg_name[i++]); JM_SNPRINTF(jmevent.reg_msg, " 0x%llx\n", (unsigned long long)err->gict_errgsr); trace_seq_printf(s, "Register Dump:\n"); decode_jm_common_sec_tail(ev_decoder, common_tail, &jmevent, common_head->val_bits); record_jm_payload_err(ev_decoder, jmevent.reg_msg); trace_seq_printf(s, "%s\n", jmevent.reg_msg); } /* error data decoding functions */ static int decode_jm_oem_type_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event, int payload_type) { int id = JM_PAYLOAD_FIELD_TIMESTAMP; record_jm_data(ev_decoder, JM_OEM_DATA_TYPE_TEXT, id, 0, event->timestamp); if (payload_type == PAYLOAD_TYPE_0) { const struct jm_payload0_type_sec *err = (struct jm_payload0_type_sec *)event->error; decode_jm_payload0_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_1) { const struct jm_payload1_type_sec *err = (struct jm_payload1_type_sec *)event->error; decode_jm_payload1_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_2) { const struct jm_payload2_type_sec *err = (struct jm_payload2_type_sec *)event->error; decode_jm_payload2_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_5) { const struct jm_payload5_type_sec *err = (struct jm_payload5_type_sec *)event->error; decode_jm_payload5_err_regs(ev_decoder, s, err); } else if (payload_type == PAYLOAD_TYPE_6) { const struct jm_payload6_type_sec *err = (struct jm_payload6_type_sec *)event->error; decode_jm_payload6_err_regs(ev_decoder, s, err); } else { trace_seq_printf(s, "%s : wrong payload type\n", __func__); log(TERM, LOG_ERR, "%s : wrong payload type\n", __func__); return -1; } return 0; } /* error type0 data decoding functions */ static int decode_jm_oem_type0_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, 
struct trace_seq *s, struct ras_non_standard_event *event) { return decode_jm_oem_type_error(ras, ev_decoder, s, event, PAYLOAD_TYPE_0); } /* error type1 data decoding functions */ static int decode_jm_oem_type1_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { return decode_jm_oem_type_error(ras, ev_decoder, s, event, PAYLOAD_TYPE_1); } /* error type2 data decoding functions */ static int decode_jm_oem_type2_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { return decode_jm_oem_type_error(ras, ev_decoder, s, event, PAYLOAD_TYPE_2); } /* error type5 data decoding functions */ static int decode_jm_oem_type5_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { return decode_jm_oem_type_error(ras, ev_decoder, s, event, PAYLOAD_TYPE_5); } /* error type6 data decoding functions */ static int decode_jm_oem_type6_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { return decode_jm_oem_type_error(ras, ev_decoder, s, event, PAYLOAD_TYPE_6); } static int add_jm_oem_type0_table(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder) { #ifdef HAVE_SQLITE3 if (ras->record_events && !ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, &ev_decoder->stmt_dec_record, &jm_payload0_event_tab) != SQLITE_OK) { log(TERM, LOG_WARNING, "Failed to create sql jm_payload0_event_tab\n"); return -1; } } #endif return 0; } struct ras_ns_ev_decoder jm_ns_oem_type_decoder[] = { { .sec_type = "82d78ba3-fa14-407a-ba0e-f3ba8170013c", .add_table = add_jm_oem_type0_table, .decode = decode_jm_oem_type0_error, }, { .sec_type = "f9723053-2558-49b1-b58a-1c1a82492a62", .add_table = add_jm_oem_type0_table, .decode = decode_jm_oem_type1_error, }, { .sec_type = 
"2d31de54-3037-4f24-a283-f69ca1ec0b9a", .add_table = add_jm_oem_type0_table, .decode = decode_jm_oem_type2_error, }, { .sec_type = "dac80d69-0a72-4eba-8114-148ee344af06", .add_table = add_jm_oem_type0_table, .decode = decode_jm_oem_type5_error, }, { .sec_type = "746f06fe-405e-451f-8d09-02e802ed984a", .add_table = add_jm_oem_type0_table, .decode = decode_jm_oem_type6_error, }, }; static void __attribute__((constructor)) jm_init(void) { int i; for (i = 0; i < ARRAY_SIZE(jm_ns_oem_type_decoder); i++) register_ns_ev_decoder(&jm_ns_oem_type_decoder[i]); } 07070100000041000081A400000000000000000000000165C04BE400000E95000000000000000000000000000000000000003A00000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-jaguarmicro.h/* * Copyright (c) 2023, JaguarMicro * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* */ #ifndef __NON_STANDARD_JAGUAR_H #define __NON_STANDARD_JAGUAR_H #include "ras-events.h" #include <traceevent/event-parse.h> #include "ras-mce-handler.h" #define PAYLOAD_TYPE_0 0x00 #define PAYLOAD_TYPE_1 0x01 #define PAYLOAD_TYPE_2 0x02 #define PAYLOAD_TYPE_3 0x03 #define PAYLOAD_TYPE_4 0x04 #define PAYLOAD_TYPE_5 0x05 #define PAYLOAD_TYPE_6 0x06 #define PAYLOAD_TYPE_7 0x07 #define PAYLOAD_TYPE_8 0x08 #define PAYLOAD_TYPE_9 0x09 #define PAYLOAD_VERSION 0 enum { JM_COMMON_VALID_VERSION = 0, JM_COMMON_VALID_SOC_ID, JM_COMMON_VALID_SUBSYSTEM_ID, JM_COMMON_VALID_MODULE_ID, JM_COMMON_VALID_SUBMODULE_ID, JM_COMMON_VALID_DEV_ID, JM_COMMON_VALID_ERR_TYPE, JM_COMMON_VALID_ERR_SEVERITY, JM_COMMON_VALID_REG_ARRAY_SIZE = 11, }; struct jm_common_sec_head { uint32_t val_bits; uint8_t version; uint8_t soc_id; uint8_t subsystem_id; uint8_t module_id; uint8_t submodule_id; uint8_t dev_id; uint16_t err_type; uint8_t err_severity; uint8_t reserved[3]; }; struct jm_common_sec_tail { uint32_t reg_array_size; uint32_t reg_array[]; }; /* ras_csr_por*/ struct jm_payload0_type_sec { struct jm_common_sec_head common_head; uint32_t lock_control; uint32_t lock_function; uint32_t cfg_ram_id; uint32_t err_fr_low32; uint32_t err_fr_high32; uint32_t err_ctlr_low32; uint32_t ecc_status_low32; uint32_t ecc_addr_low32; uint32_t ecc_addr_high32; uint32_t ecc_misc0_low32; uint32_t ecc_misc0_high32; uint32_t ecc_misc1_low32; uint32_t ecc_misc1_high32; uint32_t ecc_misc2_Low32; uint32_t ecc_misc2_high32; struct jm_common_sec_tail common_tail; }; /*SMMU IP*/ struct jm_payload1_type_sec { struct jm_common_sec_head common_head; uint32_t smmu_csr; uint32_t errfr; uint32_t errctlr; uint32_t errstatus; uint32_t errgen; struct jm_common_sec_tail common_tail; }; /*HAC SRAM */ struct jm_payload2_type_sec { struct jm_common_sec_head common_head; uint32_t ecc_1bit_int_low; uint32_t ecc_1bit_int_high; uint32_t ecc_2bit_int_low; uint32_t ecc_2bit_int_high; struct jm_common_sec_tail common_tail; }; /*CMN IP */ 
struct jm_payload5_type_sec { struct jm_common_sec_head common_head; uint64_t cfgm_mxp_0; uint64_t cfgm_hnf_0; uint64_t cfgm_hni_0; uint64_t cfgm_sbsx_0; uint64_t errfr_NS; uint64_t errctlrr_NS; uint64_t errstatusr_NS; uint64_t erraddrr_NS; uint64_t errmiscr_NS; uint64_t errfr; uint64_t errctlr; uint64_t errstatus; uint64_t erraddr; uint64_t errmisc; struct jm_common_sec_tail common_tail; }; /*GIC IP */ struct jm_payload6_type_sec { struct jm_common_sec_head common_head; uint64_t record_id; uint64_t gict_err_fr; uint64_t gict_err_ctlr; uint64_t gict_err_status; uint64_t gict_err_addr; uint64_t gict_err_misc0; uint64_t gict_err_misc1; uint64_t gict_errgsr; struct jm_common_sec_tail common_tail; }; enum jm_oem_data_type { JM_OEM_DATA_TYPE_INT, JM_OEM_DATA_TYPE_INT64, JM_OEM_DATA_TYPE_TEXT, }; enum { JM_PAYLOAD_FIELD_ID, JM_PAYLOAD_FIELD_TIMESTAMP, JM_PAYLOAD_FIELD_VERSION, JM_PAYLOAD_FIELD_SOC_ID, JM_PAYLOAD_FIELD_SUB_SYS, JM_PAYLOAD_FIELD_MODULE, JM_PAYLOAD_FIELD_MODULE_ID, JM_PAYLOAD_FIELD_SUB_MODULE, JM_PAYLOAD_FIELD_SUBMODULE_ID, JM_PAYLOAD_FIELD_DEV, JM_PAYLOAD_FIELD_DEV_ID, JM_PAYLOAD_FIELD_ERR_TYPE, JM_PAYLOAD_FIELD_ERR_SEVERITY, JM_PAYLOAD_FIELD_REGS_DUMP, }; #define JM_SNPRINTF mce_snprintf #endif 07070100000042000081A400000000000000000000000165C04BE400001980000000000000000000000000000000000000003500000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-yitian.c/* * Copyright (C) 2023 Alibaba Inc * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <stdbool.h> #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" #include "ras-non-standard-handler.h" #include "non-standard-yitian.h" static const char * const yitian_ddr_payload_err_reg_name[] = { "Error Type:", "Error SubType:", "Error Instance:", "ECCCFG0:", "ECCCFG1:", "ECCSTAT:", "ECCERRCNT:", "ECCCADDR0:", "ECCCADDR1:", "ECCCSYN0:", "ECCCSYN1:", "ECCCSYN2:", "ECCUADDR0:", "ECCUADDR1:", "ECCUSYN0:", "ECCUSYN1:", "ECCUSYN2:", "ECCBITMASK0:", "ECCBITMASK1:", "ECCBITMASK2:", "ADVECCSTAT:", "ECCAPSTAT:", "ECCCDATA0:", "ECCCDATA1:", "ECCUDATA0:", "ECCUDATA1:", "ECCSYMBOL:", "ECCERRCNTCTL:", "ECCERRCNTSTAT:", "ECCERRCNT0:", "ECCERRCNT1:", "RESERVED0:", "RESERVED1:", "RESERVED2:", }; struct yitian_ras_type_info { int id; const char *name; const char * const *sub; int sub_num; }; static const struct yitian_ras_type_info yitian_payload_error_type[] = { { .id = YITIAN_RAS_TYPE_DDR, .name = "DDR", }, { } }; #ifdef HAVE_SQLITE3 static const struct db_fields yitian_ddr_payload_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "address", .type = "INTEGER" }, { .name = "regs_dump", .type = "TEXT" }, }; static const struct db_table_descriptor yitian_ddr_payload_section_tab = { .name = "yitian_ddr_reg_dump_event", .fields = yitian_ddr_payload_fields, .num_fields = ARRAY_SIZE(yitian_ddr_payload_fields), }; int record_yitian_ddr_reg_dump_event(struct ras_ns_ev_decoder *ev_decoder, struct ras_yitian_ddr_payload_event *ev) { int rc; struct sqlite3_stmt *stmt = ev_decoder->stmt_dec_record; log(TERM, LOG_INFO, "yitian_ddr_reg_dump_event store: %p\n", stmt); sqlite3_bind_text(stmt, 1, ev->timestamp, -1, NULL); sqlite3_bind_int64(stmt, 2, ev->address); sqlite3_bind_text(stmt, 3, ev->reg_msg, -1, NULL); rc = sqlite3_step(stmt); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed to do yitian_ddr_reg_dump_event step 
on sqlite: error = %d\n", rc); rc = sqlite3_reset(stmt); if (rc != SQLITE_OK && rc != SQLITE_DONE) log(TERM, LOG_ERR, "Failed reset yitian_ddr_reg_dump_event on sqlite: error = %d\n", rc); log(TERM, LOG_INFO, "register inserted at db\n"); return rc; } #endif static const char *oem_type_name(const struct yitian_ras_type_info *info, uint8_t type_id) { const struct yitian_ras_type_info *type = &info[0]; for (; type->name; type++) { if (type->id != type_id) continue; return type->name; } return "unknown"; } static const char *oem_subtype_name(const struct yitian_ras_type_info *info, uint8_t type_id, uint8_t sub_type_id) { const struct yitian_ras_type_info *type = &info[0]; for (; type->name; type++) { const char * const *submodule = type->sub; if (type->id != type_id) continue; if (!type->sub) return type->name; if (sub_type_id >= type->sub_num) return "unknown"; return submodule[sub_type_id]; } return "unknown"; } void decode_yitian_ddr_payload_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct yitian_ddr_payload_type_sec *err, struct ras_events *ras) { char buf[1024]; char *p = buf; char *end = buf + 1024; int i = 0; const struct yitian_payload_header *header = &err->header; uint32_t *pstart; time_t now; struct tm *tm; struct ras_yitian_ddr_payload_event ev; const char *type_str = oem_type_name(yitian_payload_error_type, header->type); const char *subtype_str = oem_subtype_name(yitian_payload_error_type, header->type, header->subtype); now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); //display error type p += snprintf(p, end - p, " %s", yitian_ddr_payload_err_reg_name[i++]); p += snprintf(p, end - p, " %s,", type_str); //display error subtype p += snprintf(p, end - p, " %s", yitian_ddr_payload_err_reg_name[i++]); p += snprintf(p, end - p, " %s,", subtype_str); //display error instance p += snprintf(p, end - p, " %s", yitian_ddr_payload_err_reg_name[i++]); p += 
snprintf(p, end - p, " 0x%x,", header->instance); //display reg dump for (pstart = (uint32_t *)&err->ecccfg0; (void *)pstart < (void *)(err + 1); pstart += 1) { p += snprintf(p, end - p, " %s", yitian_ddr_payload_err_reg_name[i++]); p += snprintf(p, end - p, " 0x%x ", *pstart); } if (p > buf && p < end) { p--; *p = '\0'; } /* buf stays valid until the row has been stored, so it can be bound directly instead of leaking a malloc'ed copy */ ev.reg_msg = buf; ev.address = 0; trace_seq_printf(s, "%s\n", buf); #ifdef HAVE_SQLITE3 /* the insert statement only exists when event recording is enabled */ if (ev_decoder->stmt_dec_record) record_yitian_ddr_reg_dump_event(ev_decoder, &ev); #endif } static int add_yitian_common_table(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder) { #ifdef HAVE_SQLITE3 if (ras->record_events && !ev_decoder->stmt_dec_record) { if (ras_mc_add_vendor_table(ras, &ev_decoder->stmt_dec_record, &yitian_ddr_payload_section_tab) != SQLITE_OK) { log(TERM, LOG_WARNING, "Failed to create sql yitian_ddr_payload_section_tab\n"); return -1; } } #endif return 0; } /* error data decoding functions */ static int decode_yitian710_ns_error(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event) { int payload_type = event->error[0]; if (payload_type == YITIAN_RAS_TYPE_DDR) { const struct yitian_ddr_payload_type_sec *err = (struct yitian_ddr_payload_type_sec *)event->error; decode_yitian_ddr_payload_err_regs(ev_decoder, s, err, ras); } else { trace_seq_printf(s, "%s: wrong payload type\n", __func__); return -1; } return 0; } struct ras_ns_ev_decoder yitian_ns_oem_decoder[] = { { .sec_type = "a6980811-16ea-4e4d-b936-fb00a23ff29c", .add_table = add_yitian_common_table, .decode = decode_yitian710_ns_error, }, }; static void __attribute__((constructor)) yitian_ns_init(void) { int i; for (i = 0; i < ARRAY_SIZE(yitian_ns_oem_decoder); i++) register_ns_ev_decoder(&yitian_ns_oem_decoder[i]); } 
07070100000043000081A400000000000000000000000165C04BE4000006F3000000000000000000000000000000000000003500000000rasdaemon-0.8.0.49.git+f9cb13b/non-standard-yitian.h/* * Copyright (C) 2023 Alibaba Inc * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * */ #ifndef __NON_STANDARD_YITIAN_H #define __NON_STANDARD_YITIAN_H #include "ras-events.h" #include "traceevent/event-parse.h" #define YITIAN_RAS_TYPE_DDR 0x50 struct yitian_payload_header { uint8_t type; uint8_t subtype; uint16_t instance; }; struct yitian_ddr_payload_type_sec { struct yitian_payload_header header; uint32_t ecccfg0; uint32_t ecccfg1; uint32_t eccstat; uint32_t eccerrcnt; uint32_t ecccaddr0; uint32_t ecccaddr1; uint32_t ecccsyn0; uint32_t ecccsyn1; uint32_t ecccsyn2; uint32_t eccuaddr0; uint32_t eccuaddr1; uint32_t eccusyn0; uint32_t eccusyn1; uint32_t eccusyn2; uint32_t eccbitmask0; uint32_t eccbitmask1; uint32_t eccbitmask2; uint32_t adveccstat; uint32_t eccapstat; uint32_t ecccdata0; uint32_t ecccdata1; uint32_t eccudata0; uint32_t eccudata1; uint32_t eccsymbol; uint32_t eccerrcntctl; uint32_t eccerrcntstat; uint32_t eccerrcnt0; uint32_t eccerrcnt1; uint32_t reserved0; uint32_t reserved1; uint32_t reserved2; }; struct ras_yitian_ddr_payload_event { char timestamp[64]; unsigned long long address; char *reg_msg; }; int record_yitian_ddr_reg_dump_event(struct ras_ns_ev_decoder *ev_decoder, struct ras_yitian_ddr_payload_event *ev); void decode_yitian_ddr_payload_err_regs(struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, const struct yitian_ddr_payload_type_sec *err, struct ras_events *ras); #endif 07070100000044000081A400000000000000000000000165C04BE400000907000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/queue.c/* * Copyright (c) Huawei Technologies Co., Ltd. 
2021-2021. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #include <stdio.h> #include <stdlib.h> #include "queue.h" #include "ras-logger.h" int is_empty(struct link_queue *queue) { if (queue) return queue->size == 0; return 1; } struct link_queue *init_queue(void) { struct link_queue *queue = NULL; queue = (struct link_queue *)malloc(sizeof(struct link_queue)); if (!queue) { log(TERM, LOG_ERR, "Failed to allocate memory for queue.\n"); return NULL; } queue->size = 0; queue->head = NULL; queue->tail = NULL; return queue; } void clear_queue(struct link_queue *queue) { if (!queue) return; struct queue_node *node = queue->head; struct queue_node *tmp = NULL; while (node) { tmp = node; node = node->next; free(tmp); } queue->head = NULL; queue->tail = NULL; queue->size = 0; } void free_queue(struct link_queue *queue) { clear_queue(queue); if (queue) free(queue); } /* The caller must guarantee that neither parameter is NULL */ void push(struct link_queue *queue, struct queue_node *node) { /* there is no element in the queue */ if (!queue->head) queue->head = node; else queue->tail->next = node; queue->tail = node; (queue->size)++; } int pop(struct link_queue *queue) { struct queue_node *tmp = NULL; if (!queue || is_empty(queue)) return -1; tmp = queue->head; queue->head = queue->head->next; /* do not leave tail dangling once the last node is removed */ if (!queue->head) queue->tail = NULL; free(tmp); (queue->size)--; return 0; } struct queue_node *front(struct link_queue *queue) { if (!queue) return NULL; return queue->head; } struct queue_node *node_create(time_t time, unsigned int value) { struct queue_node *node = 
NULL; node = (struct queue_node *)malloc(sizeof(struct queue_node)); if (node) { node->time = time; node->value = value; node->next = NULL; } return node; } 07070100000045000081A400000000000000000000000165C04BE4000004AA000000000000000000000000000000000000002700000000rasdaemon-0.8.0.49.git+f9cb13b/queue.h/* * Copyright (c) Huawei Technologies Co., Ltd. 2021-2021. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #ifndef __RAS_QUEUE_H #define __RAS_QUEUE_H struct queue_node { time_t time; unsigned int value; struct queue_node *next; }; struct link_queue { struct queue_node *head; struct queue_node *tail; int size; }; int is_empty(struct link_queue *queue); struct link_queue *init_queue(void); void clear_queue(struct link_queue *queue); void free_queue(struct link_queue *queue); void push(struct link_queue *queue, struct queue_node *node); int pop(struct link_queue *queue); struct queue_node *front(struct link_queue *queue); struct queue_node *node_create(time_t time, unsigned int value); #endif 07070100000046000081A400000000000000000000000165C04BE400001587000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-aer-handler.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <traceevent/kbuffer.h> #include "ras-aer-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "bitfield.h" #include "ras-report.h" /* bit field meaning for correctable error */ static const char *aer_cor_errors[32] = { /* Correctable errors */ [0] = "Receiver Error", [6] = "Bad TLP", [7] = "Bad DLLP", [8] = "RELAY_NUM Rollover", [12] = "Replay Timer Timeout", [13] = "Advisory Non-Fatal", }; /* bit field meaning for uncorrectable error */ static const char *aer_uncor_errors[32] = { /* Uncorrectable errors */ [4] = "Data Link Protocol", [12] = "Poisoned TLP", [13] = "Flow Control Protocol", [14] = "Completion Timeout", [15] = "Completer Abort", [16] = "Unexpected Completion", [17] = "Receiver Overflow", [18] = "Malformed TLP", [19] = "ECRC", [20] = "Unsupported Request", }; #define BUF_LEN 1024 int ras_aer_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; unsigned long long severity_val; unsigned long long status_val; unsigned long long val; struct ras_events *ras = context; time_t now; struct tm *tm; struct ras_aer_event ev; char buf[BUF_LEN]; char ipmi_add_sel[105]; uint8_t sel_data[5]; int seg, bus, dev, fn; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. 
Instead, let's use uptime, * falling back to the event report's time if the "uptime" clock is * not available (legacy kernels). */ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s ", ev.timestamp); ev.dev_name = tep_get_field_raw(s, event, "dev_name", record, &len, 1); if (!ev.dev_name) return -1; trace_seq_printf(s, "%s ", ev.dev_name); if (tep_get_field_val(s, event, "status", record, &status_val, 1) < 0) return -1; if (tep_get_field_val(s, event, "severity", record, &severity_val, 1) < 0) return -1; /* Fill the error buffer. If it is a correctable error then use the * aer_cor_errors bit field. Otherwise use aer_uncor_errors. */ if (severity_val == HW_EVENT_AER_CORRECTED) bitfield_msg(buf, sizeof(buf), aer_cor_errors, 32, 0, 0, status_val); else bitfield_msg(buf, sizeof(buf), aer_uncor_errors, 32, 0, 0, status_val); ev.msg = buf; if (tep_get_field_val(s, event, "tlp_header_valid", record, &val, 1) < 0) return -1; ev.tlp_header_valid = val; if (ev.tlp_header_valid) { ev.tlp_header = tep_get_field_raw(s, event, "tlp_header", record, &len, 1); if (!ev.tlp_header) return -1; snprintf((buf + strlen(ev.msg)), BUF_LEN - strlen(ev.msg), " TLP Header: %08x %08x %08x %08x", ev.tlp_header[0], ev.tlp_header[1], ev.tlp_header[2], ev.tlp_header[3]); } trace_seq_printf(s, "%s ", ev.msg); /* Map severity_val (enum hw_event_aer_err_type) to an error type string and the SEL sensor number */ switch (severity_val) { case HW_EVENT_AER_UNCORRECTED_NON_FATAL: ev.error_type = "Uncorrected (Non-Fatal)"; sel_data[0] = 0xca; break; case HW_EVENT_AER_UNCORRECTED_FATAL: ev.error_type = "Uncorrected (Fatal)"; sel_data[0] = 0xca; break; case HW_EVENT_AER_CORRECTED: ev.error_type = "Corrected"; sel_data[0] = 0xbf; break; default: ev.error_type = "Unknown severity"; sel_data[0] = 0xbf; } trace_seq_puts(s, ev.error_type); /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 
ras_store_aer_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_aer_event(ras, &ev); #endif #ifdef HAVE_AMP_NS_DECODE /* * Get the PCIe AER error source seg/bus/dev/fn and save it into a * BMC OEM SEL entry. "ipmitool raw 0x0a 0x44" is the IPMI Add SEL * Entry command; please refer to the IPMI specification, chapter 31.6. 0xcd3a * is the manufacturer ID (Ampere), byte 12 is the sensor number (CE is 0xBF, * UE is 0xCA), bytes 13~14 are the segment number, byte 15 is the bus * number, byte 16[7:3] is the device number and byte 16[2:0] is the * function number */ if (sscanf(ev.dev_name, "%x:%x:%x.%x", &seg, &bus, &dev, &fn) == 4) { sel_data[1] = seg & 0xff; sel_data[2] = (seg & 0xff00) >> 8; sel_data[3] = bus; sel_data[4] = (((dev & 0x1f) << 3) | (fn & 0x7)); snprintf(ipmi_add_sel, sizeof(ipmi_add_sel), "ipmitool raw 0x0a 0x44 0x00 0x00 0xc0 0x00 0x00 0x00 0x00 0x3a 0xcd 0x00 0xc0 0x%02x 0x%02x 0x%02x 0x%02x 0x%02x", sel_data[0], sel_data[1], sel_data[2], sel_data[3], sel_data[4]); system(ipmi_add_sel); } #endif return 0; } 07070100000047000081A400000000000000000000000165C04BE400000419000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-aer-handler.h/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_AER_HANDLER_H #define __RAS_AER_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> int ras_aer_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 07070100000048000081A400000000000000000000000165C04BE400001B1D000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-arm-handler.c/* * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <traceevent/kbuffer.h> #include "ras-arm-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" #include "ras-non-standard-handler.h" #include "non-standard-ampere.h" #include "ras-cpu-isolation.h" #define ARM_ERR_VALID_ERROR_COUNT BIT(0) #define ARM_ERR_VALID_FLAGS BIT(1) #define BIT2 2 void display_raw_data(struct trace_seq *s, const uint8_t *buf, uint32_t datalen) { int i = 0, line_count = 0; trace_seq_printf(s, " %08x: ", i); while (datalen >= 4) { print_le_hex(s, buf, i); i += 4; datalen -= 4; if (++line_count == 4) { trace_seq_printf(s, "\n %08x: ", i); line_count = 0; } else trace_seq_printf(s, " "); } } #ifdef HAVE_CPU_FAULT_ISOLATION static int is_core_failure(struct ras_arm_err_info *err_info) { if (err_info->validation_bits & ARM_ERR_VALID_FLAGS) { /* * core failure: * Bits 0, 1, 3: at least one set * Bit 2: clear */ return (err_info->flags & 0xf) && !(err_info->flags & (0x1 << BIT2)); } return 0; } static int count_errors(struct ras_arm_event *ev, int sev) { struct ras_arm_err_info *err_info; int num_pei; int err_info_size = sizeof(struct ras_arm_err_info); int num = 0; int i; int error_count; if (ev->pei_len % err_info_size != 0) { log(TERM, LOG_ERR, "The event data does not match the ARM Processor Error Information Structure\n"); return num; } num_pei = ev->pei_len / err_info_size; err_info = (struct ras_arm_err_info *)(ev->pei_error); for (i = 0; i < num_pei; ++i) { error_count = 1; if (err_info->validation_bits & ARM_ERR_VALID_ERROR_COUNT) { /* * The value of this field is defined as follows: * 0: Single Error * 1: Multiple Errors * 2-65535: Error Count */ error_count = err_info->multiple_error + 1; } if (sev == GHES_SEV_RECOVERABLE && !is_core_failure(err_info)) error_count = 0; num += error_count; err_info += 1; } log(TERM, LOG_INFO, "%d error(s) caught in cpu core\n", num); return num; } static int ras_handle_cpu_error(struct trace_seq *s, 
struct tep_record *record, struct tep_event *event, struct ras_arm_event *ev, time_t now) { unsigned long long val; int cpu; char *severity; struct error_info err_info; if (tep_get_field_val(s, event, "cpu", record, &val, 1) < 0) return -1; cpu = val; trace_seq_printf(s, "\n cpu: %d", cpu); /* record cpu error */ if (tep_get_field_val(s, event, "sev", record, &val, 1) < 0) return -1; /* refer to UEFI_2_9 specification chapter N2.2 Table N-5 */ switch (val) { case GHES_SEV_NO: severity = "Informational"; break; case GHES_SEV_CORRECTED: severity = "Corrected"; break; case GHES_SEV_RECOVERABLE: severity = "Recoverable"; break; default: case GHES_SEV_PANIC: severity = "Fatal"; } trace_seq_printf(s, "\n severity: %s", severity); if (val == GHES_SEV_CORRECTED || val == GHES_SEV_RECOVERABLE) { int nums = count_errors(ev, val); if (nums > 0) { err_info.nums = nums; err_info.time = now; err_info.err_type = val; ras_record_cpu_error(&err_info, cpu); } } return 0; } #endif int ras_arm_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { unsigned long long val; struct ras_events *ras = context; time_t now; struct tm *tm; struct ras_arm_event ev; int len = 0; memset(&ev, 0, sizeof(ev)); /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels). 
*/ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s\n", ev.timestamp); if (tep_get_field_val(s, event, "affinity", record, &val, 1) < 0) return -1; ev.affinity = val; trace_seq_printf(s, " affinity: %d", ev.affinity); if (tep_get_field_val(s, event, "mpidr", record, &val, 1) < 0) return -1; ev.mpidr = val; trace_seq_printf(s, "\n MPIDR: 0x%llx", (unsigned long long)ev.mpidr); if (tep_get_field_val(s, event, "midr", record, &val, 1) < 0) return -1; ev.midr = val; trace_seq_printf(s, "\n MIDR: 0x%llx", (unsigned long long)ev.midr); if (tep_get_field_val(s, event, "running_state", record, &val, 1) < 0) return -1; ev.running_state = val; trace_seq_printf(s, "\n running_state: %d", ev.running_state); if (tep_get_field_val(s, event, "psci_state", record, &val, 1) < 0) return -1; ev.psci_state = val; trace_seq_printf(s, "\n psci_state: %d", ev.psci_state); if (tep_get_field_val(s, event, "pei_len", record, &val, 1) < 0) return -1; ev.pei_len = val; trace_seq_printf(s, "\n ARM Processor Err Info data len: %d\n", ev.pei_len); ev.pei_error = tep_get_field_raw(s, event, "buf", record, &len, 1); if (!ev.pei_error) return -1; display_raw_data(s, ev.pei_error, ev.pei_len); if (tep_get_field_val(s, event, "ctx_len", record, &val, 1) < 0) return -1; ev.ctx_len = val; trace_seq_printf(s, "\n ARM Processor Err Context Info data len: %d\n", ev.ctx_len); ev.ctx_error = tep_get_field_raw(s, event, "buf1", record, &len, 1); if (!ev.ctx_error) return -1; display_raw_data(s, ev.ctx_error, ev.ctx_len); if (tep_get_field_val(s, event, "oem_len", record, &val, 1) < 0) return -1; ev.oem_len = val; trace_seq_printf(s, "\n Vendor Specific Err Info data len: %d\n", ev.oem_len); ev.vsei_error = tep_get_field_raw(s, event, "buf2", record, &len, 1); if (!ev.vsei_error) return -1; #ifdef HAVE_AMP_NS_DECODE //decode ampere specific 
error decode_amp_payload0_err_regs(NULL, s, (struct amp_payload0_type_sec *)ev.vsei_error); #else display_raw_data(s, ev.vsei_error, ev.oem_len); #endif #ifdef HAVE_CPU_FAULT_ISOLATION if (ras_handle_cpu_error(s, record, event, &ev, now) < 0) return -1; #endif /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_arm_record(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_arm_event(ras, &ev); #endif return 0; } 07070100000049000081A400000000000000000000000165C04BE4000004DE000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-arm-handler.h/* * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #ifndef __RAS_ARM_HANDLER_H #define __RAS_ARM_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> /* * ARM Processor Error Information Structure, According to * UEFI_2_9 specification chapter N2.4.4. 
*/ #pragma pack(1) struct ras_arm_err_info { uint8_t version; uint8_t length; uint16_t validation_bits; uint8_t type; uint16_t multiple_error; uint8_t flags; uint64_t error_info; uint64_t virt_fault_addr; uint64_t physical_fault_addr; }; #pragma pack() int ras_arm_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); void display_raw_data(struct trace_seq *s, const uint8_t *buf, uint32_t datalen); #endif 0707010000004A000081A400000000000000000000000165C04BE400002670000000000000000000000000000000000000003300000000rasdaemon-0.8.0.49.git+f9cb13b/ras-cpu-isolation.c/* * Copyright (c) Huawei Technologies Co., Ltd. 2021-2021. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <errno.h> #include <unistd.h> #include <limits.h> #include <ctype.h> #include "ras-logger.h" #include "ras-cpu-isolation.h" #define SECOND_OF_MON (30 * 24 * 60 * 60) #define SECOND_OF_DAY (24 * 60 * 60) #define SECOND_OF_HOU (60 * 60) #define SECOND_OF_MIN (60) #define LIMIT_OF_CPU_THRESHOLD 10000 #define INIT_OF_CPU_THRESHOLD 18 #define DEC_CHECK 10 #define LAST_BIT_OF_UL 5 static struct cpu_info *cpu_infos; static unsigned int ncores; static unsigned int enabled = 1; static const char *cpu_path_format = "/sys/devices/system/cpu/cpu%d/online"; static const struct param normal_units[] = { {"", 1}, {} }; static const struct param cycle_units[] = { {"d", SECOND_OF_DAY}, {"h", SECOND_OF_HOU}, {"m", SECOND_OF_MIN}, {"s", 1}, {} }; static struct isolation_param threshold = { .name = "CPU_CE_THRESHOLD", .units = normal_units, .value = INIT_OF_CPU_THRESHOLD, .limit = LIMIT_OF_CPU_THRESHOLD }; static struct isolation_param cpu_limit = { .name = "CPU_ISOLATION_LIMIT", .units = normal_units }; static struct isolation_param cycle = { .name = "CPU_ISOLATION_CYCLE", .units = cycle_units, .value = SECOND_OF_DAY, .limit = SECOND_OF_MON }; static const char * const cpu_state[] = { [CPU_OFFLINE] = "offline", [CPU_ONLINE] = "online", [CPU_OFFLINE_FAILED] = "offline-failed", [CPU_UNKNOWN] = "unknown" }; static int open_sys_file(unsigned int cpu, int __oflag, const char *format) { int fd; char path[PATH_MAX] = ""; char real_path[PATH_MAX] = ""; snprintf(path, sizeof(path), format, cpu); if (strlen(path) > PATH_MAX || realpath(path, real_path) == NULL) { log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, path); return -1; } fd = open(real_path, __oflag); if (fd == -1) { log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, real_path); return -1; } return fd; } static int get_cpu_status(unsigned int cpu) { int fd, num; char buf[2] = ""; fd = open_sys_file(cpu, O_RDONLY, 
cpu_path_format); if (fd == -1) return CPU_UNKNOWN; if (read(fd, buf, 1) <= 0 || sscanf(buf, "%d", &num) != 1) num = CPU_UNKNOWN; close(fd); return (num < 0 || num > CPU_UNKNOWN) ? CPU_UNKNOWN : num; } static int init_cpu_info(unsigned int cpus) { ncores = cpus; /* zero-filled so that cpu_infos_free() is safe even after a partial init */ cpu_infos = (struct cpu_info *)calloc(cpus, sizeof(*cpu_infos)); if (!cpu_infos) { log(TERM, LOG_ERR, "Failed to allocate memory for cpu infos in %s.\n", __func__); return -1; } for (unsigned int i = 0; i < cpus; ++i) { cpu_infos[i].ce_nums = 0; cpu_infos[i].uce_nums = 0; cpu_infos[i].state = get_cpu_status(i); cpu_infos[i].ce_queue = init_queue(); if (!cpu_infos[i].ce_queue) { log(TERM, LOG_ERR, "Failed to allocate memory for cpu ce queue in %s.\n", __func__); return -1; } } /* at most (number of cpus - 1) cpus may be offlined */ cpu_limit.limit = cpus - 1; cpu_limit.value = 0; return 0; } static void check_config(struct isolation_param *config) { if (config->value > config->limit) { log(TERM, LOG_WARNING, "Value: %lu exceeds limit: %lu, set to limit\n", config->value, config->limit); config->value = config->limit; } } static int parse_ul_config(struct isolation_param *config, char *env, unsigned long *value) { char *unit = NULL; int env_size, has_unit = 0; if (!env || strlen(env) == 0) return -1; env_size = strlen(env); unit = env + env_size - 1; if (isalpha((unsigned char)*unit)) { has_unit = 1; env_size--; if (env_size <= 0) return -1; } for (int i = 0; i < env_size; ++i) { if (isdigit((unsigned char)env[i])) { if (*value > ULONG_MAX / DEC_CHECK || (*value == ULONG_MAX / DEC_CHECK && env[i] - '0' > LAST_BIT_OF_UL)) { log(TERM, LOG_ERR, "%s is out of range: %lu\n", env, ULONG_MAX); return -1; } *value = DEC_CHECK * (*value) + (env[i] - '0'); } else return -1; } if (!has_unit) return 0; for (const struct param *units = config->units; units->name; units++) { /* value character and unit character are both valid */ if (!strcasecmp(unit, units->name)) { if (*value > (ULONG_MAX / units->value)) { log(TERM, LOG_ERR, "%s is out of 
range: %lu\n", env, ULONG_MAX); return -1; } *value = (*value) * units->value; return 0; } } log(TERM, LOG_ERR, "Invalid unit %s\n", unit); return -1; } static void init_config(struct isolation_param *config) { char *env = getenv(config->name); unsigned long value = 0; if (parse_ul_config(config, env, &value) < 0) { log(TERM, LOG_ERR, "Invalid %s: %s! Use default value %lu.\n", config->name, env ? env : "(unset)", config->value); return; } config->value = value; check_config(config); } static int check_config_status(void) { char *env = getenv("CPU_ISOLATION_ENABLE"); if (!env || strcasecmp(env, "yes")) return -1; return 0; } void ras_cpu_isolation_init(unsigned int cpus) { if (init_cpu_info(cpus) < 0 || check_config_status() < 0) { enabled = 0; log(TERM, LOG_WARNING, "Cpu fault isolation is disabled\n"); return; } log(TERM, LOG_INFO, "Cpu fault isolation is enabled\n"); init_config(&threshold); init_config(&cpu_limit); init_config(&cycle); } void cpu_infos_free(void) { if (cpu_infos) { for (unsigned int i = 0; i < ncores; ++i) free_queue(cpu_infos[i].ce_queue); free(cpu_infos); cpu_infos = NULL; } } static int do_cpu_offline(unsigned int cpu) { int fd, rc; char buf[2] = ""; cpu_infos[cpu].state = CPU_OFFLINE_FAILED; fd = open_sys_file(cpu, O_RDWR, cpu_path_format); if (fd == -1) return HANDLE_FAILED; strcpy(buf, "0"); rc = write(fd, buf, strlen(buf)); if (rc < 0) { log(TERM, LOG_ERR, "cpu%u offline failed, errno:%d\n", cpu, errno); close(fd); return HANDLE_FAILED; } close(fd); /* check whether the cpu is isolated successfully */ cpu_infos[cpu].state = get_cpu_status(cpu); if (cpu_infos[cpu].state == CPU_OFFLINE) return HANDLE_SUCCEED; return HANDLE_FAILED; } static int do_ce_handler(unsigned int cpu) { struct link_queue *queue = cpu_infos[cpu].ce_queue; unsigned int tmp; /* * Corrected errors are counted over a sliding window of the configured * cycle: the time and error count of each event are pushed onto the * queue, and whenever the span from the oldest to the newest entry * exceeds the cycle, the oldest entries are popped (and their counts * subtracted) until the span fits within the cycle again. */
while (queue->head && queue->tail && queue->tail->time - queue->head->time > cycle.value) { tmp = queue->head->value; if (pop(queue) == 0) cpu_infos[cpu].ce_nums -= tmp; } log(TERM, LOG_INFO, "Current number of Corrected Errors in cpu%u in the cycle is %lu\n", cpu, cpu_infos[cpu].ce_nums); if (cpu_infos[cpu].ce_nums >= threshold.value) { log(TERM, LOG_INFO, "Corrected Errors exceeded threshold %lu, try to offline cpu%u\n", threshold.value, cpu); return do_cpu_offline(cpu); } return HANDLE_NOTHING; } static int do_uce_handler(unsigned int cpu) { if (cpu_infos[cpu].uce_nums > 0) { log(TERM, LOG_INFO, "Uncorrected Errors occurred, try to offline cpu%u\n", cpu); return do_cpu_offline(cpu); } return HANDLE_NOTHING; } static int error_handler(unsigned int cpu, struct error_info *err_info) { int ret = HANDLE_NOTHING; switch (err_info->err_type) { case CE: ret = do_ce_handler(cpu); break; case UCE: ret = do_uce_handler(cpu); break; default: break; } return ret; } static void record_error_info(unsigned int cpu, struct error_info *err_info) { switch (err_info->err_type) { case CE: { struct queue_node *node = node_create(err_info->time, err_info->nums); if (!node) { log(TERM, LOG_ERR, "Failed to allocate memory for queue node\n"); return; } push(cpu_infos[cpu].ce_queue, node); cpu_infos[cpu].ce_nums += err_info->nums; break; } case UCE: cpu_infos[cpu].uce_nums++; break; default: break; } } void ras_record_cpu_error(struct error_info *err_info, int cpu) { int ret; if (enabled == 0) return; if (cpu < 0 || (unsigned int)cpu >= ncores) { log(TERM, LOG_ERR, "The current cpu %d exceeds the total number of cpus: %u\n", cpu, ncores); return; } log(TERM, LOG_INFO, "Handling error on cpu%d\n", cpu); cpu_infos[cpu].state = get_cpu_status(cpu); if (cpu_infos[cpu].state != CPU_ONLINE) { log(TERM, LOG_INFO, "Cpu%d is not online or its state is unknown, ignoring\n", cpu); return; } 
record_error_info(cpu, err_info); /* * Since the user may change cpu states at any time, recount the * offlined cpus on every recorded error. */ if (ncores - sysconf(_SC_NPROCESSORS_ONLN) >= cpu_limit.value) { log(TERM, LOG_WARNING, "Offlined cpus have reached the limit: %lu, doing nothing\n", cpu_limit.value); return; } ret = error_handler(cpu, err_info); if (ret == HANDLE_NOTHING) log(TERM, LOG_WARNING, "Doing nothing in the cpu%d\n", cpu); else if (ret == HANDLE_SUCCEED) { log(TERM, LOG_INFO, "Offlining cpu%d succeeded, the state is %s\n", cpu, cpu_state[cpu_infos[cpu].state]); clear_queue(cpu_infos[cpu].ce_queue); cpu_infos[cpu].ce_nums = 0; cpu_infos[cpu].uce_nums = 0; } else log(TERM, LOG_WARNING, "Offlining cpu%d failed, the state is %s\n", cpu, cpu_state[cpu_infos[cpu].state]); } 0707010000004B000081A400000000000000000000000165C04BE400000595000000000000000000000000000000000000003300000000rasdaemon-0.8.0.49.git+f9cb13b/ras-cpu-isolation.h/* * Copyright (c) Huawei Technologies Co., Ltd. 2021-2021. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #ifndef __RAS_CPU_ISOLATION_H #define __RAS_CPU_ISOLATION_H #include "queue.h" #define MAX_BUF_LEN 1024 struct param { char *name; unsigned long value; }; struct isolation_param { char *name; const struct param *units; unsigned long value; unsigned long limit; }; enum cpu_state { CPU_OFFLINE, CPU_ONLINE, CPU_OFFLINE_FAILED, CPU_UNKNOWN, }; enum error_handle_result { HANDLE_FAILED = -1, HANDLE_SUCCEED, HANDLE_NOTHING, }; enum error_type { CE = 1, UCE }; struct cpu_info { unsigned long uce_nums; unsigned long ce_nums; struct link_queue *ce_queue; enum cpu_state state; }; struct error_info { unsigned long nums; time_t time; enum error_type err_type; }; void ras_cpu_isolation_init(unsigned int cpus); void ras_record_cpu_error(struct error_info *err_info, int cpu); void cpu_infos_free(void); #endif 0707010000004C000081A400000000000000000000000165C04BE40000836E000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-cxl-handler.c/* * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <traceevent/kbuffer.h> #include "ras-cxl-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" #include <endian.h> /* Common Functions */ static void convert_timestamp(unsigned long long ts, char *ts_ptr, uint16_t size) { /* CXL Specification 3.0 * Overflow timestamp - The number of unsigned nanoseconds * that have elapsed since midnight, 01-Jan-1970 UTC */ time_t ts_secs = ts / 1000000000ULL; struct tm *tm; tm = localtime(&ts_secs); if (tm) strftime(ts_ptr, size, "%Y-%m-%d %H:%M:%S %z", tm); if (!ts || !tm) strncpy(ts_ptr, "1970-01-01 00:00:00 +0000", size); } static void get_timestamp(struct trace_seq *s, struct tep_record *record, struct ras_events *ras, char *ts_ptr, uint16_t size) { time_t now; struct tm *tm; now = record->ts / user_hz + ras->uptime_diff; tm = localtime(&now); if (tm) strftime(ts_ptr, size, "%Y-%m-%d %H:%M:%S %z", tm); else strncpy(ts_ptr, "1970-01-01 00:00:00 +0000", size); } struct cxl_event_flags { uint32_t bit; const char *flag; }; static int decode_cxl_event_flags(struct trace_seq *s, uint32_t flags, const struct cxl_event_flags *cxl_ev_flags, uint8_t num_elems) { int i; for (i = 0; i < num_elems; i++) { if (flags & cxl_ev_flags[i].bit) if (trace_seq_printf(s, "\'%s\' ", cxl_ev_flags[i].flag) <= 0) return -1; } return 0; } static char *uuid_be(const char *uu) { static char uuid[sizeof("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")]; char *p = uuid; int i; static const unsigned char be[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}; for (i = 0; i < 16; i++) { p += sprintf(p, "%.2x", (unsigned char)uu[be[i]]); switch (i) { case 3: case 5: case 7: case 9: *p++ = '-'; break; } } *p = 0; return uuid; } static const char *get_cxl_type_str(const char **type_array, uint8_t num_elems, uint8_t type) { if (type >= num_elems) return "Unknown"; return type_array[type]; } /* Poison List: Payload out flags */ #define 
CXL_POISON_FLAG_MORE BIT(0) #define CXL_POISON_FLAG_OVERFLOW BIT(1) #define CXL_POISON_FLAG_SCANNING BIT(2) /* CXL poison - source types */ enum cxl_poison_source { CXL_POISON_SOURCE_UNKNOWN = 0, CXL_POISON_SOURCE_EXTERNAL = 1, CXL_POISON_SOURCE_INTERNAL = 2, CXL_POISON_SOURCE_INJECTED = 3, CXL_POISON_SOURCE_VENDOR = 7, }; /* CXL poison - trace types */ enum cxl_poison_trace_type { CXL_POISON_TRACE_LIST, CXL_POISON_TRACE_INJECT, CXL_POISON_TRACE_CLEAR, }; int ras_cxl_poison_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; unsigned long long val; struct ras_events *ras = context; struct ras_cxl_poison_event ev; get_timestamp(s, record, ras, (char *)&ev.timestamp, sizeof(ev.timestamp)); if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0) return -1; ev.memdev = tep_get_field_raw(s, event, "memdev", record, &len, 1); if (!ev.memdev) return -1; if (trace_seq_printf(s, "memdev:%s ", ev.memdev) <= 0) return -1; ev.host = tep_get_field_raw(s, event, "host", record, &len, 1); if (!ev.host) return -1; if (trace_seq_printf(s, "host:%s ", ev.host) <= 0) return -1; if (tep_get_field_val(s, event, "serial", record, &val, 1) < 0) return -1; ev.serial = val; if (trace_seq_printf(s, "serial:0x%llx ", (unsigned long long)ev.serial) <= 0) return -1; if (tep_get_field_val(s, event, "trace_type", record, &val, 1) < 0) return -1; switch (val) { case CXL_POISON_TRACE_LIST: ev.trace_type = "List"; break; case CXL_POISON_TRACE_INJECT: ev.trace_type = "Inject"; break; case CXL_POISON_TRACE_CLEAR: ev.trace_type = "Clear"; break; default: ev.trace_type = "Invalid"; } if (trace_seq_printf(s, "trace_type:%s ", ev.trace_type) <= 0) return -1; ev.region = tep_get_field_raw(s, event, "region", record, &len, 1); if (!ev.region) return -1; if (trace_seq_printf(s, "region:%s ", ev.region) <= 0) return -1; ev.uuid = tep_get_field_raw(s, event, "uuid", record, &len, 1); if (!ev.uuid) return -1; if (trace_seq_printf(s, "region_uuid:%s ", 
ev.uuid) <= 0) return -1; if (tep_get_field_val(s, event, "hpa", record, &val, 1) < 0) return -1; ev.hpa = val; if (trace_seq_printf(s, "poison list: hpa:0x%llx ", (unsigned long long)ev.hpa) <= 0) return -1; if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0) return -1; ev.dpa = val; if (trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa) <= 0) return -1; if (tep_get_field_val(s, event, "dpa_length", record, &val, 1) < 0) return -1; ev.dpa_length = val; if (trace_seq_printf(s, "dpa_length:0x%x ", ev.dpa_length) <= 0) return -1; if (tep_get_field_val(s, event, "source", record, &val, 1) < 0) return -1; switch (val) { case CXL_POISON_SOURCE_UNKNOWN: ev.source = "Unknown"; break; case CXL_POISON_SOURCE_EXTERNAL: ev.source = "External"; break; case CXL_POISON_SOURCE_INTERNAL: ev.source = "Internal"; break; case CXL_POISON_SOURCE_INJECTED: ev.source = "Injected"; break; case CXL_POISON_SOURCE_VENDOR: ev.source = "Vendor"; break; default: ev.source = "Invalid"; } if (trace_seq_printf(s, "source:%s ", ev.source) <= 0) return -1; if (tep_get_field_val(s, event, "flags", record, &val, 1) < 0) return -1; ev.flags = val; if (trace_seq_printf(s, "flags:%d ", ev.flags) <= 0) return -1; if (ev.flags & CXL_POISON_FLAG_OVERFLOW) { if (tep_get_field_val(s, event, "overflow_ts", record, &val, 1) < 0) return -1; convert_timestamp(val, ev.overflow_ts, sizeof(ev.overflow_ts)); } else strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000", sizeof(ev.overflow_ts)); if (trace_seq_printf(s, "overflow timestamp:%s\n", ev.overflow_ts) <= 0) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_poison_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_poison_event(ras, &ev); #endif return 0; } /* CXL AER Errors */ #define CXL_AER_UE_CACHE_DATA_PARITY BIT(0) #define CXL_AER_UE_CACHE_ADDR_PARITY BIT(1) #define CXL_AER_UE_CACHE_BE_PARITY BIT(2) #define CXL_AER_UE_CACHE_DATA_ECC BIT(3) #define 
CXL_AER_UE_MEM_DATA_PARITY BIT(4) #define CXL_AER_UE_MEM_ADDR_PARITY BIT(5) #define CXL_AER_UE_MEM_BE_PARITY BIT(6) #define CXL_AER_UE_MEM_DATA_ECC BIT(7) #define CXL_AER_UE_REINIT_THRESH BIT(8) #define CXL_AER_UE_RSVD_ENCODE BIT(9) #define CXL_AER_UE_POISON BIT(10) #define CXL_AER_UE_RECV_OVERFLOW BIT(11) #define CXL_AER_UE_INTERNAL_ERR BIT(14) #define CXL_AER_UE_IDE_TX_ERR BIT(15) #define CXL_AER_UE_IDE_RX_ERR BIT(16) #define CXL_AER_CE_CACHE_DATA_ECC BIT(0) #define CXL_AER_CE_MEM_DATA_ECC BIT(1) #define CXL_AER_CE_CRC_THRESH BIT(2) #define CXL_AER_CE_RETRY_THRESH BIT(3) #define CXL_AER_CE_CACHE_POISON BIT(4) #define CXL_AER_CE_MEM_POISON BIT(5) #define CXL_AER_CE_PHYS_LAYER_ERR BIT(6) struct cxl_error_list { uint32_t bit; const char *error; }; static const struct cxl_error_list cxl_aer_ue[] = { { .bit = CXL_AER_UE_CACHE_DATA_PARITY, .error = "Cache Data Parity Error" }, { .bit = CXL_AER_UE_CACHE_ADDR_PARITY, .error = "Cache Address Parity Error" }, { .bit = CXL_AER_UE_CACHE_BE_PARITY, .error = "Cache Byte Enable Parity Error" }, { .bit = CXL_AER_UE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" }, { .bit = CXL_AER_UE_MEM_DATA_PARITY, .error = "Memory Data Parity Error" }, { .bit = CXL_AER_UE_MEM_ADDR_PARITY, .error = "Memory Address Parity Error" }, { .bit = CXL_AER_UE_MEM_BE_PARITY, .error = "Memory Byte Enable Parity Error" }, { .bit = CXL_AER_UE_MEM_DATA_ECC, .error = "Memory Data ECC Error" }, { .bit = CXL_AER_UE_REINIT_THRESH, .error = "REINIT Threshold Hit" }, { .bit = CXL_AER_UE_RSVD_ENCODE, .error = "Received Unrecognized Encoding" }, { .bit = CXL_AER_UE_POISON, .error = "Received Poison From Peer" }, { .bit = CXL_AER_UE_RECV_OVERFLOW, .error = "Receiver Overflow" }, { .bit = CXL_AER_UE_INTERNAL_ERR, .error = "Component Specific Error" }, { .bit = CXL_AER_UE_IDE_TX_ERR, .error = "IDE Tx Error" }, { .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" }, }; static const struct cxl_error_list cxl_aer_ce[] = { { .bit = CXL_AER_CE_CACHE_DATA_ECC, .error = 
"Cache Data ECC Error" }, { .bit = CXL_AER_CE_MEM_DATA_ECC, .error = "Memory Data ECC Error" }, { .bit = CXL_AER_CE_CRC_THRESH, .error = "CRC Threshold Hit" }, { .bit = CXL_AER_CE_RETRY_THRESH, .error = "Retry Threshold" }, { .bit = CXL_AER_CE_CACHE_POISON, .error = "Received Cache Poison From Peer" }, { .bit = CXL_AER_CE_MEM_POISON, .error = "Received Memory Poison From Peer" }, { .bit = CXL_AER_CE_PHYS_LAYER_ERR, .error = "Received Error From Physical Layer" }, }; static int decode_cxl_error_status(struct trace_seq *s, uint32_t status, const struct cxl_error_list *cxl_error_list, uint8_t num_elems) { int i; for (i = 0; i < num_elems; i++) { if (status & cxl_error_list[i].bit) if (trace_seq_printf(s, "\'%s\' ", cxl_error_list[i].error) <= 0) return -1; } return 0; } int ras_cxl_aer_ue_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len, i; unsigned long long val; struct ras_events *ras = context; struct ras_cxl_aer_ue_event ev; memset(&ev, 0, sizeof(ev)); get_timestamp(s, record, ras, (char *)&ev.timestamp, sizeof(ev.timestamp)); if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0) return -1; ev.memdev = tep_get_field_raw(s, event, "memdev", record, &len, 1); if (!ev.memdev) return -1; if (trace_seq_printf(s, "memdev:%s ", ev.memdev) <= 0) return -1; ev.host = tep_get_field_raw(s, event, "host", record, &len, 1); if (!ev.host) return -1; if (trace_seq_printf(s, "host:%s ", ev.host) <= 0) return -1; if (tep_get_field_val(s, event, "serial", record, &val, 1) < 0) return -1; ev.serial = val; if (trace_seq_printf(s, "serial:0x%llx ", (unsigned long long)ev.serial) <= 0) return -1; if (tep_get_field_val(s, event, "status", record, &val, 1) < 0) return -1; ev.error_status = val; if (trace_seq_printf(s, "error status:") <= 0) return -1; if (decode_cxl_error_status(s, ev.error_status, cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue)) < 0) return -1; if (tep_get_field_val(s, event, "first_error", record, &val, 1) < 0) return -1; 
ev.first_error = val; if (trace_seq_printf(s, "first error:") <= 0) return -1; if (decode_cxl_error_status(s, ev.first_error, cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue)) < 0) return -1; ev.header_log = tep_get_field_raw(s, event, "header_log", record, &len, 1); if (!ev.header_log) return -1; if (trace_seq_printf(s, "header log:\n") <= 0) return -1; for (i = 0; i < CXL_HEADERLOG_SIZE_U32; i++) { if (trace_seq_printf(s, "%08x ", ev.header_log[i]) <= 0) break; if ((i > 0) && ((i % 20) == 0)) if (trace_seq_printf(s, "\n") <= 0) break; /* Convert the header log data to big-endian format because * the SQLite database appears to use big-endian storage. */ ev.header_log[i] = htobe32(ev.header_log[i]); } if (i < CXL_HEADERLOG_SIZE_U32) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_aer_ue_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_aer_ue_event(ras, &ev); #endif return 0; } int ras_cxl_aer_ce_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; unsigned long long val; struct ras_events *ras = context; struct ras_cxl_aer_ce_event ev; get_timestamp(s, record, ras, (char *)&ev.timestamp, sizeof(ev.timestamp)); if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0) return -1; ev.memdev = tep_get_field_raw(s, event, "memdev", record, &len, 1); if (!ev.memdev) return -1; if (trace_seq_printf(s, "memdev:%s ", ev.memdev) <= 0) return -1; ev.host = tep_get_field_raw(s, event, "host", record, &len, 1); if (!ev.host) return -1; if (trace_seq_printf(s, "host:%s ", ev.host) <= 0) return -1; if (tep_get_field_val(s, event, "serial", record, &val, 1) < 0) return -1; ev.serial = val; if (trace_seq_printf(s, "serial:0x%llx ", (unsigned long long)ev.serial) <= 0) return -1; if (tep_get_field_val(s, event, "status", record, &val, 1) < 0) return -1; ev.error_status = val; if (trace_seq_printf(s, "error status:") <= 0) return -1; if (decode_cxl_error_status(s,
ev.error_status, cxl_aer_ce, ARRAY_SIZE(cxl_aer_ce)) < 0) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_aer_ce_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_aer_ce_event(ras, &ev); #endif return 0; } /* * CXL rev 3.0 section 8.2.9.2.2; Table 8-49 */ enum cxl_event_log_type { CXL_EVENT_TYPE_INFO = 0x00, CXL_EVENT_TYPE_WARN, CXL_EVENT_TYPE_FAIL, CXL_EVENT_TYPE_FATAL, CXL_EVENT_TYPE_UNKNOWN }; static char *cxl_event_log_type_str(uint32_t log_type) { switch (log_type) { case CXL_EVENT_TYPE_INFO: return "Informational"; case CXL_EVENT_TYPE_WARN: return "Warning"; case CXL_EVENT_TYPE_FAIL: return "Failure"; case CXL_EVENT_TYPE_FATAL: return "Fatal"; default: break; } return "Unknown"; } int ras_cxl_overflow_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; unsigned long long val; struct ras_events *ras = context; struct ras_cxl_overflow_event ev; memset(&ev, 0, sizeof(ev)); get_timestamp(s, record, ras, (char *)&ev.timestamp, sizeof(ev.timestamp)); if (trace_seq_printf(s, "%s ", ev.timestamp) <= 0) return -1; ev.memdev = tep_get_field_raw(s, event, "memdev", record, &len, 1); if (!ev.memdev) return -1; if (trace_seq_printf(s, "memdev:%s ", ev.memdev) <= 0) return -1; ev.host = tep_get_field_raw(s, event, "host", record, &len, 1); if (!ev.host) return -1; if (trace_seq_printf(s, "host:%s ", ev.host) <= 0) return -1; if (tep_get_field_val(s, event, "serial", record, &val, 1) < 0) return -1; ev.serial = val; if (trace_seq_printf(s, "serial:0x%llx ", (unsigned long long)ev.serial) <= 0) return -1; if (tep_get_field_val(s, event, "log", record, &val, 1) < 0) return -1; ev.log_type = cxl_event_log_type_str(val); if (trace_seq_printf(s, "log type:%s ", ev.log_type) <= 0) return -1; if (tep_get_field_val(s, event, "count", record, &val, 1) < 0) return -1; ev.count = val; if (tep_get_field_val(s, event, "first_ts", record, &val, 1) < 0) return 
-1; convert_timestamp(val, ev.first_ts, sizeof(ev.first_ts)); if (tep_get_field_val(s, event, "last_ts", record, &val, 1) < 0) return -1; convert_timestamp(val, ev.last_ts, sizeof(ev.last_ts)); if (ev.count) { if (trace_seq_printf(s, "%u errors from %s to %s\n", ev.count, ev.first_ts, ev.last_ts) <= 0) return -1; } /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_overflow_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_overflow_event(ras, &ev); #endif return 0; } /* * Common Event Record Format * CXL 3.0 section 8.2.9.2.1; Table 8-42 */ #define CXL_EVENT_RECORD_FLAG_PERMANENT BIT(2) #define CXL_EVENT_RECORD_FLAG_MAINT_NEEDED BIT(3) #define CXL_EVENT_RECORD_FLAG_PERF_DEGRADED BIT(4) #define CXL_EVENT_RECORD_FLAG_HW_REPLACE BIT(5) static const struct cxl_event_flags cxl_hdr_flags[] = { { .bit = CXL_EVENT_RECORD_FLAG_PERMANENT, .flag = "PERMANENT_CONDITION" }, { .bit = CXL_EVENT_RECORD_FLAG_MAINT_NEEDED, .flag = "MAINTENANCE_NEEDED" }, { .bit = CXL_EVENT_RECORD_FLAG_PERF_DEGRADED, .flag = "PERFORMANCE_DEGRADED" }, { .bit = CXL_EVENT_RECORD_FLAG_HW_REPLACE, .flag = "HARDWARE_REPLACEMENT_NEEDED" }, }; static int handle_ras_cxl_common_hdr(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context, struct ras_cxl_event_common_hdr *hdr) { int len; unsigned long long val; struct ras_events *ras = context; get_timestamp(s, record, ras, (char *)&hdr->timestamp, sizeof(hdr->timestamp)); if (trace_seq_printf(s, "%s ", hdr->timestamp) <= 0) return -1; hdr->memdev = tep_get_field_raw(s, event, "memdev", record, &len, 1); if (!hdr->memdev) return -1; if (trace_seq_printf(s, "memdev:%s ", hdr->memdev) <= 0) return -1; hdr->host = tep_get_field_raw(s, event, "host", record, &len, 1); if (!hdr->host) return -1; if (trace_seq_printf(s, "host:%s ", hdr->host) <= 0) return -1; if (tep_get_field_val(s, event, "serial", record, &val, 1) < 0) return -1; hdr->serial = val; if (trace_seq_printf(s, 
"serial:0x%llx ", (unsigned long long)hdr->serial) <= 0) return -1; if (tep_get_field_val(s, event, "log", record, &val, 1) < 0) return -1; hdr->log_type = cxl_event_log_type_str(val); if (trace_seq_printf(s, "log type:%s ", hdr->log_type) <= 0) return -1; hdr->hdr_uuid = tep_get_field_raw(s, event, "hdr_uuid", record, &len, 1); if (!hdr->hdr_uuid) return -1; hdr->hdr_uuid = uuid_be(hdr->hdr_uuid); if (trace_seq_printf(s, "hdr_uuid:%s ", hdr->hdr_uuid) <= 0) return -1; if (tep_get_field_val(s, event, "hdr_flags", record, &val, 1) < 0) return -1; hdr->hdr_flags = val; if (decode_cxl_event_flags(s, hdr->hdr_flags, cxl_hdr_flags, ARRAY_SIZE(cxl_hdr_flags)) < 0) return -1; if (tep_get_field_val(s, event, "hdr_handle", record, &val, 1) < 0) return -1; hdr->hdr_handle = val; if (trace_seq_printf(s, "hdr_handle:0x%x ", hdr->hdr_handle) <= 0) return -1; if (tep_get_field_val(s, event, "hdr_related_handle", record, &val, 1) < 0) return -1; hdr->hdr_related_handle = val; if (trace_seq_printf(s, "hdr_related_handle:0x%x ", hdr->hdr_related_handle) <= 0) return -1; if (tep_get_field_val(s, event, "hdr_timestamp", record, &val, 1) < 0) return -1; convert_timestamp(val, hdr->hdr_timestamp, sizeof(hdr->hdr_timestamp)); if (trace_seq_printf(s, "hdr_timestamp:%s ", hdr->hdr_timestamp) <= 0) return -1; if (tep_get_field_val(s, event, "hdr_length", record, &val, 1) < 0) return -1; hdr->hdr_length = val; if (trace_seq_printf(s, "hdr_length:%u ", hdr->hdr_length) <= 0) return -1; if (tep_get_field_val(s, event, "hdr_maint_op_class", record, &val, 1) < 0) return -1; hdr->hdr_maint_op_class = val; if (trace_seq_printf(s, "hdr_maint_op_class:%u ", hdr->hdr_maint_op_class) <= 0) return -1; return 0; } int ras_cxl_generic_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len, i; struct ras_events *ras = context; struct ras_cxl_generic_event ev; const uint8_t *buf; memset(&ev, 0, sizeof(ev)); if (handle_ras_cxl_common_hdr(s, record, 
event, context, &ev.hdr) < 0) return -1; ev.data = tep_get_field_raw(s, event, "data", record, &len, 1); if (!ev.data) return -1; i = 0; buf = ev.data; if (trace_seq_printf(s, "\ndata:\n %08x: ", i) <= 0) return -1; for (i = 0; i < CXL_EVENT_RECORD_DATA_LENGTH; i += 4) { if ((i > 0) && ((i % 16) == 0)) if (trace_seq_printf(s, "\n %08x: ", i) <= 0) break; if (trace_seq_printf(s, "%02x%02x%02x%02x ", buf[i], buf[i + 1], buf[i + 2], buf[i + 3]) <= 0) break; } /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_generic_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_generic_event(ras, &ev); #endif return 0; } #define CXL_DPA_VOLATILE BIT(0) #define CXL_DPA_NOT_REPAIRABLE BIT(1) static const struct cxl_event_flags cxl_dpa_flags[] = { { .bit = CXL_DPA_VOLATILE, .flag = "VOLATILE" }, { .bit = CXL_DPA_NOT_REPAIRABLE, .flag = "NOT_REPAIRABLE" }, }; /* * General Media Event Record - GMER * CXL rev 3.0 Section 8.2.9.2.1.1; Table 8-43 */ #define CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT BIT(0) #define CXL_GMER_EVT_DESC_THRESHOLD_EVENT BIT(1) #define CXL_GMER_EVT_DESC_POISON_LIST_OVERFLOW BIT(2) static const struct cxl_event_flags cxl_gmer_event_desc_flags[] = { { .bit = CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT, .flag = "UNCORRECTABLE EVENT" }, { .bit = CXL_GMER_EVT_DESC_THRESHOLD_EVENT, .flag = "THRESHOLD EVENT" }, { .bit = CXL_GMER_EVT_DESC_POISON_LIST_OVERFLOW, .flag = "POISON LIST OVERFLOW" }, }; #define CXL_GMER_VALID_CHANNEL BIT(0) #define CXL_GMER_VALID_RANK BIT(1) #define CXL_GMER_VALID_DEVICE BIT(2) #define CXL_GMER_VALID_COMPONENT BIT(3) static const char *cxl_gmer_mem_event_type[] = { "ECC Error", "Invalid Address", "Data Path Error", }; static const char *cxl_gmer_trans_type[] = { "Unknown", "Host Read", "Host Write", "Host Scan Media", "Host Inject Poison", "Internal Media Scrub", "Internal Media Management", }; int ras_cxl_general_media_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event 
*event, void *context) { int len, i; unsigned long long val; struct ras_events *ras = context; struct ras_cxl_general_media_event ev; memset(&ev, 0, sizeof(ev)); if (handle_ras_cxl_common_hdr(s, record, event, context, &ev.hdr) < 0) return -1; if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0) return -1; ev.dpa = val; if (trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa) <= 0) return -1; if (tep_get_field_val(s, event, "dpa_flags", record, &val, 1) < 0) return -1; ev.dpa_flags = val; if (trace_seq_printf(s, "dpa_flags:") <= 0) return -1; if (decode_cxl_event_flags(s, ev.dpa_flags, cxl_dpa_flags, ARRAY_SIZE(cxl_dpa_flags)) < 0) return -1; if (tep_get_field_val(s, event, "descriptor", record, &val, 1) < 0) return -1; ev.descriptor = val; if (trace_seq_printf(s, "descriptor:") <= 0) return -1; if (decode_cxl_event_flags(s, ev.descriptor, cxl_gmer_event_desc_flags, ARRAY_SIZE(cxl_gmer_event_desc_flags)) < 0) return -1; if (tep_get_field_val(s, event, "type", record, &val, 1) < 0) return -1; ev.type = val; if (trace_seq_printf(s, "type:%s ", get_cxl_type_str(cxl_gmer_mem_event_type, ARRAY_SIZE(cxl_gmer_mem_event_type), ev.type)) <= 0) return -1; if (tep_get_field_val(s, event, "transaction_type", record, &val, 1) < 0) return -1; ev.transaction_type = val; if (trace_seq_printf(s, "transaction_type:%s ", get_cxl_type_str(cxl_gmer_trans_type, ARRAY_SIZE(cxl_gmer_trans_type), ev.transaction_type)) <= 0) return -1; if (tep_get_field_val(s, event, "validity_flags", record, &val, 1) < 0) return -1; ev.validity_flags = val; if (ev.validity_flags & CXL_GMER_VALID_CHANNEL) { if (tep_get_field_val(s, event, "channel", record, &val, 1) < 0) return -1; ev.channel = val; if (trace_seq_printf(s, "channel:%u ", ev.channel) <= 0) return -1; } if (ev.validity_flags & CXL_GMER_VALID_RANK) { if (tep_get_field_val(s, event, "rank", record, &val, 1) < 0) return -1; ev.rank = val; if (trace_seq_printf(s, "rank:%u ", ev.rank) <= 0) return -1; } if (ev.validity_flags & 
CXL_GMER_VALID_DEVICE) { if (tep_get_field_val(s, event, "device", record, &val, 1) < 0) return -1; ev.device = val; if (trace_seq_printf(s, "device:%x ", ev.device) <= 0) return -1; } if (ev.validity_flags & CXL_GMER_VALID_COMPONENT) { ev.comp_id = tep_get_field_raw(s, event, "comp_id", record, &len, 1); if (!ev.comp_id) return -1; if (trace_seq_printf(s, "comp_id:") <= 0) return -1; for (i = 0; i < CXL_EVENT_GEN_MED_COMP_ID_SIZE; i++) { if (trace_seq_printf(s, "%02x ", ev.comp_id[i]) <= 0) break; } } /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_general_media_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_general_media_event(ras, &ev); #endif return 0; } /* * DRAM Event Record - DER * * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44 */ #define CXL_DER_VALID_CHANNEL BIT(0) #define CXL_DER_VALID_RANK BIT(1) #define CXL_DER_VALID_NIBBLE BIT(2) #define CXL_DER_VALID_BANK_GROUP BIT(3) #define CXL_DER_VALID_BANK BIT(4) #define CXL_DER_VALID_ROW BIT(5) #define CXL_DER_VALID_COLUMN BIT(6) #define CXL_DER_VALID_CORRECTION_MASK BIT(7) int ras_cxl_dram_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len, i; unsigned long long val; struct ras_events *ras = context; struct ras_cxl_dram_event ev; memset(&ev, 0, sizeof(ev)); if (handle_ras_cxl_common_hdr(s, record, event, context, &ev.hdr) < 0) return -1; if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0) return -1; ev.dpa = val; if (trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa) <= 0) return -1; if (tep_get_field_val(s, event, "dpa_flags", record, &val, 1) < 0) return -1; ev.dpa_flags = val; if (trace_seq_printf(s, "dpa_flags:") <= 0) return -1; if (decode_cxl_event_flags(s, ev.dpa_flags, cxl_dpa_flags, ARRAY_SIZE(cxl_dpa_flags)) < 0) return -1; if (tep_get_field_val(s, event, "descriptor", record, &val, 1) < 0) return -1; ev.descriptor = val; if (trace_seq_printf(s, "descriptor:") 
<= 0) return -1; if (decode_cxl_event_flags(s, ev.descriptor, cxl_gmer_event_desc_flags, ARRAY_SIZE(cxl_gmer_event_desc_flags)) < 0) return -1; if (tep_get_field_val(s, event, "type", record, &val, 1) < 0) return -1; ev.type = val; if (trace_seq_printf(s, "type:%s ", get_cxl_type_str(cxl_gmer_mem_event_type, ARRAY_SIZE(cxl_gmer_mem_event_type), ev.type)) <= 0) return -1; if (tep_get_field_val(s, event, "transaction_type", record, &val, 1) < 0) return -1; ev.transaction_type = val; if (trace_seq_printf(s, "transaction_type:%s ", get_cxl_type_str(cxl_gmer_trans_type, ARRAY_SIZE(cxl_gmer_trans_type), ev.transaction_type)) <= 0) return -1; if (tep_get_field_val(s, event, "validity_flags", record, &val, 1) < 0) return -1; ev.validity_flags = val; if (ev.validity_flags & CXL_DER_VALID_CHANNEL) { if (tep_get_field_val(s, event, "channel", record, &val, 1) < 0) return -1; ev.channel = val; if (trace_seq_printf(s, "channel:%u ", ev.channel) <= 0) return -1; } if (ev.validity_flags & CXL_DER_VALID_RANK) { if (tep_get_field_val(s, event, "rank", record, &val, 1) < 0) return -1; ev.rank = val; if (trace_seq_printf(s, "rank:%u ", ev.rank) <= 0) return -1; } if (ev.validity_flags & CXL_DER_VALID_NIBBLE) { if (tep_get_field_val(s, event, "nibble_mask", record, &val, 1) < 0) return -1; ev.nibble_mask = val; if (trace_seq_printf(s, "nibble_mask:%u ", ev.nibble_mask) <= 0) return -1; } if (ev.validity_flags & CXL_DER_VALID_BANK_GROUP) { if (tep_get_field_val(s, event, "bank_group", record, &val, 1) < 0) return -1; ev.bank_group = val; if (trace_seq_printf(s, "bank_group:%u ", ev.bank_group) <= 0) return -1; } if (ev.validity_flags & CXL_DER_VALID_BANK) { if (tep_get_field_val(s, event, "bank", record, &val, 1) < 0) return -1; ev.bank = val; if (trace_seq_printf(s, "bank:%u ", ev.bank) <= 0) return -1; } if (ev.validity_flags & CXL_DER_VALID_ROW) { if (tep_get_field_val(s, event, "row", record, &val, 1) < 0) return -1; ev.row = val; if (trace_seq_printf(s, "row:%u ", ev.row) <= 0) 
return -1; } if (ev.validity_flags & CXL_DER_VALID_COLUMN) { if (tep_get_field_val(s, event, "column", record, &val, 1) < 0) return -1; ev.column = val; if (trace_seq_printf(s, "column:%u ", ev.column) <= 0) return -1; } if (ev.validity_flags & CXL_DER_VALID_CORRECTION_MASK) { ev.cor_mask = tep_get_field_raw(s, event, "cor_mask", record, &len, 1); if (!ev.cor_mask) return -1; if (trace_seq_printf(s, "correction_mask:") <= 0) return -1; for (i = 0; i < CXL_EVENT_DER_CORRECTION_MASK_SIZE; i++) { if (trace_seq_printf(s, "%02x ", ev.cor_mask[i]) <= 0) break; } } /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_dram_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_dram_event(ras, &ev); #endif return 0; } /* * Memory Module Event Record - MMER * * CXL rev 3.0 section 8.2.9.2.1.3; Table 8-45 */ static const char *cxl_dev_evt_type[] = { "Health Status Change", "Media Status Change", "Life Used Change", "Temperature Change", "Data Path Error", "LSA Error", }; /* * Device Health Information - DHI * * CXL rev 3.0 section 8.2.9.8.3.1; Table 8-100 */ #define CXL_DHI_HS_MAINTENANCE_NEEDED BIT(0) #define CXL_DHI_HS_PERFORMANCE_DEGRADED BIT(1) #define CXL_DHI_HS_HW_REPLACEMENT_NEEDED BIT(2) static const struct cxl_event_flags cxl_health_status[] = { { .bit = CXL_DHI_HS_MAINTENANCE_NEEDED, .flag = "MAINTENANCE_NEEDED" }, { .bit = CXL_DHI_HS_PERFORMANCE_DEGRADED, .flag = "PERFORMANCE_DEGRADED" }, { .bit = CXL_DHI_HS_HW_REPLACEMENT_NEEDED, .flag = "REPLACEMENT_NEEDED" }, }; static const char *cxl_media_status[] = { "Normal", "Not Ready", "Write Persistency Lost", "All Data Lost", "Write Persistency Loss in the Event of Power Loss", "Write Persistency Loss in Event of Shutdown", "Write Persistency Loss Imminent", "All Data Loss in Event of Power Loss", "All Data loss in the Event of Shutdown", "All Data Loss Imminent", }; static const char *cxl_two_bit_status[] = { "Normal", "Warning", "Critical", }; static const char
*cxl_one_bit_status[] = { "Normal", "Warning", }; #define CXL_DHI_AS_LIFE_USED(as) (as & 0x3) #define CXL_DHI_AS_DEV_TEMP(as) ((as & 0xC) >> 2) #define CXL_DHI_AS_COR_VOL_ERR_CNT(as) ((as & 0x10) >> 4) #define CXL_DHI_AS_COR_PER_ERR_CNT(as) ((as & 0x20) >> 5) int ras_cxl_memory_module_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { unsigned long long val; struct ras_events *ras = context; struct ras_cxl_memory_module_event ev; memset(&ev, 0, sizeof(ev)); if (handle_ras_cxl_common_hdr(s, record, event, context, &ev.hdr) < 0) return -1; if (tep_get_field_val(s, event, "event_type", record, &val, 1) < 0) return -1; ev.event_type = val; if (trace_seq_printf(s, "event_type:%s ", get_cxl_type_str(cxl_dev_evt_type, ARRAY_SIZE(cxl_dev_evt_type), ev.event_type)) <= 0) return -1; if (tep_get_field_val(s, event, "health_status", record, &val, 1) < 0) return -1; ev.health_status = val; if (trace_seq_printf(s, "health_status:") <= 0) return -1; if (decode_cxl_event_flags(s, ev.health_status, cxl_health_status, ARRAY_SIZE(cxl_health_status)) < 0) return -1; if (tep_get_field_val(s, event, "media_status", record, &val, 1) < 0) return -1; ev.media_status = val; if (trace_seq_printf(s, "media_status:%s ", get_cxl_type_str(cxl_media_status, ARRAY_SIZE(cxl_media_status), ev.media_status)) <= 0) return -1; if (tep_get_field_val(s, event, "add_status", record, &val, 1) < 0) return -1; ev.add_status = val; if (trace_seq_printf(s, "as_life_used:%s ", get_cxl_type_str(cxl_two_bit_status, ARRAY_SIZE(cxl_two_bit_status), CXL_DHI_AS_LIFE_USED(ev.add_status))) <= 0) return -1; if (trace_seq_printf(s, "as_dev_temp:%s ", get_cxl_type_str(cxl_two_bit_status, ARRAY_SIZE(cxl_two_bit_status), CXL_DHI_AS_DEV_TEMP(ev.add_status))) <= 0) return -1; if (trace_seq_printf(s, "as_cor_vol_err_cnt:%s ", get_cxl_type_str(cxl_one_bit_status, ARRAY_SIZE(cxl_one_bit_status), CXL_DHI_AS_COR_VOL_ERR_CNT(ev.add_status))) <= 0) return -1; if 
(trace_seq_printf(s, "as_cor_per_err_cnt:%s ", get_cxl_type_str(cxl_one_bit_status, ARRAY_SIZE(cxl_one_bit_status), CXL_DHI_AS_COR_PER_ERR_CNT(ev.add_status))) <= 0) return -1; if (tep_get_field_val(s, event, "life_used", record, &val, 1) < 0) return -1; ev.life_used = val; if (trace_seq_printf(s, "life_used:%u ", ev.life_used) <= 0) return -1; if (tep_get_field_val(s, event, "device_temp", record, &val, 1) < 0) return -1; ev.device_temp = val; if (trace_seq_printf(s, "device_temp:%u ", ev.device_temp) <= 0) return -1; if (tep_get_field_val(s, event, "dirty_shutdown_cnt", record, &val, 1) < 0) return -1; ev.dirty_shutdown_cnt = val; if (trace_seq_printf(s, "dirty_shutdown_cnt:%u ", ev.dirty_shutdown_cnt) <= 0) return -1; if (tep_get_field_val(s, event, "cor_vol_err_cnt", record, &val, 1) < 0) return -1; ev.cor_vol_err_cnt = val; if (trace_seq_printf(s, "cor_vol_err_cnt:%u ", ev.cor_vol_err_cnt) <= 0) return -1; if (tep_get_field_val(s, event, "cor_per_err_cnt", record, &val, 1) < 0) return -1; ev.cor_per_err_cnt = val; if (trace_seq_printf(s, "cor_per_err_cnt:%u ", ev.cor_per_err_cnt) <= 0) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_cxl_memory_module_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_cxl_memory_module_event(ras, &ev); #endif return 0; } 0707010000004D000081A400000000000000000000000165C04BE400000708000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-cxl-handler.h/* * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #ifndef __RAS_CXL_HANDLER_H #define __RAS_CXL_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> int ras_cxl_poison_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_aer_ue_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_aer_ce_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_overflow_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_generic_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_general_media_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_dram_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_cxl_memory_module_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 0707010000004E000081A400000000000000000000000165C04BE400000F7A000000000000000000000000000000000000003500000000rasdaemon-0.8.0.49.git+f9cb13b/ras-devlink-handler.c/* * Copyright (C) 2019 Cong Wang <xiyou.wangcong@gmail.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <traceevent/kbuffer.h> #include "ras-devlink-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" int ras_net_xmit_timeout_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { unsigned long long val; int len; struct ras_events *ras = context; time_t now; struct tm *tm; struct devlink_event ev; if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); else strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp)); /* avoid printing an uninitialized buffer */ trace_seq_printf(s, "%s ", ev.timestamp); ev.bus_name = ""; ev.reporter_name = ""; ev.dev_name = tep_get_field_raw(s, event, "name", record, &len, 1); if (!ev.dev_name) return -1; ev.driver_name = tep_get_field_raw(s, event, "driver", record, &len, 1); if (!ev.driver_name) return -1; if (tep_get_field_val(s, event, "queue_index", record, &val, 1) < 0) return -1; if (asprintf(&ev.msg, "TX timeout on queue: %d\n", (int)val) < 0) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_devlink_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_devlink_event(ras, &ev); #endif free(ev.msg); return 0; } int ras_devlink_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; struct ras_events *ras = context; time_t now; struct tm *tm; struct devlink_event
ev; if (ras->filters[DEVLINK_EVENT] && tep_filter_match(ras->filters[DEVLINK_EVENT], record) == FILTER_MATCH) return 0; /* * Newer kernels (3.10-rc1 or later) provide an uptime clock. * On older kernels, the way to properly timestamp an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do that here. Instead, use the uptime clock, * falling back to the time the event is reported if the "uptime" clock is * not available (legacy kernels). */ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); else strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp)); /* avoid printing an uninitialized buffer */ trace_seq_printf(s, "%s ", ev.timestamp); ev.bus_name = tep_get_field_raw(s, event, "bus_name", record, &len, 1); if (!ev.bus_name) return -1; ev.dev_name = tep_get_field_raw(s, event, "dev_name", record, &len, 1); if (!ev.dev_name) return -1; ev.driver_name = tep_get_field_raw(s, event, "driver_name", record, &len, 1); if (!ev.driver_name) return -1; ev.reporter_name = tep_get_field_raw(s, event, "reporter_name", record, &len, 1); if (!ev.reporter_name) return -1; ev.msg = tep_get_field_raw(s, event, "msg", record, &len, 1); if (!ev.msg) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_devlink_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_devlink_event(ras, &ev); #endif return 0; } 0707010000004F000081A400000000000000000000000165C04BE4000004A5000000000000000000000000000000000000003500000000rasdaemon-0.8.0.49.git+f9cb13b/ras-devlink-handler.h/* * Copyright (C) 2019 Cong Wang <xiyou.wangcong@gmail.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version.
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_DEVLINK_HANDLER_H #define __RAS_DEVLINK_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> int ras_net_xmit_timeout_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); int ras_devlink_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 07070100000050000081A400000000000000000000000165C04BE400000EC8000000000000000000000000000000000000003700000000rasdaemon-0.8.0.49.git+f9cb13b/ras-diskerror-handler.c/* * Copyright (C) 2019 Cong Wang <xiyou.wangcong@gmail.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #ifndef __dev_t_defined #include <sys/types.h> #endif /* __dev_t_defined */ #include <string.h> #include <errno.h> #include <sys/sysmacros.h> #include <traceevent/kbuffer.h> #include "ras-diskerror-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" static const struct { int error; const char *name; } blk_errors[] = { { -EOPNOTSUPP, "operation not supported error" }, { -ETIMEDOUT, "timeout error" }, { -ENOSPC, "critical space allocation error" }, { -ENOLINK, "recoverable transport error" }, { -EREMOTEIO, "critical target error" }, { -EBADE, "critical nexus error" }, { -ENODATA, "critical medium error" }, { -EILSEQ, "protection error" }, { -ENOMEM, "kernel resource error" }, { -EBUSY, "device resource error" }, { -EAGAIN, "nonblocking retry error" }, { -EREMCHG, "dm internal retry error" }, { -EIO, "I/O error" }, }; static const char *get_blk_error(int err) { unsigned int i; for (i = 0; i < ARRAY_SIZE(blk_errors); i++) if (blk_errors[i].error == err) return blk_errors[i].name; return "unknown block error"; } int ras_diskerror_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { unsigned long long val; int len; struct ras_events *ras = context; time_t now; struct tm *tm; struct diskerror_event ev; dev_t dev; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels). 
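A minimal sketch of the timestamp conversion described above, assuming that the "uptime" trace clock counts in USER_HZ ticks (so user_hz would be sysconf(_SC_CLK_TCK)) and that boot_epoch stands for ras->uptime_diff, i.e. wall-clock time minus uptime sampled at startup. The helpers uptime_ts_to_wallclock() and format_utc() are hypothetical; the handlers themselves format with localtime(), while this sketch uses UTC so its output does not depend on TZ.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <time.h>

// Hypothetical helper: turn a ring-buffer timestamp recorded with the
// "uptime" trace clock (assumed to tick user_hz times per second) into a
// wall-clock time. boot_epoch plays the role of ras->uptime_diff.
// user_hz must be positive.
static time_t uptime_ts_to_wallclock(unsigned long long ts, long user_hz,
				     time_t boot_epoch)
{
	return (time_t)(ts / (unsigned long long)user_hz) + boot_epoch;
}

// Format a time_t with the same pattern the handlers pass to strftime(),
// but in UTC. Returns the number of characters written, or 0 on failure.
static size_t format_utc(time_t t, char *buf, size_t len)
{
	struct tm tm;

	if (!gmtime_r(&t, &tm))
		return 0;
	return strftime(buf, len, "%Y-%m-%d %H:%M:%S %z", &tm);
}
```

The integer division drops sub-second precision, which matches the one-second resolution of the "%Y-%m-%d %H:%M:%S %z" format.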
*/ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s ", ev.timestamp); if (tep_get_field_val(s, event, "dev", record, &val, 1) < 0) return -1; dev = (dev_t)val; if (asprintf(&ev.dev, "%u:%u", major(dev), minor(dev)) < 0) return -1; if (tep_get_field_val(s, event, "sector", record, &val, 1) < 0) return -1; ev.sector = val; if (tep_get_field_val(s, event, "nr_sector", record, &val, 1) < 0) return -1; ev.nr_sector = (unsigned int)val; if (tep_get_field_val(s, event, "error", record, &val, 1) < 0) return -1; ev.error = get_blk_error((int)val); ev.rwbs = tep_get_field_raw(s, event, "rwbs", record, &len, 1); if (!ev.rwbs) return -1; ev.cmd = tep_get_field_raw(s, event, "cmd", record, &len, 1); if (!ev.cmd) return -1; /* Insert data into the SGBD */ #ifdef HAVE_SQLITE3 ras_store_diskerror_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_diskerror_event(ras, &ev); #endif free(ev.dev); return 0; } 07070100000051000081A400000000000000000000000165C04BE40000041C000000000000000000000000000000000000003700000000rasdaemon-0.8.0.49.git+f9cb13b/ras-diskerror-handler.h/* * Copyright (C) 2019 Cong Wang <xiyou.wangcong@gmail.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_DISKERROR_HANDLER_H #define __RAS_DISKERROR_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> int ras_diskerror_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 07070100000052000081A400000000000000000000000165C04BE400007248000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-events.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <dirent.h> #include <errno.h> #include <fcntl.h> #include <limits.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/poll.h> #include <signal.h> #include <sys/signalfd.h> #include <linux/version.h> #include <traceevent/kbuffer.h> #include <traceevent/event-parse.h> #include "ras-mc-handler.h" #include "ras-aer-handler.h" #include "ras-non-standard-handler.h" #include "ras-arm-handler.h" #include "ras-mce-handler.h" #include "ras-extlog-handler.h" #include "ras-devlink-handler.h" #include "ras-diskerror-handler.h" #include "ras-memory-failure-handler.h" #include "ras-cxl-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-page-isolation.h" #include "ras-cpu-isolation.h" /* * Polling time, if read() doesn't block. Currently, trace_pipe_raw never * blocks on read(). So, we need to sleep for a while, to avoid spending * too much CPU cycles. A fix for it is expected for 3.10. 
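On kernels where poll() works on trace_pipe_raw, waiting for readability avoids the blind sleep. A minimal single-descriptor sketch, assuming plain POSIX poll(); wait_readable() is a hypothetical helper, whereas rasdaemon itself polls every per-CPU pipe plus a signalfd in read_ras_event_all_cpus().

```c
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: wait up to timeout_sec seconds for fd to become
// readable. A negative timeout_sec waits indefinitely. Returns 1 when data
// can be read, 0 on timeout, and -1 on error or when the descriptor
// reports a condition other than POLLIN (for example only POLLHUP).
static int wait_readable(int fd, int timeout_sec)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int timeout_ms = timeout_sec < 0 ? -1 : timeout_sec * 1000;
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc < 0)
		return -1;
	if (rc == 0)
		return 0;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Passing a negative timeout_sec corresponds to the poll(fds, n_cpus + 1, -1) call in read_ras_event_all_cpus().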
*/ #define POLLING_TIME 3 /* Test for a little-endian machine */ #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ #define ENDIAN KBUFFER_ENDIAN_LITTLE #else #define ENDIAN KBUFFER_ENDIAN_BIG #endif extern char *choices_disable; static int get_debugfs_dir(char *tracing_dir, size_t len) { FILE *fp; char line[MAX_PATH + 1 + 256]; char *p, *type, *dir; fp = fopen("/proc/mounts", "r"); if (!fp) { log(ALL, LOG_INFO, "Can't open /proc/mounts"); return errno; } do { if (!fgets(line, sizeof(line), fp)) break; p = strtok(line, " \t"); if (!p) break; dir = strtok(NULL, " \t"); if (!dir) break; type = strtok(NULL, " \t"); if (!type) break; if (!strcmp(type, "debugfs")) { fclose(fp); strncpy(tracing_dir, dir, len - 1); tracing_dir[len - 1] = '\0'; return 0; } } while (1); fclose(fp); log(ALL, LOG_INFO, "Can't find debugfs\n"); return ENOENT; } static int open_trace(struct ras_events *ras, char *name, int flags) { char fname[MAX_PATH + 1]; strcpy(fname, ras->tracing); strcat(fname, "/"); strcat(fname, name); return open(fname, flags); } static int get_tracing_dir(struct ras_events *ras) { char fname[MAX_PATH + 1]; int rc, has_instances = 0; DIR *dir; struct dirent *entry; get_debugfs_dir(ras->debugfs, sizeof(ras->debugfs)); strcpy(fname, ras->debugfs); strcat(fname, "/tracing"); dir = opendir(fname); if (!dir) return -1; for (entry = readdir(dir); entry; entry = readdir(dir)) { if (strstr(entry->d_name, "instances")) { has_instances = 1; break; } } closedir(dir); strcpy(ras->tracing, ras->debugfs); strcat(ras->tracing, "/tracing"); if (has_instances) { strcat(ras->tracing, "/instances/" TOOL_NAME); rc = mkdir(ras->tracing, 0700); if (rc < 0 && errno != EEXIST) { log(ALL, LOG_INFO, "Unable to create " TOOL_NAME " instance at %s\n", ras->tracing); return -1; } } return 0; } static int is_disabled_event(char *group, char *event) { char ras_event_name[MAX_PATH + 1]; snprintf(ras_event_name, sizeof(ras_event_name), "%s:%s", group, event); if (choices_disable && strlen(choices_disable) 
!= 0 && strstr(choices_disable, ras_event_name)) { return 1; } return 0; } /* * Tracing enable/disable code */ static int __toggle_ras_mc_event(struct ras_events *ras, char *group, char *event, int enable) { int fd, rc; char fname[MAX_PATH + 1]; /* Never enable an event that was disabled in the config file */ if (is_disabled_event(group, event)) enable = 0; snprintf(fname, sizeof(fname), "%s%s:%s\n", enable ? "" : "!", group, event); /* Enable or disable the RAS event */ fd = open_trace(ras, "set_event", O_RDWR | O_APPEND); if (fd < 0) { log(ALL, LOG_WARNING, "Can't open set_event\n"); return errno; } rc = write(fd, fname, strlen(fname)); if (rc < 0) { log(ALL, LOG_WARNING, "Can't write to set_event\n"); close(fd); return rc; } close(fd); if (!rc) { log(ALL, LOG_WARNING, "Nothing was written on set_event\n"); return EIO; } log(ALL, LOG_INFO, "%s:%s event %s\n", group, event, enable ? "enabled" : "disabled"); return 0; } int toggle_ras_mc_event(int enable) { struct ras_events *ras; int rc = 0; ras = calloc(1, sizeof(*ras)); if (!ras) { log(TERM, LOG_ERR, "Can't allocate memory for ras struct\n"); return errno; } rc = get_tracing_dir(ras); if (rc < 0) { log(TERM, LOG_ERR, "Can't locate a mounted debugfs\n"); goto free_ras; } rc = __toggle_ras_mc_event(ras, "ras", "mc_event", enable); #ifdef HAVE_AER rc |= __toggle_ras_mc_event(ras, "ras", "aer_event", enable); #endif #ifdef HAVE_MCE rc |= __toggle_ras_mc_event(ras, "mce", "mce_record", enable); #endif #ifdef HAVE_EXTLOG rc |= __toggle_ras_mc_event(ras, "ras", "extlog_mem_event", enable); #endif #ifdef HAVE_NON_STANDARD rc |= __toggle_ras_mc_event(ras, "ras", "non_standard_event", enable); #endif #ifdef HAVE_ARM rc |= __toggle_ras_mc_event(ras, "ras", "arm_event", enable); #endif #ifdef HAVE_DEVLINK rc |= __toggle_ras_mc_event(ras, "devlink", "devlink_health_report", enable); #endif #ifdef HAVE_DISKERROR #if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 18, 0) rc |= __toggle_ras_mc_event(ras, "block", "block_rq_error", enable); #else rc |= __toggle_ras_mc_event(ras, "block", "block_rq_complete",
enable); #endif #endif #ifdef HAVE_MEMORY_FAILURE rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable); #endif #ifdef HAVE_CXL rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_correctable_error", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_overflow", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_generic_event", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_general_media", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_dram", enable); rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_memory_module", enable); #endif free_ras: free(ras); return rc; } #if LINUX_VERSION_CODE < KERNEL_VERSION(5, 18, 0) /* * Set kernel filter. libtrace doesn't provide an API for setting filters * in kernel, we have to implement it here. */ static int filter_ras_mc_event(struct ras_events *ras, char *group, char *event, const char *filter_str) { int fd, rc; char fname[MAX_PATH + 1]; snprintf(fname, sizeof(fname), "events/%s/%s/filter", group, event); fd = open_trace(ras, fname, O_RDWR | O_APPEND); if (fd < 0) { log(ALL, LOG_WARNING, "Can't open filter file\n"); return errno; } rc = write(fd, filter_str, strlen(filter_str)); if (rc < 0) { log(ALL, LOG_WARNING, "Can't write to filter file\n"); close(fd); return rc; } close(fd); if (!rc) { log(ALL, LOG_WARNING, "Nothing was written on filter file\n"); return EIO; } return 0; } #endif /* * Tracing read code */ static int get_pagesize(struct ras_events *ras, struct tep_handle *pevent) { int fd, len, page_size = 4096; char buf[page_size]; fd = open_trace(ras, "events/header_page", O_RDONLY); if (fd < 0) return page_size; len = read(fd, buf, page_size); if (len <= 0) goto error; if (tep_parse_header_page(pevent, buf, len, sizeof(long))) goto error; error: close(fd); return page_size; } static void parse_ras_data(struct pthread_data *pdata, struct kbuffer 
*kbuf, void *data, unsigned long long time_stamp) { struct tep_record record; struct trace_seq s; record.ts = time_stamp; record.size = kbuffer_event_size(kbuf); record.data = data; record.offset = kbuffer_curr_offset(kbuf); record.cpu = pdata->cpu; /* note offset is just offset in subbuffer */ record.missed_events = kbuffer_missed_events(kbuf); record.record_size = kbuffer_curr_size(kbuf); /* TODO - logging */ trace_seq_init(&s); tep_print_event(pdata->ras->pevent, &s, &record, "%16s-%-5d [%03d] %s %6.1000d %s %s", TEP_PRINT_COMM, TEP_PRINT_PID, TEP_PRINT_CPU, TEP_PRINT_LATENCY, TEP_PRINT_TIME, TEP_PRINT_NAME, TEP_PRINT_INFO); trace_seq_do_printf(&s); printf("\n"); fflush(stdout); trace_seq_destroy(&s); } static int get_num_cpus(struct ras_events *ras) { return sysconf(_SC_NPROCESSORS_ONLN); #if 0 char fname[MAX_PATH + 1]; int num_cpus = 0; DIR *dir; struct dirent *entry; strcpy(fname, ras->debugfs); strcat(fname, "/tracing/per_cpu/"); dir = opendir(fname); if (!dir) return -1; for (entry = readdir(dir); entry; entry = readdir(dir)) { if (strstr(entry->d_name, "cpu")) num_cpus++; } closedir(dir); return num_cpus; #endif } static int set_buffer_percent(struct ras_events *ras, int percent) { char buf[16]; ssize_t size; int res = 0; int fd; fd = open_trace(ras, "buffer_percent", O_WRONLY); if (fd >= 0) { /* For the backward compatibility to the old kernels, do not return * if fail to set the buffer_percent. 
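A minimal sketch of this non-fatal write pattern, assuming the control file already exists (tracefs files are not created from user space). write_int_to_file() is a hypothetical name; unlike set_buffer_percent(), it also treats a short write as a failure.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical helper modelled on set_buffer_percent(): write a decimal
// integer to an existing control file such as <tracing>/buffer_percent.
// Returns 0 on success and -1 on failure; the caller decides whether a
// failure is fatal (old kernels simply lack buffer_percent).
static int write_int_to_file(const char *path, int value)
{
	char buf[16];
	int len = snprintf(buf, sizeof(buf), "%d", value);
	ssize_t n;
	int fd;

	if (len < 0 || (size_t)len >= sizeof(buf))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, buf, (size_t)len);
	close(fd);
	return n == len ? 0 : -1;
}
```

A caller would log a warning on -1 and carry on, as read_ras_event_all_cpus() does when set_buffer_percent() fails.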
*/ snprintf(buf, sizeof(buf), "%d", percent); size = write(fd, buf, strlen(buf)); if (size <= 0) { log(TERM, LOG_WARNING, "can't write to buffer_percent\n"); res = -1; } close(fd); } else { log(TERM, LOG_WARNING, "Can't open buffer_percent\n"); res = -1; } return res; } static int read_ras_event_all_cpus(struct pthread_data *pdata, unsigned int n_cpus) { ssize_t size; unsigned long long time_stamp; void *data; int ready, i, count_nready; struct kbuffer *kbuf; void *page; struct pollfd fds[n_cpus + 1]; struct signalfd_siginfo fdsiginfo; sigset_t mask; int warnonce[n_cpus]; char pipe_raw[PATH_MAX]; int legacy_kernel = 0; #if 0 int need_sleep = 0; #endif memset(&warnonce, 0, sizeof(warnonce)); page = malloc(pdata[0].ras->page_size); if (!page) { log(TERM, LOG_ERR, "Can't allocate page\n"); return -ENOMEM; } kbuf = kbuffer_alloc(KBUFFER_LSIZE_8, ENDIAN); if (!kbuf) { log(TERM, LOG_ERR, "Can't allocate kbuf\n"); free(page); return -ENOMEM; } /* Fix for poll() on the per_cpu trace_pipe and trace_pipe_raw blocks * indefinitely with the default buffer_percent in the kernel trace system, * which is introduced by the following change in the kernel. * https://lore.kernel.org/all/20221020231427.41be3f26@gandalf.local.home/T/#u. 
* Set buffer_percent to 0 so that poll() will return immediately * when the trace data is available in the ras per_cpu trace pipe_raw */ if (set_buffer_percent(pdata[0].ras, 0)) log(TERM, LOG_WARNING, "Set buffer_percent failed\n"); for (i = 0; i < (n_cpus + 1); i++) fds[i].fd = -1; for (i = 0; i < n_cpus; i++) { fds[i].events = POLLIN; /* FIXME: use select to open for all CPUs */ snprintf(pipe_raw, sizeof(pipe_raw), "per_cpu/cpu%d/trace_pipe_raw", i); fds[i].fd = open_trace(pdata[0].ras, pipe_raw, O_RDONLY); if (fds[i].fd < 0) { log(TERM, LOG_ERR, "Can't open trace_pipe_raw\n"); goto error; } } sigemptyset(&mask); sigaddset(&mask, SIGINT); sigaddset(&mask, SIGTERM); sigaddset(&mask, SIGHUP); sigaddset(&mask, SIGQUIT); if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) log(TERM, LOG_WARNING, "sigprocmask\n"); fds[n_cpus].events = POLLIN; fds[n_cpus].fd = signalfd(-1, &mask, 0); if (fds[n_cpus].fd < 0) { log(TERM, LOG_WARNING, "signalfd\n"); goto error; } log(TERM, LOG_INFO, "Listening to events for cpus 0 to %d\n", n_cpus - 1); if (pdata[0].ras->record_events) { if (ras_mc_event_opendb(pdata[0].cpu, pdata[0].ras)) goto error; #ifdef HAVE_NON_STANDARD if (ras_ns_add_vendor_tables(pdata[0].ras)) log(TERM, LOG_ERR, "Can't add vendor table\n"); #endif } do { ready = poll(fds, (n_cpus + 1), -1); if (ready < 0) { log(TERM, LOG_WARNING, "poll\n"); } /* check for the signal */ if (fds[n_cpus].revents & POLLIN) { size = read(fds[n_cpus].fd, &fdsiginfo, sizeof(struct signalfd_siginfo)); if (size != sizeof(struct signalfd_siginfo)) log(TERM, LOG_WARNING, "signalfd read\n"); if (fdsiginfo.ssi_signo == SIGINT || fdsiginfo.ssi_signo == SIGTERM || fdsiginfo.ssi_signo == SIGHUP || fdsiginfo.ssi_signo == SIGQUIT) { log(TERM, LOG_INFO, "Received signal=%d\n", fdsiginfo.ssi_signo); goto cleanup; } else { log(TERM, LOG_INFO, "Received unexpected signal=%d\n", fdsiginfo.ssi_signo); } } count_nready = 0; for (i = 0; i < n_cpus; i++) { if (fds[i].revents & POLLERR) { if (!warnonce[i]) { 
log(TERM, LOG_INFO, "Error on CPU %i\n", i); warnonce[i]++; #if 0 need_sleep = 1; #endif } } if (!(fds[i].revents & POLLIN)) { count_nready++; continue; } size = read(fds[i].fd, page, pdata[i].ras->page_size); if (size < 0) { log(TERM, LOG_WARNING, "read\n"); goto cleanup; } else if (size > 0) { kbuffer_load_subbuffer(kbuf, page); while ((data = kbuffer_read_event(kbuf, &time_stamp))) { if (kbuffer_curr_size(kbuf) < 0) { log(TERM, LOG_ERR, "invalid kbuf data, discard\n"); break; } parse_ras_data(&pdata[i], kbuf, data, time_stamp); /* increment to read next event */ kbuffer_next_event(kbuf, NULL); } } else { count_nready++; } } #if 0 if (need_sleep) sleep(POLLING_TIME); #else /* * If we enable fallback mode, it will always be used, as * poll is still not working fine, IMHO */ if (count_nready == n_cpus) { /* Should only happen with legacy kernels */ legacy_kernel = 1; break; } #endif } while (1); /* poll() is not supported. We need to fallback to the old way */ log(TERM, LOG_INFO, "Old kernel detected. Stop listening and fall back to pthread way.\n"); cleanup: if (pdata[0].ras->record_events) { #ifdef HAVE_NON_STANDARD ras_ns_finalize_vendor_tables(); #endif ras_mc_event_closedb(pdata[0].cpu, pdata[0].ras); } error: kbuffer_free(kbuf); free(page); sigprocmask(SIG_UNBLOCK, &mask, NULL); for (i = 0; i < (n_cpus + 1); i++) { if (fds[i].fd > 0) close(fds[i].fd); } if (legacy_kernel) return -255; else return -1; } static int read_ras_event(int fd, struct pthread_data *pdata, struct kbuffer *kbuf, void *page) { int size; unsigned long long time_stamp; void *data; /* * read() never blocks. We can't call poll() here, as it is * not supported on kernels below 3.10. So, the better is to just * sleep for a while, to avoid eating too much CPU here. 
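A bounded variant of the legacy read/sleep loop that follows, for illustration only: drain_with_backoff() is a hypothetical helper that stops after max_idle consecutive empty reads instead of looping forever, and it only counts bytes where the real code hands each page to kbuffer_load_subbuffer().

```c
#include <unistd.h>

// Hypothetical helper: read page-sized chunks from fd, sleeping
// idle_sleep_sec seconds after every empty read, and give up after
// max_idle consecutive empty reads. Returns the number of bytes consumed,
// or -1 if read() fails.
static long drain_with_backoff(int fd, void *page, size_t page_size,
			       unsigned int idle_sleep_sec, int max_idle)
{
	long total = 0;
	int idle = 0;

	while (idle < max_idle) {
		ssize_t n = read(fd, page, page_size);

		if (n < 0)
			return -1;
		if (n == 0) {
			idle++;
			sleep(idle_sleep_sec);
			continue;
		}
		idle = 0;
		total += n;
		// The real loop parses the page with kbuffer_load_subbuffer() here.
	}
	return total;
}
```

For a regular file or a closed pipe, a zero-length read means end of file; the original loop instead treats it as "no data yet" and keeps sleeping, which is why it has no exit condition.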
*/ do { size = read(fd, page, pdata->ras->page_size); if (size < 0) { log(TERM, LOG_WARNING, "read\n"); return -1; } else if (size > 0) { kbuffer_load_subbuffer(kbuf, page); while ((data = kbuffer_read_event(kbuf, &time_stamp))) { parse_ras_data(pdata, kbuf, data, time_stamp); /* increment to read next event */ kbuffer_next_event(kbuf, NULL); } } else { sleep(POLLING_TIME); } } while (1); } static void *handle_ras_events_cpu(void *priv) { int fd; struct kbuffer *kbuf; void *page; char pipe_raw[PATH_MAX]; struct pthread_data *pdata = priv; page = malloc(pdata->ras->page_size); if (!page) { log(TERM, LOG_ERR, "Can't allocate page\n"); return NULL; } kbuf = kbuffer_alloc(KBUFFER_LSIZE_8, ENDIAN); if (!kbuf) { log(TERM, LOG_ERR, "Can't allocate kbuf"); free(page); return NULL; } /* FIXME: use select to open for all CPUs */ snprintf(pipe_raw, sizeof(pipe_raw), "per_cpu/cpu%d/trace_pipe_raw", pdata->cpu); fd = open_trace(pdata->ras, pipe_raw, O_RDONLY); if (fd < 0) { log(TERM, LOG_ERR, "Can't open trace_pipe_raw\n"); kbuffer_free(kbuf); free(page); return NULL; } log(TERM, LOG_INFO, "Listening to events on cpu %d\n", pdata->cpu); if (pdata->ras->record_events) { pthread_mutex_lock(&pdata->ras->db_lock); if (ras_mc_event_opendb(pdata->cpu, pdata->ras)) { pthread_mutex_unlock(&pdata->ras->db_lock); log(TERM, LOG_ERR, "Can't open database\n"); close(fd); kbuffer_free(kbuf); free(page); return 0; } #ifdef HAVE_NON_STANDARD if (ras_ns_add_vendor_tables(pdata->ras)) log(TERM, LOG_ERR, "Can't add vendor table\n"); #endif pthread_mutex_unlock(&pdata->ras->db_lock); } read_ras_event(fd, pdata, kbuf, page); if (pdata->ras->record_events) { pthread_mutex_lock(&pdata->ras->db_lock); #ifdef HAVE_NON_STANDARD ras_ns_finalize_vendor_tables(); #endif ras_mc_event_closedb(pdata->cpu, pdata->ras); pthread_mutex_unlock(&pdata->ras->db_lock); } close(fd); kbuffer_free(kbuf); free(page); return NULL; } #define UPTIME "uptime" static int select_tracing_timestamp(struct ras_events *ras) { FILE 
*fp; int fd, rc; time_t now; unsigned long uptime; ssize_t size; unsigned int j1; char buf[4096]; /* Check if uptime is supported (kernel 3.10-rc1 or upper) */ fd = open_trace(ras, "trace_clock", O_RDONLY); if (fd < 0) { log(TERM, LOG_ERR, "Can't open trace_clock\n"); return -1; } /* Keep one byte for the NUL terminator needed by strstr() */ size = read(fd, buf, sizeof(buf) - 1); close(fd); if (size <= 0) { log(TERM, LOG_ERR, "Can't read trace_clock or it is empty\n"); return -1; } buf[size] = '\0'; if (!strstr(buf, UPTIME)) { log(TERM, LOG_INFO, "Kernel doesn't support uptime clock\n"); return 0; } /* Select uptime tracing */ fd = open_trace(ras, "trace_clock", O_WRONLY); if (fd < 0) { log(TERM, LOG_ERR, "Kernel didn't allow writing to trace_clock\n"); return 0; } rc = write(fd, UPTIME, strlen(UPTIME)); close(fd); if (rc < 0) { log(TERM, LOG_ERR, "Kernel didn't allow selecting uptime on trace_clock\n"); return 0; } /* Reference uptime with localtime */ fp = fopen("/proc/uptime", "r"); if (!fp) { log(TERM, LOG_ERR, "Couldn't read from /proc/uptime\n"); return 0; } rc = fscanf(fp, "%lu.%u ", &uptime, &j1); fclose(fp); if (rc <= 0) { log(TERM, LOG_ERR, "Can't parse /proc/uptime!\n"); return -1; } now = time(NULL); ras->use_uptime = 1; ras->uptime_diff = now - (time_t)uptime; return 0; } static int add_event_handler(struct ras_events *ras, struct tep_handle *pevent, unsigned int page_size, char *group, char *event, tep_event_handler_func func, char *filter_str, int id) { int fd, size, rc; char *page, fname[MAX_PATH + 1]; struct tep_event_filter *filter = NULL; snprintf(fname, sizeof(fname), "events/%s/%s/format", group, event); fd = open_trace(ras, fname, O_RDONLY); if (fd < 0) { log(TERM, LOG_ERR, "Can't get %s:%s traces.
Perhaps this feature is not supported on your system.\n", group, event); return errno; } page = malloc(page_size); if (!page) { log(TERM, LOG_ERR, "Can't allocate page to read %s:%s format\n", group, event); rc = errno; close(fd); return rc; } size = read(fd, page, page_size); close(fd); if (size < 0) { log(TERM, LOG_ERR, "Can't read %s:%s format\n", group, event); free(page); return size; } /* Register the event handler */ rc = tep_register_event_handler(pevent, -1, group, event, func, ras); if (rc == TEP_ERRNO__MEM_ALLOC_FAILED) { log(TERM, LOG_ERR, "Can't register event handler for %s:%s\n", group, event); free(page); return EINVAL; } rc = tep_parse_event(pevent, page, size, group); if (rc) { log(TERM, LOG_ERR, "Can't parse event %s:%s\n", group, event); free(page); return EINVAL; } if (filter_str) { char error[255]; filter = tep_filter_alloc(pevent); if (!filter) { log(TERM, LOG_ERR, "Failed to allocate filter for %s/%s.\n", group, event); free(page); return EINVAL; } rc = tep_filter_add_filter_str(filter, filter_str); if (rc < 0) { tep_filter_strerror(filter, rc, error, sizeof(error)); log(TERM, LOG_ERR, "Failed to install filter for %s/%s: %s\n", group, event, error); tep_filter_free(filter); free(page); return rc; } } ras->filters[id] = filter; if (is_disabled_event(group, event)) { log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n", group, event); free(page); return -EINVAL; } /* Enable RAS events */ rc = __toggle_ras_mc_event(ras, group, event, 1); free(page); /* On failure, __toggle_ras_mc_event() returns either a positive errno or a negative write() result */ if (rc) { log(TERM, LOG_ERR, "Can't enable %s:%s tracing\n", group, event); return EINVAL; } log(ALL, LOG_INFO, "Enabled event %s:%s\n", group, event); return 0; } int handle_ras_events(int record_events) { int rc, page_size, i; int num_events = 0; unsigned int cpus; struct tep_handle *pevent = NULL; struct pthread_data *data = NULL; struct ras_events *ras = NULL; #ifdef HAVE_DEVLINK char *filter_str = NULL; #endif ras = calloc(1, sizeof(*ras)); if (!ras) { log(TERM, LOG_ERR, "Can't allocate memory for
ras struct\n"); return errno; } rc = get_tracing_dir(ras); if (rc < 0) { log(TERM, LOG_ERR, "Can't locate a mounted debugfs\n"); goto err; } rc = select_tracing_timestamp(ras); if (rc < 0) { log(TERM, LOG_ERR, "Can't select a timestamp for tracing\n"); goto err; } pevent = tep_alloc(); if (!pevent) { log(TERM, LOG_ERR, "Can't allocate pevent\n"); rc = errno; goto err; } page_size = get_pagesize(ras, pevent); ras->pevent = pevent; ras->page_size = page_size; ras->record_events = record_events; #ifdef HAVE_MEMORY_CE_PFA /* FIXME: enable memory isolation unconditionally */ ras_page_account_init(); #endif rc = add_event_handler(ras, pevent, page_size, "ras", "mc_event", ras_mc_event_handler, NULL, MC_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "ras", "mc_event"); #ifdef HAVE_AER rc = add_event_handler(ras, pevent, page_size, "ras", "aer_event", ras_aer_event_handler, NULL, AER_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "ras", "aer_event"); #endif #ifdef HAVE_NON_STANDARD rc = add_event_handler(ras, pevent, page_size, "ras", "non_standard_event", ras_non_standard_event_handler, NULL, NON_STANDARD_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "ras", "non_standard_event"); #endif #ifdef HAVE_ARM rc = add_event_handler(ras, pevent, page_size, "ras", "arm_event", ras_arm_event_handler, NULL, ARM_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "ras", "arm_event"); #endif cpus = get_num_cpus(ras); #ifdef HAVE_CPU_FAULT_ISOLATION ras_cpu_isolation_init(cpus); #endif #ifdef HAVE_MCE rc = register_mce_handler(ras, cpus); if (rc) log(ALL, LOG_INFO, "Can't register mce handler\n"); if (ras->mce_priv) { rc = add_event_handler(ras, pevent, page_size, "mce", "mce_record", ras_mce_event_handler, NULL, MCE_EVENT); if (!rc) num_events++; else 
log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "mce", "mce_record"); } #endif #ifdef HAVE_EXTLOG rc = add_event_handler(ras, pevent, page_size, "ras", "extlog_mem_event", ras_extlog_mem_event_handler, NULL, EXTLOG_EVENT); if (!rc) { /* tell kernel we are listening, so don't printk to console */ (void)open("/sys/kernel/debug/ras/daemon_active", 0); num_events++; } else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "ras", "extlog_mem_event"); #endif #ifdef HAVE_DEVLINK rc = add_event_handler(ras, pevent, page_size, "net", "net_dev_xmit_timeout", ras_net_xmit_timeout_handler, NULL, DEVLINK_EVENT); if (!rc) filter_str = "devlink/devlink_health_report:msg=~\'TX timeout*\'"; rc = add_event_handler(ras, pevent, page_size, "devlink", "devlink_health_report", ras_devlink_event_handler, filter_str, DEVLINK_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "devlink", "devlink_health_report"); #endif #ifdef HAVE_DISKERROR #if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 18, 0) rc = add_event_handler(ras, pevent, page_size, "block", "block_rq_error", ras_diskerror_event_handler, NULL, DISKERROR_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "block", "block_rq_error"); #else rc = filter_ras_mc_event(ras, "block", "block_rq_complete", "error != 0"); if (!rc) { rc = add_event_handler(ras, pevent, page_size, "block", "block_rq_complete", ras_diskerror_event_handler, NULL, DISKERROR_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "block", "block_rq_complete"); } #endif #endif #ifdef HAVE_MEMORY_FAILURE rc = add_event_handler(ras, pevent, page_size, "ras", "memory_failure_event", ras_memory_failure_event_handler, NULL, MF_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "ras", "memory_failure_event"); #endif #ifdef HAVE_CXL rc = 
add_event_handler(ras, pevent, page_size, "cxl", "cxl_poison", ras_cxl_poison_event_handler, NULL, CXL_POISON_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_poison"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_uncorrectable_error", ras_cxl_aer_ue_event_handler, NULL, CXL_AER_UE_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_aer_uncorrectable_error"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_correctable_error", ras_cxl_aer_ce_event_handler, NULL, CXL_AER_CE_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_aer_correctable_error"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_overflow", ras_cxl_overflow_event_handler, NULL, CXL_OVERFLOW_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_overflow"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_generic_event", ras_cxl_generic_event_handler, NULL, CXL_GENERIC_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_generic_event"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_general_media", ras_cxl_general_media_event_handler, NULL, CXL_GENERAL_MEDIA_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_general_media"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_dram", ras_cxl_dram_event_handler, NULL, CXL_DRAM_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_dram"); rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_memory_module", ras_cxl_memory_module_event_handler, NULL, CXL_MEMORY_MODULE_EVENT); if (!rc) num_events++; else if (rc != -EINVAL) 
log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", "cxl", "cxl_memory_module"); #endif if (!num_events) { log(ALL, LOG_INFO, "Failed to trace any of the supported RAS events. Aborting.\n"); rc = -EINVAL; goto err; } data = calloc(cpus, sizeof(*data)); if (!data) { rc = -ENOMEM; goto err; } for (i = 0; i < cpus; i++) { data[i].ras = ras; data[i].cpu = i; } rc = read_ras_event_all_cpus(data, cpus); /* Poll doesn't work on this kernel. Fall back to one thread per CPU */ if (rc == -255) { if (pthread_mutex_init(&ras->db_lock, NULL) != 0) { log(SYSLOG, LOG_INFO, "sqlite db lock init has failed\n"); goto err; } log(SYSLOG, LOG_INFO, "Opening one thread per cpu (%u threads)\n", cpus); for (i = 0; i < cpus; i++) { rc = pthread_create(&data[i].thread, NULL, handle_ras_events_cpu, (void *)&data[i]); if (rc) { log(SYSLOG, LOG_INFO, "Failed to create thread for cpu %d. Aborting.\n", i); /* Cancel every thread created so far, including the one for cpu 0 */ while (--i >= 0) pthread_cancel(data[i].thread); pthread_mutex_destroy(&ras->db_lock); goto err; } } /* Wait for all threads to complete */ for (i = 0; i < cpus; i++) pthread_join(data[i].thread, NULL); pthread_mutex_destroy(&ras->db_lock); } log(SYSLOG, LOG_INFO, "Huh! Something went wrong. Aborting.\n"); err: if (data) free(data); if (pevent) tep_free(pevent); if (ras) { for (i = 0; i < NR_EVENTS; i++) { if (ras->filters[i]) tep_filter_free(ras->filters[i]); } free(ras); } #ifdef HAVE_CPU_FAULT_ISOLATION cpu_infos_free(); #endif return rc; } 07070100000053000081A400000000000000000000000165C04BE400000A28000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-events.h/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version.
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_EVENTS_H #define __RAS_EVENTS_H #include "ras-record.h" #include <pthread.h> #include <time.h> #define MAX_PATH 1024 #define STR(x) #x struct mce_priv; enum { MC_EVENT, MCE_EVENT, AER_EVENT, NON_STANDARD_EVENT, ARM_EVENT, EXTLOG_EVENT, DEVLINK_EVENT, DISKERROR_EVENT, MF_EVENT, CXL_POISON_EVENT, CXL_AER_UE_EVENT, CXL_AER_CE_EVENT, CXL_OVERFLOW_EVENT, CXL_GENERIC_EVENT, CXL_GENERAL_MEDIA_EVENT, CXL_DRAM_EVENT, CXL_MEMORY_MODULE_EVENT, NR_EVENTS }; struct ras_events { char debugfs[MAX_PATH + 1]; char tracing[MAX_PATH + 1]; struct tep_handle *pevent; int page_size; /* Booleans */ unsigned use_uptime: 1; unsigned record_events: 1; /* For timestamp */ time_t uptime_diff; /* For ras-record */ void *db_priv; int db_ref_count; pthread_mutex_t db_lock; /* For the mce handler */ struct mce_priv *mce_priv; /* For ABRT socket */ int socketfd; struct tep_event_filter *filters[NR_EVENTS]; }; struct pthread_data { pthread_t thread; struct tep_handle *pevent; struct ras_events *ras; int cpu; }; /* Should match the code at Kernel's include/linux/edac.h */ enum hw_event_mc_err_type { HW_EVENT_ERR_CORRECTED, HW_EVENT_ERR_UNCORRECTED, HW_EVENT_ERR_FATAL, HW_EVENT_ERR_INFO, }; /* Should match the code at Kernel's /drivers/pci/pcie/aer/aerdrv_errprint.c */ enum hw_event_aer_err_type { HW_EVENT_AER_UNCORRECTED_NON_FATAL, HW_EVENT_AER_UNCORRECTED_FATAL, HW_EVENT_AER_CORRECTED, }; /* Should match the code at Kernel's include/acpi/ghes.h */ enum ghes_severity { GHES_SEV_NO, GHES_SEV_CORRECTED, GHES_SEV_RECOVERABLE, GHES_SEV_PANIC,
}; /* Function prototypes */ int toggle_ras_mc_event(int enable); int ras_offline_mce_event(struct ras_mc_offline_event *event); int handle_ras_events(int record_events); #endif 07070100000054000081A400000000000000000000000165C04BE400001CE3000000000000000000000000000000000000003400000000rasdaemon-0.8.0.49.git+f9cb13b/ras-extlog-handler.c/* * Copyright (C) 2014 Tony Luck <tony.luck@intel.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <ctype.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <stdint.h> #include <traceevent/kbuffer.h> #include "ras-extlog-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" static char *err_type(int etype) { switch (etype) { case 0: return "unknown"; case 1: return "no error"; case 2: return "single-bit ECC"; case 3: return "multi-bit ECC"; case 4: return "single-symbol chipkill ECC"; case 5: return "multi-symbol chipkill ECC"; case 6: return "master abort"; case 7: return "target abort"; case 8: return "parity error"; case 9: return "watchdog timeout"; case 10: return "invalid address"; case 11: return "mirror Broken"; case 12: return "memory sparing"; case 13: return "scrub corrected error"; case 14: return "scrub uncorrected error"; case 15: return "physical memory map-out event"; } 
return "unknown-type"; } static char *err_severity(int severity) { switch (severity) { case 0: return "recoverable"; case 1: return "fatal"; case 2: return "corrected"; case 3: return "informational"; } return "unknown-severity"; } static unsigned long long err_mask(int lsb) { if (lsb == 0xff) return ~0ull; return ~((1ull << lsb) - 1); } #define CPER_MEM_VALID_NODE 0x0008 #define CPER_MEM_VALID_CARD 0x0010 #define CPER_MEM_VALID_MODULE 0x0020 #define CPER_MEM_VALID_BANK 0x0040 #define CPER_MEM_VALID_DEVICE 0x0080 #define CPER_MEM_VALID_ROW 0x0100 #define CPER_MEM_VALID_COLUMN 0x0200 #define CPER_MEM_VALID_BIT_POSITION 0x0400 #define CPER_MEM_VALID_REQUESTOR_ID 0x0800 #define CPER_MEM_VALID_RESPONDER_ID 0x1000 #define CPER_MEM_VALID_TARGET_ID 0x2000 #define CPER_MEM_VALID_RANK_NUMBER 0x8000 #define CPER_MEM_VALID_CARD_HANDLE 0x10000 #define CPER_MEM_VALID_MODULE_HANDLE 0x20000 struct cper_mem_err_compact { unsigned long long validation_bits; unsigned short node; unsigned short card; unsigned short module; unsigned short bank; unsigned short device; unsigned short row; unsigned short column; unsigned short bit_pos; unsigned long long requestor_id; unsigned long long responder_id; unsigned long long target_id; unsigned short rank; unsigned short mem_array_handle; unsigned short mem_dev_handle; }; static char *err_cper_data(const char *c) { const struct cper_mem_err_compact *cpd = (struct cper_mem_err_compact *)c; static char buf[256]; char *p = buf; if (cpd->validation_bits == 0) return ""; p += sprintf(p, " ("); if (cpd->validation_bits & CPER_MEM_VALID_NODE) p += sprintf(p, "node: %d ", cpd->node); if (cpd->validation_bits & CPER_MEM_VALID_CARD) p += sprintf(p, "card: %d ", cpd->card); if (cpd->validation_bits & CPER_MEM_VALID_MODULE) p += sprintf(p, "module: %d ", cpd->module); if (cpd->validation_bits & CPER_MEM_VALID_BANK) p += sprintf(p, "bank: %d ", cpd->bank); if (cpd->validation_bits & CPER_MEM_VALID_DEVICE) p += sprintf(p, "device: %d ", cpd->device); if 
(cpd->validation_bits & CPER_MEM_VALID_ROW) p += sprintf(p, "row: %d ", cpd->row); if (cpd->validation_bits & CPER_MEM_VALID_COLUMN) p += sprintf(p, "column: %d ", cpd->column); if (cpd->validation_bits & CPER_MEM_VALID_BIT_POSITION) p += sprintf(p, "bit_pos: %d ", cpd->bit_pos); if (cpd->validation_bits & CPER_MEM_VALID_REQUESTOR_ID) p += sprintf(p, "req_id: 0x%llx ", cpd->requestor_id); if (cpd->validation_bits & CPER_MEM_VALID_RESPONDER_ID) p += sprintf(p, "resp_id: 0x%llx ", cpd->responder_id); if (cpd->validation_bits & CPER_MEM_VALID_TARGET_ID) p += sprintf(p, "tgt_id: 0x%llx ", cpd->target_id); if (cpd->validation_bits & CPER_MEM_VALID_RANK_NUMBER) p += sprintf(p, "rank: %d ", cpd->rank); if (cpd->validation_bits & CPER_MEM_VALID_CARD_HANDLE) p += sprintf(p, "card_handle: %d ", cpd->mem_array_handle); if (cpd->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) p += sprintf(p, "module_handle: %d ", cpd->mem_dev_handle); p += sprintf(p - 1, ")"); return buf; } static char *uuid_le(const char *uu) { static char uuid[sizeof("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")]; char *p = uuid; int i; static const unsigned char le[16] = {3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15}; for (i = 0; i < 16; i++) { p += sprintf(p, "%.2x", (unsigned char)uu[le[i]]); switch (i) { case 3: case 5: case 7: case 9: *p++ = '-'; break; } } *p = 0; return uuid; } static void report_extlog_mem_event(struct ras_events *ras, struct tep_record *record, struct trace_seq *s, struct ras_extlog_event *ev) { trace_seq_printf(s, "%d %s error: %s physical addr: 0x%llx mask: 0x%llx%s %s %s", ev->error_seq, err_severity(ev->severity), err_type(ev->etype), ev->address, err_mask(ev->pa_mask_lsb), err_cper_data(ev->cper_data), ev->fru_text, uuid_le(ev->fru_id)); } int ras_extlog_mem_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; unsigned long long val; struct ras_events *ras = context; time_t now; struct tm *tm; struct ras_extlog_event 
ev; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels). */ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s ", ev.timestamp); if (tep_get_field_val(s, event, "etype", record, &val, 1) < 0) return -1; ev.etype = val; if (tep_get_field_val(s, event, "err_seq", record, &val, 1) < 0) return -1; ev.error_seq = val; if (tep_get_field_val(s, event, "sev", record, &val, 1) < 0) return -1; ev.severity = val; if (tep_get_field_val(s, event, "pa", record, &val, 1) < 0) return -1; ev.address = val; if (tep_get_field_val(s, event, "pa_mask_lsb", record, &val, 1) < 0) return -1; ev.pa_mask_lsb = val; ev.cper_data = tep_get_field_raw(s, event, "data", record, &len, 1); ev.cper_data_length = len; ev.fru_text = tep_get_field_raw(s, event, "fru_text", record, &len, 1); ev.fru_id = tep_get_field_raw(s, event, "fru_id", record, &len, 1); report_extlog_mem_event(ras, record, s, &ev); ras_store_extlog_mem_record(ras, &ev); return 0; } 07070100000055000081A400000000000000000000000165C04BE400000435000000000000000000000000000000000000003400000000rasdaemon-0.8.0.49.git+f9cb13b/ras-extlog-handler.h/* * Copyright (C) 2014 Tony Luck <tony.luck@intel.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_EXTLOG_HANDLER_H #define __RAS_EXTLOG_HANDLER_H #include <stdint.h> #include "ras-events.h" #include <traceevent/event-parse.h> extern int ras_extlog_mem_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 07070100000056000081A400000000000000000000000165C04BE400000503000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-logger.h/* * Copyright (C) 2013 Petr Holasek <pholasek@redhat.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_LOGGER_H #include <syslog.h> /* * Logging macros */ #ifndef TOOL_NAME #define TOOL_NAME "rasdaemon" #endif #define SYSLOG (1 << 0) #define TERM (1 << 1) #define ALL (SYSLOG | TERM) /* TODO: global logging limit mask */ #define log(where, level, fmt, args...) 
do {\ if (where & SYSLOG)\ syslog(level, fmt, ##args);\ if (where & TERM) {\ fprintf(stderr, "%s: ", TOOL_NAME);\ fprintf(stderr, fmt, ##args);\ fflush(stderr);\ }\ } while (0) #define __RAS_LOGGER_H #endif 07070100000057000081A400000000000000000000000165C04BE400001538000000000000000000000000000000000000003000000000rasdaemon-0.8.0.49.git+f9cb13b/ras-mc-handler.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <traceevent/kbuffer.h> #include "ras-mc-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-page-isolation.h" #include "ras-report.h" int ras_mc_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len; unsigned long long val; struct ras_events *ras = context; time_t now; struct tm *tm; struct ras_mc_event ev; int parsed_fields = 0; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. 
Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels). */ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s ", ev.timestamp); if (tep_get_field_val(s, event, "error_count", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.error_count = val; trace_seq_printf(s, "%d ", ev.error_count); if (tep_get_field_val(s, event, "error_type", record, &val, 1) < 0) goto parse_error; parsed_fields++; switch (val) { case HW_EVENT_ERR_CORRECTED: ev.error_type = "Corrected"; break; case HW_EVENT_ERR_UNCORRECTED: ev.error_type = "Uncorrected"; break; case HW_EVENT_ERR_FATAL: ev.error_type = "Fatal"; break; default: case HW_EVENT_ERR_INFO: ev.error_type = "Info"; } trace_seq_puts(s, ev.error_type); if (ev.error_count > 1) trace_seq_puts(s, " errors:"); else trace_seq_puts(s, " error:"); ev.msg = tep_get_field_raw(s, event, "msg", record, &len, 1); if (!ev.msg) goto parse_error; parsed_fields++; if (*ev.msg) { trace_seq_puts(s, " "); trace_seq_puts(s, ev.msg); } ev.label = tep_get_field_raw(s, event, "label", record, &len, 1); if (!ev.label) goto parse_error; parsed_fields++; if (*ev.label) { trace_seq_puts(s, " on "); trace_seq_puts(s, ev.label); } trace_seq_puts(s, " ("); if (tep_get_field_val(s, event, "mc_index", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.mc_index = val; trace_seq_printf(s, "mc: %d", ev.mc_index); if (tep_get_field_val(s, event, "top_layer", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.top_layer = (signed char)val; if (tep_get_field_val(s, event, "middle_layer", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.middle_layer = (signed char)val; if (tep_get_field_val(s, event, "lower_layer", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.lower_layer = (signed char)val; 
if (ev.top_layer >= 0 || ev.middle_layer >= 0 || ev.lower_layer >= 0) { if (ev.lower_layer >= 0) trace_seq_printf(s, " location: %d:%d:%d", ev.top_layer, ev.middle_layer, ev.lower_layer); else if (ev.middle_layer >= 0) trace_seq_printf(s, " location: %d:%d", ev.top_layer, ev.middle_layer); else trace_seq_printf(s, " location: %d", ev.top_layer); } if (tep_get_field_val(s, event, "address", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.address = val; if (ev.address) trace_seq_printf(s, " address: 0x%08llx", ev.address); if (tep_get_field_val(s, event, "grain_bits", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.grain = val; trace_seq_printf(s, " grain: %lld", ev.grain); if (tep_get_field_val(s, event, "syndrome", record, &val, 1) < 0) goto parse_error; parsed_fields++; ev.syndrome = val; if (val) trace_seq_printf(s, " syndrome: 0x%08llx", ev.syndrome); ev.driver_detail = tep_get_field_raw(s, event, "driver_detail", record, &len, 1); if (!ev.driver_detail) goto parse_error; parsed_fields++; if (*ev.driver_detail) { trace_seq_puts(s, " "); trace_seq_puts(s, ev.driver_detail); } trace_seq_puts(s, ")"); /* Insert data into the database */ ras_store_mc_event(ras, &ev); #ifdef HAVE_MEMORY_CE_PFA /* Account page corrected errors */ if (!strcmp(ev.error_type, "Corrected")) ras_record_page_error(ev.address, ev.error_count, now); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_mc_event(ras, &ev); #endif return 0; parse_error: /* FIXME: add logic here to also store parse errors in the database */ log(ALL, LOG_ERR, "MC error handler: can't parse field #%d\n", parsed_fields); return 0; } 07070100000058000081A400000000000000000000000165C04BE400000414000000000000000000000000000000000000003000000000rasdaemon-0.8.0.49.git+f9cb13b/ras-mc-handler.h/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License
as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_MC_HANDLER_H #define __RAS_MC_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> int ras_mc_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 07070100000059000081A400000000000000000000000165C04BE400003EC4000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-mce-handler.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <ctype.h> #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <stdint.h> #include <traceevent/kbuffer.h> #include "ras-mce-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" /* * The code below was adapted from Andi Kleen/Intel/SuSE mcelog code, * released under the GNU General Public License, v.2 */ static char *cputype_name[] = { [CPU_GENERIC] = "generic CPU", [CPU_P6OLD] = "Intel PPro/P2/P3/old Xeon", [CPU_CORE2] = "Intel Core", /* 65nm and 45nm */ [CPU_K8] = "AMD K8 and derivatives", [CPU_P4] = "Intel P4", [CPU_NEHALEM] = "Intel Xeon 5500 series / Core i3/5/7 (\"Nehalem/Westmere\")", [CPU_DUNNINGTON] = "Intel Xeon 7400 series", [CPU_TULSA] = "Intel Xeon 7100 series", [CPU_INTEL] = "Intel generic architectural MCA", [CPU_XEON75XX] = "Intel Xeon 7500 series", [CPU_SANDY_BRIDGE] = "Sandy Bridge", /* Fill in better name */ [CPU_SANDY_BRIDGE_EP] = "Sandy Bridge EP", /* Fill in better name */ [CPU_IVY_BRIDGE] = "Ivy Bridge", /* Fill in better name */ [CPU_IVY_BRIDGE_EPEX] = "Ivy Bridge EP/EX", /* Fill in better name */ [CPU_HASWELL] = "Haswell", [CPU_HASWELL_EPEX] = "Intel Xeon v3 (Haswell) EP/EX", [CPU_BROADWELL] = "Broadwell", [CPU_BROADWELL_DE] = "Broadwell DE", [CPU_BROADWELL_EPEX] = "Broadwell EP/EX", [CPU_KNIGHTS_LANDING] = "Knights Landing", [CPU_KNIGHTS_MILL] = "Knights Mill", [CPU_SKYLAKE_XEON] = "Skylake server", [CPU_AMD_SMCA] = "AMD Scalable MCA", [CPU_DHYANA] = "Hygon Family 18h Moksha", [CPU_ICELAKE_XEON] = "Icelake server", [CPU_ICELAKE_DE] = "Icelake server D Family", [CPU_TREMONT_D] = "Tremont microserver", [CPU_SAPPHIRERAPIDS] = "Sapphirerapids server", [CPU_EMERALDRAPIDS] = "Emeraldrapids server", }; static enum cputype select_intel_cputype(struct
mce_priv *mce) { if (mce->family == 15) { if (mce->model == 6) return CPU_TULSA; return CPU_P4; } if (mce->family == 6) { if (mce->model >= 0x1a && mce->model != 28) mce->mc_error_support = 1; if (mce->model < 0xf) return CPU_P6OLD; else if (mce->model == 0xf || mce->model == 0x17) /* Merom/Penryn */ return CPU_CORE2; else if (mce->model == 0x1d) return CPU_DUNNINGTON; else if (mce->model == 0x1a || mce->model == 0x2c || mce->model == 0x1e || mce->model == 0x25) return CPU_NEHALEM; else if (mce->model == 0x2e || mce->model == 0x2f) return CPU_XEON75XX; else if (mce->model == 0x2a) return CPU_SANDY_BRIDGE; else if (mce->model == 0x2d) return CPU_SANDY_BRIDGE_EP; else if (mce->model == 0x3a) return CPU_IVY_BRIDGE; else if (mce->model == 0x3e) return CPU_IVY_BRIDGE_EPEX; else if (mce->model == 0x3c || mce->model == 0x45 || mce->model == 0x46) return CPU_HASWELL; else if (mce->model == 0x3f) return CPU_HASWELL_EPEX; else if (mce->model == 0x56) return CPU_BROADWELL_DE; else if (mce->model == 0x4f) return CPU_BROADWELL_EPEX; else if (mce->model == 0x3d) return CPU_BROADWELL; else if (mce->model == 0x57) return CPU_KNIGHTS_LANDING; else if (mce->model == 0x85) return CPU_KNIGHTS_MILL; else if (mce->model == 0x55) return CPU_SKYLAKE_XEON; else if (mce->model == 0x6a) return CPU_ICELAKE_XEON; else if (mce->model == 0x6c) return CPU_ICELAKE_DE; else if (mce->model == 0x86) return CPU_TREMONT_D; else if (mce->model == 0x8f) return CPU_SAPPHIRERAPIDS; else if (mce->model == 0xcf) return CPU_EMERALDRAPIDS; if (mce->model > 0x1a) { log(ALL, LOG_INFO, "Family 6 Model %x CPU: only decoding architectural errors\n", mce->model); return CPU_INTEL; } } if (mce->family > 6) { log(ALL, LOG_INFO, "Family %u Model %x CPU: only decoding architectural errors\n", mce->family, mce->model); return CPU_INTEL; } log(ALL, LOG_INFO, "Unknown Intel CPU type Family %x Model %x\n", mce->family, mce->model); return mce->family == 6 ? 
CPU_P6OLD : CPU_GENERIC; } static int detect_cpu(struct mce_priv *mce) { FILE *f; int ret = 0; char *line = NULL; size_t linelen = 0; enum { CPU_VENDOR = 1, CPU_FAMILY = 2, CPU_MODEL = 4, CPU_MHZ = 8, CPU_FLAGS = 16, CPU_ALL = 0x1f } seen = 0; mce->family = 0; mce->model = 0; mce->mhz = 0; mce->vendor[0] = '\0'; f = fopen("/proc/cpuinfo", "r"); if (!f) { log(ALL, LOG_INFO, "Can't open /proc/cpuinfo\n"); return errno; } while (seen != CPU_ALL && getdelim(&line, &linelen, '\n', f) > 0) { if (sscanf(line, "vendor_id : %63[^\n]", (char *)&mce->vendor) == 1) seen |= CPU_VENDOR; else if (sscanf(line, "cpu family : %d", &mce->family) == 1) seen |= CPU_FAMILY; else if (sscanf(line, "model : %d", &mce->model) == 1) seen |= CPU_MODEL; else if (sscanf(line, "cpu MHz : %lf", &mce->mhz) == 1) seen |= CPU_MHZ; else if (!strncmp(line, "flags", 5) && isspace(line[6])) { if (mce->processor_flags) free(mce->processor_flags); mce->processor_flags = line; line = NULL; linelen = 0; seen |= CPU_FLAGS; } } if (seen != CPU_ALL) { log(ALL, LOG_INFO, "Can't parse /proc/cpuinfo: missing%s%s%s%s%s\n", (seen & CPU_VENDOR) ? "" : " [vendor_id]", (seen & CPU_FAMILY) ? "" : " [cpu family]", (seen & CPU_MODEL) ? "" : " [model]", (seen & CPU_MHZ) ? "" : " [cpu MHz]", (seen & CPU_FLAGS) ? 
"" : " [flags]"); ret = EINVAL; goto ret; } /* Handle only Intel, AMD and Hygon CPUs */ ret = 0; if (!strcmp(mce->vendor, "AuthenticAMD")) { if (mce->family == 15) mce->cputype = CPU_K8; if (strstr(mce->processor_flags, "smca")) { mce->cputype = CPU_AMD_SMCA; goto ret; } if (mce->family > 25) { log(ALL, LOG_INFO, "Can't parse MCE for this AMD CPU yet %d\n", mce->family); ret = EINVAL; } goto ret; } else if (!strcmp(mce->vendor, "HygonGenuine")) { if (mce->family == 24) { mce->cputype = CPU_DHYANA; } goto ret; } else if (!strcmp(mce->vendor, "GenuineIntel")) { mce->cputype = select_intel_cputype(mce); } else { ret = EINVAL; } ret: fclose(f); free(line); return ret; } int register_mce_handler(struct ras_events *ras, unsigned int ncpus) { int rc; struct mce_priv *mce; ras->mce_priv = calloc(1, sizeof(struct mce_priv)); if (!ras->mce_priv) { log(ALL, LOG_INFO, "Can't allocate memory for MCE data\n"); return ENOMEM; } mce = ras->mce_priv; rc = detect_cpu(mce); if (rc) { if (mce->processor_flags) free(mce->processor_flags); free(ras->mce_priv); ras->mce_priv = NULL; return (rc); } switch (mce->cputype) { case CPU_SANDY_BRIDGE_EP: case CPU_IVY_BRIDGE_EPEX: case CPU_HASWELL_EPEX: case CPU_KNIGHTS_LANDING: case CPU_KNIGHTS_MILL: set_intel_imc_log(mce->cputype, ncpus); break; default: break; } return rc; } /* * End of mcelog's code */ static void report_mce_event(struct ras_events *ras, struct tep_record *record, struct trace_seq *s, struct mce_event *e) { time_t now; struct tm *tm; struct mce_priv *mce = ras->mce_priv; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels).
*/ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(e->timestamp, sizeof(e->timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s ", e->timestamp); if (*e->bank_name) trace_seq_printf(s, "%s", e->bank_name); else trace_seq_printf(s, "bank=%x", e->bank); trace_seq_printf(s, ", status= %llx", (long long)e->status); if (*e->error_msg) trace_seq_printf(s, ", %s", e->error_msg); if (*e->mcistatus_msg) trace_seq_printf(s, ", mci=%s", e->mcistatus_msg); if (*e->mcastatus_msg) trace_seq_printf(s, ", mca=%s", e->mcastatus_msg); if (*e->user_action) trace_seq_printf(s, " %s", e->user_action); if (*e->mc_location) trace_seq_printf(s, ", %s", e->mc_location); #if 0 /* * While the logic for decoding tsc is present in mcelog, why * decode/print it, if we already got the uptime from the * tracing event? Let's just discard it for now. */ trace_seq_printf(s, ", tsc= %d", e->tsc); trace_seq_printf(s, ", walltime= %d", e->walltime); #endif trace_seq_printf(s, ", cpu_type= %s", cputype_name[mce->cputype]); trace_seq_printf(s, ", cpu= %d", e->cpu); trace_seq_printf(s, ", socketid= %d", e->socketid); #if 0 /* * The CPU vendor is already reported from mce->cputype */ trace_seq_printf(s, ", cpuvendor= %d", e->cpuvendor); trace_seq_printf(s, ", cpuid= %d", e->cpuid); #endif if (e->ip) trace_seq_printf(s, ", ip= %llx%s", (long long)e->ip, !(e->mcgstatus & MCG_STATUS_EIPV) ?
" (INEXACT)" : ""); if (e->cs) trace_seq_printf(s, ", cs= %x", e->cs); if (e->status & MCI_STATUS_MISCV) trace_seq_printf(s, ", misc= %llx", (long long)e->misc); if (e->status & MCI_STATUS_ADDRV) trace_seq_printf(s, ", addr= %llx", (long long)e->addr); if (e->status & MCI_STATUS_SYNDV) trace_seq_printf(s, ", synd= %llx", (long long)e->synd); if (e->ipid) trace_seq_printf(s, ", ipid= %llx", (long long)e->ipid); if (*e->mcgstatus_msg) trace_seq_printf(s, ", %s", e->mcgstatus_msg); else trace_seq_printf(s, ", mcgstatus= %llx", (long long)e->mcgstatus); if (e->mcgcap) trace_seq_printf(s, ", mcgcap= %llx", (long long)e->mcgcap); trace_seq_printf(s, ", apicid= %x", e->apicid); if (!e->vdata_len) return; if (strlen(e->frutext)) { trace_seq_printf(s, ", FRU Text= %s", e->frutext); trace_seq_printf(s, ", Vendor Data= "); for (int i = 2; i < e->vdata_len / 8; i++) { /* vdata is uint64_t: cast so %llx is correct on 32-bit builds too */ trace_seq_printf(s, "0x%llx", (unsigned long long)e->vdata[i]); trace_seq_printf(s, " "); } } else { trace_seq_printf(s, ", Vendor Data= "); for (int i = 0; i < e->vdata_len / 8; i++) { trace_seq_printf(s, "0x%llx", (unsigned long long)e->vdata[i]); trace_seq_printf(s, " "); } } /* * FIXME: The original mcelog userspace tool uses DMI to map from * address to DIMM. From the comments there, the code there doesn't * take interleaving sets into account. Also, it is known that * BIOS is generally not reliable enough to associate DIMM labels * with addresses. * As, in theory, we shouldn't be receiving memory error reports via * MCE, since they should go via EDAC traces, let's not do it.
*/ } static int report_mce_offline(struct trace_seq *s, struct mce_event *mce, struct mce_priv *priv) { time_t now; struct tm *tm; time(&now); tm = localtime(&now); if (tm) strftime(mce->timestamp, sizeof(mce->timestamp), "%Y-%m-%d %H:%M:%S %z", tm); trace_seq_printf(s, "%s,", mce->timestamp); if (*mce->bank_name) trace_seq_printf(s, " %s,", mce->bank_name); else trace_seq_printf(s, " bank=%x,", mce->bank); if (*mce->mcastatus_msg) trace_seq_printf(s, " mca: %s,", mce->mcastatus_msg); if (*mce->mcistatus_msg) trace_seq_printf(s, " mci: %s,", mce->mcistatus_msg); if (*mce->mc_location) trace_seq_printf(s, " Locn: %s,", mce->mc_location); if (*mce->error_msg) trace_seq_printf(s, " Error Msg: %s\n", mce->error_msg); return 0; } int ras_offline_mce_event(struct ras_mc_offline_event *event) { int rc = 0; struct trace_seq s; struct mce_event *mce = NULL; struct mce_priv *priv = NULL; mce = (struct mce_event *)calloc(1, sizeof(struct mce_event)); if (!mce) { log(TERM, LOG_ERR, "Can't allocate memory for mce struct\n"); return errno; } priv = (struct mce_priv *)calloc(1, sizeof(struct mce_priv)); if (!priv) { log(TERM, LOG_ERR, "Can't allocate memory for mce_priv struct\n"); free(mce); return errno; } if (event->smca) { priv->cputype = CPU_AMD_SMCA; priv->family = event->family; priv->model = event->model; } else { rc = detect_cpu(priv); if (rc) { log(TERM, LOG_ERR, "Failed to detect CPU\n"); goto free_mce; } } mce->status = event->status; mce->bank = event->bank; switch (priv->cputype) { case CPU_AMD_SMCA: mce->synd = event->synd; mce->ipid = event->ipid; if (!mce->ipid || !mce->status) { log(TERM, LOG_ERR, "%s MSR required.\n", mce->ipid ? 
"Status" : "Ipid"); rc = -EINVAL; goto free_mce; } decode_smca_error(mce, priv); amd_decode_errcode(mce); break; default: break; } trace_seq_init(&s); report_mce_offline(&s, mce, priv); trace_seq_do_printf(&s); fflush(stdout); trace_seq_destroy(&s); free_mce: free(priv); free(mce); return rc; } int ras_mce_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { unsigned long long val; struct ras_events *ras = context; struct mce_priv *mce = ras->mce_priv; struct mce_event e; int rc = 0; memset(&e, 0, sizeof(e)); /* Parse the MCE error data */ if (tep_get_field_val(s, event, "mcgcap", record, &val, 1) < 0) return -1; e.mcgcap = val; if (tep_get_field_val(s, event, "mcgstatus", record, &val, 1) < 0) return -1; e.mcgstatus = val; if (tep_get_field_val(s, event, "status", record, &val, 1) < 0) return -1; e.status = val; if (tep_get_field_val(s, event, "addr", record, &val, 1) < 0) return -1; e.addr = val; if (tep_get_field_val(s, event, "misc", record, &val, 1) < 0) return -1; e.misc = val; if (tep_get_field_val(s, event, "ip", record, &val, 1) < 0) return -1; e.ip = val; if (tep_get_field_val(s, event, "tsc", record, &val, 1) < 0) return -1; e.tsc = val; if (tep_get_field_val(s, event, "walltime", record, &val, 1) < 0) return -1; e.walltime = val; if (tep_get_field_val(s, event, "cpu", record, &val, 1) < 0) return -1; e.cpu = val; if (tep_get_field_val(s, event, "cpuid", record, &val, 1) < 0) return -1; e.cpuid = val; if (tep_get_field_val(s, event, "apicid", record, &val, 1) < 0) return -1; e.apicid = val; if (tep_get_field_val(s, event, "socketid", record, &val, 1) < 0) return -1; e.socketid = val; if (tep_get_field_val(s, event, "cs", record, &val, 1) < 0) return -1; e.cs = val; if (tep_get_field_val(s, event, "bank", record, &val, 1) < 0) return -1; e.bank = val; if (tep_get_field_val(s, event, "cpuvendor", record, &val, 1) < 0) return -1; e.cpuvendor = val; /* Get New entries */ if (tep_get_field_val(s, event, 
"synd", record, &val, 1) < 0) return -1; e.synd = val; if (tep_get_field_val(s, event, "ipid", record, &val, 1) < 0) return -1; e.ipid = val; /* Get Vendor-specific Data, if any */ e.vdata = tep_get_field_raw(s, event, "v_data", record, &e.vdata_len, 1); switch (mce->cputype) { case CPU_GENERIC: break; case CPU_K8: rc = parse_amd_k8_event(ras, &e); break; case CPU_AMD_SMCA: case CPU_DHYANA: rc = parse_amd_smca_event(ras, &e); break; default: /* All other CPU types are Intel */ rc = parse_intel_event(ras, &e); } if (rc) return rc; if (!*e.error_msg && *e.mcastatus_msg) mce_snprintf(e.error_msg, "%s", e.mcastatus_msg); report_mce_event(ras, record, s, &e); #ifdef HAVE_SQLITE3 ras_store_mce_record(ras, &e); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_mce_event(ras, &e); #endif return 0; } 0707010000005A000081A400000000000000000000000165C04BE4000016C0000000000000000000000000000000000000003100000000rasdaemon-0.8.0.49.git+f9cb13b/ras-mce-handler.h/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details.
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_MCE_HANDLER_H #define __RAS_MCE_HANDLER_H #include <stdint.h> #include "ras-events.h" #include <traceevent/event-parse.h> enum cputype { CPU_GENERIC, CPU_P6OLD, CPU_CORE2, /* 65nm and 45nm */ CPU_K8, CPU_P4, CPU_NEHALEM, CPU_DUNNINGTON, CPU_TULSA, CPU_INTEL, /* Intel architectural errors */ CPU_XEON75XX, CPU_SANDY_BRIDGE, CPU_SANDY_BRIDGE_EP, CPU_IVY_BRIDGE, CPU_IVY_BRIDGE_EPEX, CPU_HASWELL, CPU_HASWELL_EPEX, CPU_BROADWELL, CPU_BROADWELL_DE, CPU_BROADWELL_EPEX, CPU_KNIGHTS_LANDING, CPU_KNIGHTS_MILL, CPU_SKYLAKE_XEON, CPU_AMD_SMCA, CPU_DHYANA, CPU_ICELAKE_XEON, CPU_ICELAKE_DE, CPU_TREMONT_D, CPU_SAPPHIRERAPIDS, CPU_EMERALDRAPIDS, }; struct mce_event { /* Unparsed data, obtained directly from MCE tracing */ uint64_t mcgcap; uint64_t mcgstatus; uint64_t status; uint64_t addr; uint64_t misc; uint64_t ip; uint64_t tsc; uint64_t walltime; uint32_t cpu; uint32_t cpuid; uint32_t apicid; uint32_t socketid; uint8_t cs; uint8_t bank; uint8_t cpuvendor; uint64_t synd; /* MCA_SYND MSR: only valid on SMCA systems */ uint64_t ipid; /* MCA_IPID MSR: only valid on SMCA systems */ int32_t vdata_len; const uint64_t *vdata; /* Parsed data */ char frutext[17]; char timestamp[64]; char bank_name[64]; char error_msg[4096]; char mcgstatus_msg[256]; char mcistatus_msg[1024]; char mcastatus_msg[1024]; char user_action[4096]; char mc_location[256]; }; struct mce_priv { /* CPU Info */ char vendor[64]; unsigned int family, model; double mhz; enum cputype cputype; unsigned mc_error_support:1; char *processor_flags; }; #define mce_snprintf(buf, fmt, arg...) 
do { \ unsigned __n = strlen(buf); \ unsigned __len = sizeof(buf) - __n; \ if (!__len) \ break; \ if (__n) { \ snprintf(buf + __n, __len, " "); \ __len--; \ __n++; \ } \ snprintf(buf + __n, __len, fmt, ##arg); \ } while (0) /* register and handling routines */ int register_mce_handler(struct ras_events *ras, unsigned ncpus); int ras_mce_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); /* enables intel iMC logs */ int set_intel_imc_log(enum cputype cputype, unsigned ncpus); /* Undertake AMD SMCA Error Decoding */ void decode_smca_error(struct mce_event *e, struct mce_priv *m); void amd_decode_errcode(struct mce_event *e); /* Per-CPU-type decoders for Intel CPUs */ void p4_decode_model(struct mce_event *e); void core2_decode_model(struct mce_event *e); void p6old_decode_model(struct mce_event *e); void nehalem_decode_model(struct mce_event *e); void xeon75xx_decode_model(struct mce_event *e); void dunnington_decode_model(struct mce_event *e); void snb_decode_model(struct ras_events *ras, struct mce_event *e); void ivb_decode_model(struct ras_events *ras, struct mce_event *e); void hsw_decode_model(struct ras_events *ras, struct mce_event *e); void knl_decode_model(struct ras_events *ras, struct mce_event *e); void tulsa_decode_model(struct mce_event *e); void broadwell_de_decode_model(struct ras_events *ras, struct mce_event *e); void broadwell_epex_decode_model(struct ras_events *ras, struct mce_event *e); void skylake_s_decode_model(struct ras_events *ras, struct mce_event *e); void i10nm_decode_model(enum cputype cputype, struct ras_events *ras, struct mce_event *e); /* AMD error code decode function */ void decode_amd_errcode(struct mce_event *e); /* Software defined banks */ #define MCE_EXTENDED_BANK 128 #define MCI_THRESHOLD_OVER (1ULL<<48) /* threshold error count overflow */ #define MCI_STATUS_VAL (1ULL<<63) /* valid error */ #define MCI_STATUS_OVER (1ULL<<62) /* previous errors lost */ #define 
MCI_STATUS_UC (1ULL<<61) /* uncorrected error */ #define MCI_STATUS_EN (1ULL<<60) /* error enabled */ #define MCI_STATUS_MISCV (1ULL<<59) /* misc error reg. valid */ #define MCI_STATUS_ADDRV (1ULL<<58) /* addr reg. valid */ #define MCI_STATUS_PCC (1ULL<<57) /* processor context corrupt */ #define MCI_STATUS_S (1ULL<<56) /* signalled */ #define MCI_STATUS_AR (1ULL<<55) /* action-required */ /* AMD-specific bits */ #define MCI_STATUS_TCC (1ULL<<55) /* Task context corrupt */ #define MCI_STATUS_SYNDV (1ULL<<53) /* synd reg. valid */ /* uncorrected error, deferred exception */ #define MCI_STATUS_DEFERRED (1ULL<<44) #define MCI_STATUS_POISON (1ULL<<43) /* access to poisoned data */ #define MCG_STATUS_RIPV (1ULL<<0) /* restart ip valid */ #define MCG_STATUS_EIPV (1ULL<<1) /* eip points to correct instruction */ #define MCG_STATUS_MCIP (1ULL<<2) /* machine check in progress */ #define MCG_STATUS_LMCE (1ULL<<3) /* local machine check signaled */ /* These functions are defined in the per-CPU-vendor C files */ int parse_intel_event(struct ras_events *ras, struct mce_event *e); int parse_amd_k8_event(struct ras_events *ras, struct mce_event *e); int parse_amd_smca_event(struct ras_events *ras, struct mce_event *e); #endif 0707010000005B000081A400000000000000000000000165C04BE400001451000000000000000000000000000000000000003C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-memory-failure-handler.c/* * Copyright (c) Huawei Technologies Co., Ltd. 2020. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details.
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" #include "ras-memory-failure-handler.h" /* Memory failure - various types of pages */ enum mf_action_page_type { MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER, MF_MSG_SLAB, MF_MSG_DIFFERENT_COMPOUND, MF_MSG_POISONED_HUGE, MF_MSG_HUGE, MF_MSG_FREE_HUGE, MF_MSG_NON_PMD_HUGE, MF_MSG_UNMAP_FAILED, MF_MSG_DIRTY_SWAPCACHE, MF_MSG_CLEAN_SWAPCACHE, MF_MSG_DIRTY_MLOCKED_LRU, MF_MSG_CLEAN_MLOCKED_LRU, MF_MSG_DIRTY_UNEVICTABLE_LRU, MF_MSG_CLEAN_UNEVICTABLE_LRU, MF_MSG_DIRTY_LRU, MF_MSG_CLEAN_LRU, MF_MSG_TRUNCATED_LRU, MF_MSG_BUDDY, MF_MSG_BUDDY_2ND, MF_MSG_DAX, MF_MSG_UNSPLIT_THP, MF_MSG_UNKNOWN, }; /* Action results for various types of pages */ enum mf_action_result { MF_IGNORED, /* Error: cannot be handled */ MF_FAILED, /* Error: handling failed */ MF_DELAYED, /* Will be handled later */ MF_RECOVERED, /* Successfully recovered */ }; /* memory failure page types */ static const struct { int type; const char *page_type; } mf_page_type[] = { { MF_MSG_KERNEL, "reserved kernel page" }, { MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page"}, { MF_MSG_SLAB, "kernel slab page"}, { MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking"}, { MF_MSG_POISONED_HUGE, "huge page already hardware poisoned"}, { MF_MSG_HUGE, "huge page"}, { MF_MSG_FREE_HUGE, "free huge page"}, { MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page"}, { MF_MSG_UNMAP_FAILED, "unmapping failed page"}, { MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page"}, { MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page"}, { MF_MSG_DIRTY_MLOCKED_LRU, "dirty mlocked LRU page"}, { MF_MSG_CLEAN_MLOCKED_LRU, "clean mlocked LRU page"}, { MF_MSG_DIRTY_UNEVICTABLE_LRU, "dirty unevictable LRU page"}, { MF_MSG_CLEAN_UNEVICTABLE_LRU, "clean unevictable LRU page"}, { MF_MSG_DIRTY_LRU, "dirty LRU page"}, { MF_MSG_CLEAN_LRU, "clean LRU page"}, { MF_MSG_TRUNCATED_LRU, "already truncated LRU page"}, { MF_MSG_BUDDY, 
"free buddy page"}, { MF_MSG_BUDDY_2ND, "free buddy page (2nd try)"}, { MF_MSG_DAX, "dax page"}, { MF_MSG_UNSPLIT_THP, "unsplit thp"}, { MF_MSG_UNKNOWN, "unknown page"}, }; /* memory failure action results */ static const struct { int result; const char *action_result; } mf_action_result[] = { { MF_IGNORED, "Ignored" }, { MF_FAILED, "Failed" }, { MF_DELAYED, "Delayed" }, { MF_RECOVERED, "Recovered" }, }; static const char *get_page_type(int page_type) { unsigned int i; for (i = 0; i < ARRAY_SIZE(mf_page_type); i++) if (mf_page_type[i].type == page_type) return mf_page_type[i].page_type; return "unknown page"; } static const char *get_action_result(int result) { unsigned int i; for (i = 0; i < ARRAY_SIZE(mf_action_result); i++) if (mf_action_result[i].result == result) return mf_action_result[i].action_result; return "unknown"; } int ras_memory_failure_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { unsigned long long val; struct ras_events *ras = context; time_t now; struct tm *tm; struct ras_mf_event ev; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. * On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels). 
*/ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); else strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp)); trace_seq_printf(s, "%s ", ev.timestamp); if (tep_get_field_val(s, event, "pfn", record, &val, 1) < 0) return -1; sprintf(ev.pfn, "0x%llx", val); trace_seq_printf(s, "pfn=0x%llx ", val); if (tep_get_field_val(s, event, "type", record, &val, 1) < 0) return -1; ev.page_type = get_page_type(val); trace_seq_printf(s, "page_type=%s ", ev.page_type); if (tep_get_field_val(s, event, "result", record, &val, 1) < 0) return -1; ev.action_result = get_action_result(val); trace_seq_printf(s, "action_result=%s ", ev.action_result); /* Store data into the SQLite DB */ #ifdef HAVE_SQLITE3 ras_store_mf_event(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_mf_event(ras, &ev); #endif return 0; } 0707010000005C000081A400000000000000000000000165C04BE40000036D000000000000000000000000000000000000003C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-memory-failure-handler.h/* * Copyright (c) Huawei Technologies Co., Ltd. 2020. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #ifndef __RAS_MEMORY_FAILURE_HANDLER_H #define __RAS_MEMORY_FAILURE_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> int ras_memory_failure_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); #endif 0707010000005D000081A400000000000000000000000165C04BE400001812000000000000000000000000000000000000003A00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-non-standard-handler.c/* * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #include <stdio.h> #include <stdlib.h> #include <stdbool.h> #include <string.h> #include <unistd.h> #include <traceevent/kbuffer.h> #include "ras-non-standard-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-report.h" static struct ras_ns_ev_decoder *ras_ns_ev_dec_list; void print_le_hex(struct trace_seq *s, const uint8_t *buf, int index) { trace_seq_printf(s, "%02x%02x%02x%02x", buf[index + 3], buf[index + 2], buf[index + 1], buf[index]); } static char *uuid_le(const char *uu) { static char uuid[sizeof("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")]; char *p = uuid; int i; static const unsigned char le[16] = {3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15}; for (i = 0; i < 16; i++) { p += sprintf(p, "%.2x", (unsigned char)uu[le[i]]); switch (i) { case 3: case 5: case 7: case 9: *p++ = '-'; break; } } *p = 0; return uuid; } int register_ns_ev_decoder(struct ras_ns_ev_decoder *ns_ev_decoder) { struct ras_ns_ev_decoder *list; if (!ns_ev_decoder) return -1; ns_ev_decoder->next = NULL; #ifdef HAVE_SQLITE3 
ns_ev_decoder->stmt_dec_record = NULL; #endif if (!ras_ns_ev_dec_list) { ras_ns_ev_dec_list = ns_ev_decoder; } else { list = ras_ns_ev_dec_list; while (list->next) list = list->next; list->next = ns_ev_decoder; } return 0; } int ras_ns_add_vendor_tables(struct ras_events *ras) { struct ras_ns_ev_decoder *ns_ev_decoder; int error = 0; #ifdef HAVE_SQLITE3 if (!ras) return -1; ns_ev_decoder = ras_ns_ev_dec_list; while (ns_ev_decoder) { if (ns_ev_decoder->add_table && !ns_ev_decoder->stmt_dec_record) { error = ns_ev_decoder->add_table(ras, ns_ev_decoder); if (error) break; } ns_ev_decoder = ns_ev_decoder->next; } if (error) return -1; #endif return 0; } static int find_ns_ev_decoder(const char *sec_type, struct ras_ns_ev_decoder **p_ns_ev_dec) { struct ras_ns_ev_decoder *ns_ev_decoder; int match = 0; ns_ev_decoder = ras_ns_ev_dec_list; while (ns_ev_decoder) { if (strcmp(uuid_le(sec_type), ns_ev_decoder->sec_type) == 0) { *p_ns_ev_dec = ns_ev_decoder; match = 1; break; } ns_ev_decoder = ns_ev_decoder->next; } if (!match) return -1; return 0; } void ras_ns_finalize_vendor_tables(void) { #ifdef HAVE_SQLITE3 struct ras_ns_ev_decoder *ns_ev_decoder = ras_ns_ev_dec_list; while (ns_ev_decoder) { if (ns_ev_decoder->stmt_dec_record) { ras_mc_finalize_vendor_table(ns_ev_decoder->stmt_dec_record); ns_ev_decoder->stmt_dec_record = NULL; } ns_ev_decoder = ns_ev_decoder->next; } #endif } static void unregister_ns_ev_decoder(void) { #ifdef HAVE_SQLITE3 ras_ns_finalize_vendor_tables(); #endif ras_ns_ev_dec_list = NULL; } int ras_non_standard_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context) { int len, i, line_count; unsigned long long val; struct ras_events *ras = context; time_t now; struct tm *tm; struct ras_non_standard_event ev; struct ras_ns_ev_decoder *ns_ev_decoder; /* * Newer kernels (3.10-rc1 or upper) provide an uptime clock. 
* On previous kernels, the way to properly generate an event would * be to inject a fake one, measure its timestamp and diff it against * gettimeofday. We won't do it here. Instead, let's use uptime, * falling-back to the event report's time, if "uptime" clock is * not available (legacy kernels). */ if (ras->use_uptime) now = record->ts / user_hz + ras->uptime_diff; else now = time(NULL); tm = localtime(&now); if (tm) strftime(ev.timestamp, sizeof(ev.timestamp), "%Y-%m-%d %H:%M:%S %z", tm); else strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp)); trace_seq_printf(s, "%s ", ev.timestamp); if (tep_get_field_val(s, event, "sev", record, &val, 1) < 0) return -1; switch (val) { case GHES_SEV_NO: ev.severity = "Informational"; break; case GHES_SEV_CORRECTED: ev.severity = "Corrected"; break; case GHES_SEV_RECOVERABLE: ev.severity = "Recoverable"; break; default: case GHES_SEV_PANIC: ev.severity = "Fatal"; } trace_seq_printf(s, " %s", ev.severity); ev.sec_type = tep_get_field_raw(s, event, "sec_type", record, &len, 1); if (!ev.sec_type) return -1; if (strcmp(uuid_le(ev.sec_type), "e8ed898d-df16-43cc-8ecc-54f060ef157f") == 0) trace_seq_printf(s, "\n section type: %s", "Ampere Specific Error\n"); else trace_seq_printf(s, " section type: %s", uuid_le(ev.sec_type)); ev.fru_text = tep_get_field_raw(s, event, "fru_text", record, &len, 1); ev.fru_id = tep_get_field_raw(s, event, "fru_id", record, &len, 1); trace_seq_printf(s, " fru text: %s fru id: %s ", ev.fru_text, uuid_le(ev.fru_id)); if (tep_get_field_val(s, event, "len", record, &val, 1) < 0) return -1; ev.length = val; trace_seq_printf(s, " length: %d", ev.length); ev.error = tep_get_field_raw(s, event, "buf", record, &len, 1); if (!ev.error) return -1; if (!find_ns_ev_decoder(ev.sec_type, &ns_ev_decoder)) { ns_ev_decoder->decode(ras, ns_ev_decoder, s, &ev); } else { len = ev.length; i = 0; line_count = 0; trace_seq_printf(s, " error:\n %08x: ", i); while (len >= 4) { print_le_hex(s, ev.error, i); i += 4; len -= 4; if (++line_count == 4) { trace_seq_printf(s, "\n %08x: ", i);
line_count = 0; } else trace_seq_printf(s, " "); } } /* Insert data into the database */ #ifdef HAVE_SQLITE3 ras_store_non_standard_record(ras, &ev); #endif #ifdef HAVE_ABRT_REPORT /* Report event to ABRT */ ras_report_non_standard_event(ras, &ev); #endif return 0; } __attribute__((destructor)) static void ns_exit(void) { unregister_ns_ev_decoder(); } 0707010000005E000081A400000000000000000000000165C04BE40000062A000000000000000000000000000000000000003A00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-non-standard-handler.h/* * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details.
*/ #ifndef __RAS_NON_STANDARD_HANDLER_H #define __RAS_NON_STANDARD_HANDLER_H #include "ras-events.h" #include <traceevent/event-parse.h> struct ras_ns_ev_decoder { struct ras_ns_ev_decoder *next; const char *sec_type; int (*add_table)(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder); int (*decode)(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder, struct trace_seq *s, struct ras_non_standard_event *event); #ifdef HAVE_SQLITE3 #include <sqlite3.h> sqlite3_stmt *stmt_dec_record; #endif }; int ras_non_standard_event_handler(struct trace_seq *s, struct tep_record *record, struct tep_event *event, void *context); void print_le_hex(struct trace_seq *s, const uint8_t *buf, int index); #ifdef HAVE_NON_STANDARD int register_ns_ev_decoder(struct ras_ns_ev_decoder *ns_ev_decoder); int ras_ns_add_vendor_tables(struct ras_events *ras); void ras_ns_finalize_vendor_tables(void); #else static inline int register_ns_ev_decoder(struct ras_ns_ev_decoder *ns_ev_decoder) { return 0; }; #endif #endif 0707010000005F000081A400000000000000000000000165C04BE40000224E000000000000000000000000000000000000003400000000rasdaemon-0.8.0.49.git+f9cb13b/ras-page-isolation.c/* * Copyright (c) Huawei Technologies Co., Ltd. 2020-2020. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #include <ctype.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/stat.h> #include <fcntl.h> #include <errno.h> #include "ras-logger.h" #include "ras-page-isolation.h" #define PARSED_ENV_LEN 50 static const struct config threshold_units[] = { { "m", 1000 }, { "k", 1000 }, { "", 1 }, {} }; static const struct config cycle_units[] = { { "d", 24 }, { "h", 60 }, { "m", 60 }, { "s", 1 }, {} }; static struct isolation threshold = { .name = "PAGE_CE_THRESHOLD", .units = threshold_units, .env = "50", .unit = "", }; static struct isolation cycle = { .name = "PAGE_CE_REFRESH_CYCLE", .units = cycle_units, .env = "24h", .unit = "h", }; static const char *kernel_offline[] = { [OFFLINE_SOFT] = "/sys/devices/system/memory/soft_offline_page", [OFFLINE_HARD] = "/sys/devices/system/memory/hard_offline_page", [OFFLINE_SOFT_THEN_HARD] = "/sys/devices/system/memory/soft_offline_page", }; static const struct config offline_choice[] = { { "off", OFFLINE_OFF }, { "account", OFFLINE_ACCOUNT }, { "soft", OFFLINE_SOFT }, { "hard", OFFLINE_HARD }, { "soft-then-hard", OFFLINE_SOFT_THEN_HARD }, {} }; static const char *page_state[] = { [PAGE_ONLINE] = "online", [PAGE_OFFLINE] = "offlined", [PAGE_OFFLINE_FAILED] = "offline-failed", }; static enum otype offline = OFFLINE_SOFT; static struct rb_root page_records; static void page_offline_init(void) { const char *env = "PAGE_CE_ACTION"; char *choice = getenv(env); const struct config *c = NULL; int matched = 0; if (choice) { for (c = offline_choice; c->name; c++) { if (!strcasecmp(choice, c->name)) { offline = c->val; matched = 1; break; } } } if (!matched) log(TERM, LOG_INFO, "Improper %s, set to default soft\n", env); if (offline > OFFLINE_ACCOUNT && access(kernel_offline[offline], W_OK)) { log(TERM, LOG_INFO, "Kernel does not support page offline interface\n"); offline = OFFLINE_ACCOUNT; } log(TERM, LOG_INFO, "Page offline choice on Corrected Errors is %s\n", offline_choice[offline].name); } 
static void parse_isolation_env(struct isolation *config) { char *env = getenv(config->name); char *unit = NULL; const struct config *units = NULL; int i, no_unit = 0; int valid = 0; int unit_matched = 0; unsigned long value, tmp; /* check if env is valid */ if (env && strlen(env)) { /* All characters before the unit must be digits */ for (i = 0; i < strlen(env) - 1; i++) { if (!isdigit(env[i])) goto parse; } if (sscanf(env, "%lu", &value) < 1 || !value) goto parse; /* check if the unit is valid */ unit = env + strlen(env) - 1; /* no unit: every character is part of the value */ if (isdigit(*unit)) { valid = 1; no_unit = 1; goto parse; } for (units = config->units; units->name; units++) { /* both the value characters and the unit character are valid */ if (!strcasecmp(unit, units->name)) { valid = 1; no_unit = 0; break; } } } parse: /* if invalid, use default env */ if (valid) { config->env = env; if (!no_unit) config->unit = unit; } else { log(TERM, LOG_INFO, "Improper %s, set to default %s.\n", config->name, config->env); } /* if env value string is greater than ulong_max, truncate the last digit */ sscanf(config->env, "%lu", &value); for (units = config->units; units->name; units++) { if (!strcasecmp(config->unit, units->name)) unit_matched = 1; if (unit_matched) { tmp = value; value *= units->val; if (tmp != 0 && value / tmp != units->val) config->overflow = true; } } config->val = value; /* env already ends with its unit unless no_unit is set, so avoid printing the unit twice */ config->unit = no_unit ?
config->unit : ""; } static void parse_env_string(struct isolation *config, char *str, unsigned int size) { int i; if (config->overflow) { /* when overflow, use basic unit */ for (i = 0; config->units[i].name; i++) ; snprintf(str, size, "%lu%s", config->val, config->units[i - 1].name); log(TERM, LOG_INFO, "%s is set overflow(%s), truncate it\n", config->name, config->env); } else { snprintf(str, size, "%s%s", config->env, config->unit); } } static void page_isolation_init(void) { char threshold_string[PARSED_ENV_LEN]; char cycle_string[PARSED_ENV_LEN]; /** * It's unnecessary to parse threshold configuration when offline * choice is off. */ if (offline == OFFLINE_OFF) return; parse_isolation_env(&threshold); parse_isolation_env(&cycle); parse_env_string(&threshold, threshold_string, sizeof(threshold_string)); parse_env_string(&cycle, cycle_string, sizeof(cycle_string)); log(TERM, LOG_INFO, "Threshold of memory Corrected Errors is %s / %s\n", threshold_string, cycle_string); } void ras_page_account_init(void) { page_offline_init(); page_isolation_init(); } static int do_page_offline(unsigned long long addr, enum otype type) { int fd, rc; char buf[20]; fd = open(kernel_offline[type], O_WRONLY); if (fd == -1) { log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, kernel_offline[type]); return -1; } sprintf(buf, "%#llx", addr); rc = write(fd, buf, strlen(buf)); if (rc < 0) { log(TERM, LOG_ERR, "page offline addr(%s) by %s failed, errno:%d\n", buf, kernel_offline[type], errno); } close(fd); return rc; } static void page_offline(struct page_record *pr) { unsigned long long addr = pr->addr; int ret; /* Offlining page is not required */ if (offline <= OFFLINE_ACCOUNT) { log(TERM, LOG_INFO, "PAGE_CE_ACTION=%s, ignore to offline page at %#llx\n", offline_choice[offline].name, addr); return; } /* Ignore offlined pages */ if (pr->offlined == PAGE_OFFLINE) { log(TERM, LOG_INFO, "page at %#llx is already offlined, ignore\n", addr); return; } /* Time to silence this noisy 
page */ if (offline == OFFLINE_SOFT_THEN_HARD) { ret = do_page_offline(addr, OFFLINE_SOFT); if (ret < 0) ret = do_page_offline(addr, OFFLINE_HARD); } else { ret = do_page_offline(addr, offline); } pr->offlined = ret < 0 ? PAGE_OFFLINE_FAILED : PAGE_OFFLINE; log(TERM, LOG_INFO, "Result of offlining page at %#llx: %s\n", addr, page_state[pr->offlined]); } static void page_record(struct page_record *pr, unsigned int count, time_t time) { unsigned long period = time - pr->start; unsigned long tolerate; if (period >= cycle.val) { /** * Since we don't refresh automatically, the period between two * occurrences may be longer than the pre-configured refresh cycle. * In that case, scale the tolerated error count to the whole period, * allowing up to the pre-configured threshold per refresh cycle. */ tolerate = (period / (double)cycle.val) * threshold.val; pr->count -= (tolerate > pr->count) ? pr->count : tolerate; pr->start = time; pr->excess = 0; } pr->count += count; if (pr->count >= threshold.val) { log(TERM, LOG_INFO, "Corrected Errors at %#llx exceeded threshold\n", pr->addr); /** * Back up the CE count of the current cycle to enable the next round, * which should never actually happen if we could disable overflow * completely within the same time unit (but sadly we can't).
*/ pr->excess += pr->count; pr->count = 0; page_offline(pr); } } static struct page_record *page_lookup_insert(unsigned long long addr) { struct rb_node **entry = &page_records.rb_node; struct rb_node *parent = NULL; struct page_record *pr = NULL, *find = NULL; while (*entry) { parent = *entry; pr = rb_entry(parent, struct page_record, entry); if (addr == pr->addr) { return pr; } else if (addr < pr->addr) { entry = &(*entry)->rb_left; } else { entry = &(*entry)->rb_right; } } find = calloc(1, sizeof(struct page_record)); if (!find) { log(TERM, LOG_ERR, "No memory for page records\n"); return NULL; } find->addr = addr; rb_link_node(&find->entry, parent, entry); rb_insert_color(&find->entry, &page_records); return find; } void ras_record_page_error(unsigned long long addr, unsigned int count, time_t time) { struct page_record *pr = NULL; if (offline == OFFLINE_OFF) return; pr = page_lookup_insert(addr & PAGE_MASK); if (pr) { if (!pr->start) pr->start = time; page_record(pr, count, time); } } 07070100000060000081A400000000000000000000000165C04BE4000005B6000000000000000000000000000000000000003400000000rasdaemon-0.8.0.49.git+f9cb13b/ras-page-isolation.h/* * Copyright (c) Huawei Technologies Co., Ltd. 2020-2020. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
*/ #ifndef __RAS_PAGE_ISOLATION_H #define __RAS_PAGE_ISOLATION_H #include <time.h> #include <stdbool.h> #include "rbtree.h" #define PAGE_SHIFT 12 #define PAGE_SIZE (1 << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1)) struct config { char *name; unsigned long val; }; enum otype { OFFLINE_OFF, OFFLINE_ACCOUNT, OFFLINE_SOFT, OFFLINE_HARD, OFFLINE_SOFT_THEN_HARD, }; enum pstate { PAGE_ONLINE, PAGE_OFFLINE, PAGE_OFFLINE_FAILED, }; struct page_record { struct rb_node entry; unsigned long long addr; time_t start; enum pstate offlined; unsigned long count; unsigned long excess; }; struct isolation { char *name; char *env; const struct config *units; unsigned long val; bool overflow; char *unit; }; void ras_page_account_init(void); void ras_record_page_error(unsigned long long addr, unsigned count, time_t time); #endif 07070100000061000081A400000000000000000000000165C04BE40000D361000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-record.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ /* * BuildRequires: sqlite-devel */ #include <string.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <errno.h> #include <sys/stat.h> #include "ras-events.h" #include "ras-mc-handler.h" #include "ras-aer-handler.h" #include "ras-mce-handler.h" #include "ras-logger.h" /* #define DEBUG_SQL 1 */ #define SQLITE_RAS_DB RASSTATEDIR "/" RAS_DB_FNAME /* * Table and functions to handle ras:mc_event */ static const struct db_fields mc_event_fields[] = { { .name = "id", .type = "INTEGER PRIMARY KEY" }, { .name = "timestamp", .type = "TEXT" }, { .name = "err_count", .type = "INTEGER" }, { .name = "err_type", .type = "TEXT" }, { .name = "err_msg", .type = "TEXT" }, { .name = "label", .type = "TEXT" }, { .name = "mc", .type = "INTEGER" }, { .name = "top_layer", .type = "INTEGER" }, { .name = "middle_layer", .type = "INTEGER" }, { .name = "lower_layer", .type = "INTEGER" }, { .name = "address", .type = "INTEGER" }, { .name = "grain", .type = "INTEGER" }, { .name = "syndrome", .type = "INTEGER" }, { .name = "driver_detail", .type = "TEXT" }, }; static const struct db_table_descriptor mc_event_tab = { .name = "mc_event", .fields = mc_event_fields, .num_fields = ARRAY_SIZE(mc_event_fields), }; int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev) { int rc; struct sqlite3_priv *priv = ras->db_priv; if (!priv || !priv->stmt_mc_event) return 0; log(TERM, LOG_INFO, "mc_event store: %p\n", priv->stmt_mc_event); sqlite3_bind_text(priv->stmt_mc_event, 1, ev->timestamp, -1, NULL); sqlite3_bind_int (priv->stmt_mc_event, 2, ev->error_count); sqlite3_bind_text(priv->stmt_mc_event, 3, ev->error_type, -1, NULL); sqlite3_bind_text(priv->stmt_mc_event, 4, ev->msg, -1, NULL); sqlite3_bind_text(priv->stmt_mc_event, 5, ev->label, -1, NULL); 
	sqlite3_bind_int (priv->stmt_mc_event, 6, ev->mc_index);
	sqlite3_bind_int (priv->stmt_mc_event, 7, ev->top_layer);
	sqlite3_bind_int (priv->stmt_mc_event, 8, ev->middle_layer);
	sqlite3_bind_int (priv->stmt_mc_event, 9, ev->lower_layer);
	sqlite3_bind_int64(priv->stmt_mc_event, 10, ev->address);
	sqlite3_bind_int64(priv->stmt_mc_event, 11, ev->grain);
	sqlite3_bind_int64(priv->stmt_mc_event, 12, ev->syndrome);
	sqlite3_bind_text(priv->stmt_mc_event, 13, ev->driver_detail, -1, NULL);

	rc = sqlite3_step(priv->stmt_mc_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do mc_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_mc_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset mc_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle ras:aer
 */

#ifdef HAVE_AER
static const struct db_fields aer_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "dev_name", .type = "TEXT" },
	{ .name = "err_type", .type = "TEXT" },
	{ .name = "err_msg", .type = "TEXT" },
};

static const struct db_table_descriptor aer_event_tab = {
	.name = "aer_event",
	.fields = aer_event_fields,
	.num_fields = ARRAY_SIZE(aer_event_fields),
};

int ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_aer_event)
		return 0;
	log(TERM, LOG_INFO, "aer_event store: %p\n", priv->stmt_aer_event);

	sqlite3_bind_text(priv->stmt_aer_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_aer_event, 2, ev->dev_name, -1, NULL);
	sqlite3_bind_text(priv->stmt_aer_event, 3, ev->error_type, -1, NULL);
	sqlite3_bind_text(priv->stmt_aer_event, 4, ev->msg, -1, NULL);

	rc = sqlite3_step(priv->stmt_aer_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do aer_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_aer_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset aer_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Table and functions to handle ras:non standard
 */

#ifdef HAVE_NON_STANDARD
static const struct db_fields non_standard_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "sec_type", .type = "BLOB" },
	{ .name = "fru_id", .type = "BLOB" },
	{ .name = "fru_text", .type = "TEXT" },
	{ .name = "severity", .type = "TEXT" },
	{ .name = "error", .type = "BLOB" },
};

static const struct db_table_descriptor non_standard_event_tab = {
	.name = "non_standard_event",
	.fields = non_standard_event_fields,
	.num_fields = ARRAY_SIZE(non_standard_event_fields),
};

int ras_store_non_standard_record(struct ras_events *ras, struct ras_non_standard_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_non_standard_record)
		return 0;
	log(TERM, LOG_INFO, "non_standard_event store: %p\n", priv->stmt_non_standard_record);

	sqlite3_bind_text(priv->stmt_non_standard_record, 1, ev->timestamp, -1, NULL);
	/*
	 * sec_type is a raw 16-byte section-type UUID, like fru_id. SQLite
	 * leaves a negative length undefined for sqlite3_bind_blob(), so the
	 * size is passed explicitly.
	 */
	sqlite3_bind_blob(priv->stmt_non_standard_record, 2, ev->sec_type, 16, NULL);
	sqlite3_bind_blob(priv->stmt_non_standard_record, 3, ev->fru_id, 16, NULL);
	sqlite3_bind_text(priv->stmt_non_standard_record, 4, ev->fru_text, -1, NULL);
	sqlite3_bind_text(priv->stmt_non_standard_record, 5, ev->severity, -1, NULL);
	sqlite3_bind_blob(priv->stmt_non_standard_record, 6, ev->error, ev->length, NULL);

	rc = sqlite3_step(priv->stmt_non_standard_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do non_standard_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_non_standard_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset non_standard_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Table and functions to handle ras:arm
 */

#ifdef HAVE_ARM
static const struct db_fields arm_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "error_count", .type = "INTEGER" },
	{ .name = "affinity", .type = "INTEGER" },
	{ .name = "mpidr", .type = "INTEGER" },
	{ .name = "running_state", .type = "INTEGER" },
	{ .name = "psci_state", .type = "INTEGER" },
	{ .name = "err_info", .type = "BLOB" },
	{ .name = "context_info", .type = "BLOB" },
	{ .name = "vendor_info", .type = "BLOB" },
};

static const struct db_table_descriptor arm_event_tab = {
	.name = "arm_event",
	.fields = arm_event_fields,
	.num_fields = ARRAY_SIZE(arm_event_fields),
};

int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_arm_record)
		return 0;
	log(TERM, LOG_INFO, "arm_event store: %p\n", priv->stmt_arm_record);

	sqlite3_bind_text(priv->stmt_arm_record, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_int (priv->stmt_arm_record, 2, ev->error_count);
	sqlite3_bind_int (priv->stmt_arm_record, 3, ev->affinity);
	sqlite3_bind_int64(priv->stmt_arm_record, 4, ev->mpidr);
	sqlite3_bind_int (priv->stmt_arm_record, 5, ev->running_state);
	sqlite3_bind_int (priv->stmt_arm_record, 6, ev->psci_state);
	sqlite3_bind_blob(priv->stmt_arm_record, 7, ev->pei_error, ev->pei_len, NULL);
	sqlite3_bind_blob(priv->stmt_arm_record, 8, ev->ctx_error, ev->ctx_len, NULL);
	sqlite3_bind_blob(priv->stmt_arm_record, 9, ev->vsei_error, ev->oem_len, NULL);

	rc = sqlite3_step(priv->stmt_arm_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do arm_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_arm_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset arm_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

#ifdef HAVE_EXTLOG
static const struct db_fields extlog_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "etype", .type = "INTEGER" },
	{ .name = "error_count", .type = "INTEGER" },
	{ .name = "severity", .type = "INTEGER" },
	{ .name = "address", .type = "INTEGER" },
	{ .name = "fru_id", .type = "BLOB" },
	{ .name = "fru_text", .type = "TEXT" },
	{ .name = "cper_data", .type = "BLOB" },
};

static const struct db_table_descriptor extlog_event_tab = {
	.name = "extlog_event",
	.fields = extlog_event_fields,
	.num_fields = ARRAY_SIZE(extlog_event_fields),
};

int ras_store_extlog_mem_record(struct ras_events *ras, struct ras_extlog_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_extlog_record)
		return 0;
	log(TERM, LOG_INFO, "extlog_record store: %p\n", priv->stmt_extlog_record);

	sqlite3_bind_text(priv->stmt_extlog_record, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_int (priv->stmt_extlog_record, 2, ev->etype);
	sqlite3_bind_int (priv->stmt_extlog_record, 3, ev->error_seq);
	sqlite3_bind_int (priv->stmt_extlog_record, 4, ev->severity);
	sqlite3_bind_int64(priv->stmt_extlog_record, 5, ev->address);
	sqlite3_bind_blob(priv->stmt_extlog_record, 6, ev->fru_id, 16, NULL);
	sqlite3_bind_text(priv->stmt_extlog_record, 7, ev->fru_text, -1, NULL);
	sqlite3_bind_blob(priv->stmt_extlog_record, 8, ev->cper_data, ev->cper_data_length, NULL);

	rc = sqlite3_step(priv->stmt_extlog_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do extlog_mem_record step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_extlog_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset extlog_mem_record on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Table and functions to handle mce:mce_record
 */

#ifdef HAVE_MCE
static const struct db_fields mce_record_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },

	/* MCE registers */
	{ .name = "mcgcap", .type = "INTEGER" },
	{ .name = "mcgstatus", .type = "INTEGER" },
	{ .name = "status", .type = "INTEGER" },
	{ .name = "addr", .type = "INTEGER" }, // 5
	{ .name = "misc", .type = "INTEGER" },
	{ .name = "ip", .type = "INTEGER" },
	{ .name = "tsc", .type = "INTEGER" },
	{ .name = "walltime", .type = "INTEGER" },
	{ .name = "cpu", .type = "INTEGER" }, // 10
	{ .name = "cpuid", .type = "INTEGER" },
	{ .name = "apicid", .type = "INTEGER" },
	{ .name = "socketid", .type = "INTEGER" },
	{ .name = "cs", .type = "INTEGER" },
	{ .name = "bank", .type = "INTEGER" }, // 15
	{ .name = "cpuvendor", .type = "INTEGER" },

	/* Parsed data - will likely change */
	{ .name = "bank_name", .type = "TEXT" },
	{ .name = "error_msg", .type = "TEXT" },
	{ .name = "mcgstatus_msg", .type = "TEXT" },
	{ .name = "mcistatus_msg", .type = "TEXT" }, // 20
	{ .name = "mcastatus_msg", .type = "TEXT" },
	{ .name = "user_action", .type = "TEXT" },
	{ .name = "mc_location", .type = "TEXT" },
};

static const struct db_table_descriptor mce_record_tab = {
	.name = "mce_record",
	.fields = mce_record_fields,
	.num_fields = ARRAY_SIZE(mce_record_fields),
};

int ras_store_mce_record(struct ras_events *ras, struct mce_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_mce_record)
		return 0;
	log(TERM, LOG_INFO, "mce_record store: %p\n", priv->stmt_mce_record);

	sqlite3_bind_text(priv->stmt_mce_record, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_int (priv->stmt_mce_record, 2, ev->mcgcap);
	sqlite3_bind_int (priv->stmt_mce_record, 3, ev->mcgstatus);
	sqlite3_bind_int64(priv->stmt_mce_record, 4, ev->status);
	sqlite3_bind_int64(priv->stmt_mce_record, 5, ev->addr);
	sqlite3_bind_int64(priv->stmt_mce_record, 6, ev->misc);
	sqlite3_bind_int64(priv->stmt_mce_record, 7, ev->ip);
	sqlite3_bind_int64(priv->stmt_mce_record, 8, ev->tsc);
	sqlite3_bind_int64(priv->stmt_mce_record, 9, ev->walltime);
	sqlite3_bind_int (priv->stmt_mce_record, 10, ev->cpu);
	sqlite3_bind_int (priv->stmt_mce_record, 11, ev->cpuid);
	sqlite3_bind_int (priv->stmt_mce_record, 12, ev->apicid);
	sqlite3_bind_int (priv->stmt_mce_record, 13, ev->socketid);
	sqlite3_bind_int (priv->stmt_mce_record, 14, ev->cs);
	sqlite3_bind_int (priv->stmt_mce_record, 15, ev->bank);
	sqlite3_bind_int (priv->stmt_mce_record, 16, ev->cpuvendor);
	sqlite3_bind_text(priv->stmt_mce_record, 17, ev->bank_name, -1, NULL);
	sqlite3_bind_text(priv->stmt_mce_record, 18, ev->error_msg, -1, NULL);
	sqlite3_bind_text(priv->stmt_mce_record, 19, ev->mcgstatus_msg, -1, NULL);
	sqlite3_bind_text(priv->stmt_mce_record, 20, ev->mcistatus_msg, -1, NULL);
	sqlite3_bind_text(priv->stmt_mce_record, 21, ev->mcastatus_msg, -1, NULL);
	sqlite3_bind_text(priv->stmt_mce_record, 22, ev->user_action, -1, NULL);
	sqlite3_bind_text(priv->stmt_mce_record, 23, ev->mc_location, -1, NULL);

	rc = sqlite3_step(priv->stmt_mce_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do mce_record step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_mce_record);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset mce_record on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Table and functions to handle devlink:devlink_health_report
 */

#ifdef HAVE_DEVLINK
static const struct db_fields devlink_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "bus_name", .type = "TEXT" },
	{ .name = "dev_name", .type = "TEXT" },
	{ .name = "driver_name", .type = "TEXT" },
	{ .name = "reporter_name", .type = "TEXT" },
	{ .name = "msg", .type = "TEXT" },
};

static const struct db_table_descriptor devlink_event_tab = {
	.name = "devlink_event",
	.fields = devlink_event_fields,
	.num_fields = ARRAY_SIZE(devlink_event_fields),
};

int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_devlink_event)
		return 0;
	log(TERM, LOG_INFO, "devlink_event store: %p\n", priv->stmt_devlink_event);

	sqlite3_bind_text(priv->stmt_devlink_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_devlink_event, 2, ev->bus_name, -1, NULL);
	sqlite3_bind_text(priv->stmt_devlink_event, 3, ev->dev_name, -1, NULL);
	sqlite3_bind_text(priv->stmt_devlink_event, 4, ev->driver_name, -1, NULL);
	sqlite3_bind_text(priv->stmt_devlink_event, 5, ev->reporter_name, -1, NULL);
	sqlite3_bind_text(priv->stmt_devlink_event, 6, ev->msg, -1, NULL);

	rc = sqlite3_step(priv->stmt_devlink_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do devlink_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_devlink_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset devlink_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Table and functions to handle block:block_rq_{complete|error}
 */

#ifdef HAVE_DISKERROR
static const struct db_fields diskerror_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "dev", .type = "TEXT" },
	{ .name = "sector", .type = "INTEGER" },
	{ .name = "nr_sector", .type = "INTEGER" },
	{ .name = "error", .type = "TEXT" },
	{ .name = "rwbs", .type = "TEXT" },
	{ .name = "cmd", .type = "TEXT" },
};

static const struct db_table_descriptor diskerror_event_tab = {
	.name = "disk_errors",
	.fields = diskerror_event_fields,
	.num_fields = ARRAY_SIZE(diskerror_event_fields),
};

int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_diskerror_event)
		return 0;
	log(TERM, LOG_INFO, "diskerror_event store: %p\n", priv->stmt_diskerror_event);

	sqlite3_bind_text(priv->stmt_diskerror_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_diskerror_event, 2, ev->dev, -1, NULL);
	sqlite3_bind_int64(priv->stmt_diskerror_event, 3, ev->sector);
	sqlite3_bind_int(priv->stmt_diskerror_event, 4, ev->nr_sector);
	sqlite3_bind_text(priv->stmt_diskerror_event, 5, ev->error, -1, NULL);
	sqlite3_bind_text(priv->stmt_diskerror_event, 6, ev->rwbs, -1, NULL);
	sqlite3_bind_text(priv->stmt_diskerror_event, 7, ev->cmd, -1, NULL);

	rc = sqlite3_step(priv->stmt_diskerror_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do diskerror_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_diskerror_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset diskerror_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Table and functions to handle ras:memory_failure
 */

#ifdef HAVE_MEMORY_FAILURE
static const struct db_fields mf_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "pfn", .type = "TEXT" },
	{ .name = "page_type", .type = "TEXT" },
	{ .name = "action_result", .type = "TEXT" },
};

static const struct db_table_descriptor mf_event_tab = {
	.name = "memory_failure_event",
	.fields = mf_event_fields,
	.num_fields = ARRAY_SIZE(mf_event_fields),
};

int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_mf_event)
		return 0;
	log(TERM, LOG_INFO, "memory_failure_event store: %p\n", priv->stmt_mf_event);

	sqlite3_bind_text(priv->stmt_mf_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_mf_event, 2, ev->pfn, -1, NULL);
	sqlite3_bind_text(priv->stmt_mf_event, 3, ev->page_type, -1, NULL);
	sqlite3_bind_text(priv->stmt_mf_event, 4, ev->action_result, -1, NULL);

	rc = sqlite3_step(priv->stmt_mf_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do memory_failure_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_mf_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset memory_failure_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

#ifdef HAVE_CXL
/*
 * Table and functions to handle cxl:cxl_poison
 */
static const struct db_fields cxl_poison_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "trace_type", .type = "TEXT" },
	{ .name = "region", .type = "TEXT" },
	{ .name = "region_uuid", .type = "TEXT" },
	{ .name = "hpa", .type = "INTEGER" },
	{ .name = "dpa", .type = "INTEGER" },
	{ .name = "dpa_length", .type = "INTEGER" },
	{ .name = "source", .type = "TEXT" },
	{ .name = "flags", .type = "INTEGER" },
	{ .name = "overflow_ts", .type = "TEXT" },
};

static const struct db_table_descriptor cxl_poison_event_tab = {
	.name = "cxl_poison_event",
	.fields = cxl_poison_event_fields,
	.num_fields = ARRAY_SIZE(cxl_poison_event_fields),
};

int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_poison_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_poison_event store: %p\n", priv->stmt_cxl_poison_event);

	sqlite3_bind_text(priv->stmt_cxl_poison_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 2, ev->memdev, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 3, ev->host, -1, NULL);
	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 4, ev->serial);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 5, ev->trace_type, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 6, ev->region, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 7, ev->uuid, -1, NULL);
	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 8, ev->hpa);
	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 9, ev->dpa);
	sqlite3_bind_int(priv->stmt_cxl_poison_event, 10, ev->dpa_length);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 11, ev->source, -1, NULL);
	sqlite3_bind_int(priv->stmt_cxl_poison_event, 12, ev->flags);
	sqlite3_bind_text(priv->stmt_cxl_poison_event, 13, ev->overflow_ts, -1, NULL);

	rc = sqlite3_step(priv->stmt_cxl_poison_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do cxl_poison_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_poison_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset cxl_poison_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle cxl:cxl_aer_uncorrectable_error
 */
static const struct db_fields cxl_aer_ue_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "error_status", .type = "INTEGER" },
	{ .name = "first_error", .type = "INTEGER" },
	{ .name = "header_log", .type = "BLOB" },
};

static const struct db_table_descriptor cxl_aer_ue_event_tab = {
	.name = "cxl_aer_ue_event",
	.fields = cxl_aer_ue_event_fields,
	.num_fields = ARRAY_SIZE(cxl_aer_ue_event_fields),
};

int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_aer_ue_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_aer_ue_event store: %p\n", priv->stmt_cxl_aer_ue_event);

	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 2, ev->memdev, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 3, ev->host, -1, NULL);
	sqlite3_bind_int64(priv->stmt_cxl_aer_ue_event, 4, ev->serial);
	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 5, ev->error_status);
	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 6, ev->first_error);
	sqlite3_bind_blob(priv->stmt_cxl_aer_ue_event, 7, ev->header_log, CXL_HEADERLOG_SIZE, NULL);

	rc = sqlite3_step(priv->stmt_cxl_aer_ue_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do cxl_aer_ue_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_aer_ue_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset cxl_aer_ue_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle cxl:cxl_aer_correctable_error
 */
static const struct db_fields cxl_aer_ce_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "error_status", .type = "INTEGER" },
};

static const struct db_table_descriptor cxl_aer_ce_event_tab = {
	.name = "cxl_aer_ce_event",
	.fields = cxl_aer_ce_event_fields,
	.num_fields = ARRAY_SIZE(cxl_aer_ce_event_fields),
};

int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_aer_ce_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_aer_ce_event store: %p\n", priv->stmt_cxl_aer_ce_event);

	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 2, ev->memdev, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 3, ev->host, -1, NULL);
	sqlite3_bind_int64(priv->stmt_cxl_aer_ce_event, 4, ev->serial);
	sqlite3_bind_int(priv->stmt_cxl_aer_ce_event, 5, ev->error_status);

	rc = sqlite3_step(priv->stmt_cxl_aer_ce_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do cxl_aer_ce_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_aer_ce_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset cxl_aer_ce_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle cxl:cxl_overflow
 */
static const struct db_fields cxl_overflow_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "log_type", .type = "TEXT" },
	{ .name = "count", .type = "INTEGER" },
	{ .name = "first_ts", .type = "TEXT" },
	{ .name = "last_ts", .type = "TEXT" },
};

static const struct db_table_descriptor cxl_overflow_event_tab = {
	.name = "cxl_overflow_event",
	.fields = cxl_overflow_event_fields,
	.num_fields = ARRAY_SIZE(cxl_overflow_event_fields),
};

int ras_store_cxl_overflow_event(struct ras_events *ras, struct ras_cxl_overflow_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_overflow_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_overflow_event store: %p\n", priv->stmt_cxl_overflow_event);

	sqlite3_bind_text(priv->stmt_cxl_overflow_event, 1, ev->timestamp, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_overflow_event, 2, ev->memdev, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_overflow_event, 3, ev->host, -1, NULL);
	sqlite3_bind_int64(priv->stmt_cxl_overflow_event, 4, ev->serial);
	sqlite3_bind_text(priv->stmt_cxl_overflow_event, 5, ev->log_type, -1, NULL);
	sqlite3_bind_int(priv->stmt_cxl_overflow_event, 6, ev->count);
	sqlite3_bind_text(priv->stmt_cxl_overflow_event, 7, ev->first_ts, -1, NULL);
	sqlite3_bind_text(priv->stmt_cxl_overflow_event, 8, ev->last_ts, -1, NULL);

	rc = sqlite3_step(priv->stmt_cxl_overflow_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do cxl_overflow_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_overflow_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset cxl_overflow_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

static int ras_store_cxl_common_hdr(sqlite3_stmt *stmt, struct ras_cxl_event_common_hdr *hdr)
{
	if (!stmt || !hdr)
		return 0;

	sqlite3_bind_text(stmt, 1, hdr->timestamp, -1, NULL);
	sqlite3_bind_text(stmt, 2, hdr->memdev, -1, NULL);
	sqlite3_bind_text(stmt, 3, hdr->host, -1, NULL);
	sqlite3_bind_int64(stmt, 4, hdr->serial);
	sqlite3_bind_text(stmt, 5, hdr->log_type, -1, NULL);
	sqlite3_bind_text(stmt, 6, hdr->hdr_uuid, -1, NULL);
	sqlite3_bind_int(stmt, 7, hdr->hdr_flags);
	sqlite3_bind_int(stmt, 8, hdr->hdr_handle);
	sqlite3_bind_int(stmt, 9, hdr->hdr_related_handle);
	sqlite3_bind_text(stmt, 10, hdr->hdr_timestamp, -1, NULL);
	sqlite3_bind_int(stmt, 11, hdr->hdr_length);
	sqlite3_bind_int(stmt, 12, hdr->hdr_maint_op_class);

	return 0;
}

/*
 * Table and functions to handle cxl:cxl_generic_event
 */
static const struct db_fields cxl_generic_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "log_type", .type = "TEXT" },
	{ .name = "hdr_uuid", .type = "TEXT" },
	{ .name = "hdr_flags", .type = "INTEGER" },
	{ .name = "hdr_handle", .type = "INTEGER" },
	{ .name = "hdr_related_handle", .type = "INTEGER" },
	{ .name = "hdr_ts", .type = "TEXT" },
	{ .name = "hdr_length", .type = "INTEGER" },
	{ .name = "hdr_maint_op_class", .type = "INTEGER" },
	{ .name = "data", .type = "BLOB" },
};

static const struct db_table_descriptor cxl_generic_event_tab = {
	.name = "cxl_generic_event",
	.fields = cxl_generic_event_fields,
	.num_fields = ARRAY_SIZE(cxl_generic_event_fields),
};

int ras_store_cxl_generic_event(struct ras_events *ras, struct ras_cxl_generic_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_generic_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_generic_event store: %p\n", priv->stmt_cxl_generic_event);

	ras_store_cxl_common_hdr(priv->stmt_cxl_generic_event, &ev->hdr);
	sqlite3_bind_blob(priv->stmt_cxl_generic_event, 13, ev->data, CXL_EVENT_RECORD_DATA_LENGTH, NULL);

	rc = sqlite3_step(priv->stmt_cxl_generic_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do stmt_cxl_generic_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_generic_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset stmt_cxl_generic_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle cxl:cxl_general_media_event
 */
static const struct db_fields cxl_general_media_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "log_type", .type = "TEXT" },
	{ .name = "hdr_uuid", .type = "TEXT" },
	{ .name = "hdr_flags", .type = "INTEGER" },
	{ .name = "hdr_handle", .type = "INTEGER" },
	{ .name = "hdr_related_handle", .type = "INTEGER" },
	{ .name = "hdr_ts", .type = "TEXT" },
	{ .name = "hdr_length", .type = "INTEGER" },
	{ .name = "hdr_maint_op_class", .type = "INTEGER" },
	{ .name = "dpa", .type = "INTEGER" },
	{ .name = "dpa_flags", .type = "INTEGER" },
	{ .name = "descriptor", .type = "INTEGER" },
	{ .name = "type", .type = "INTEGER" },
	{ .name = "transaction_type", .type = "INTEGER" },
	{ .name = "channel", .type = "INTEGER" },
	{ .name = "rank", .type = "INTEGER" },
	{ .name = "device", .type = "INTEGER" },
	{ .name = "comp_id", .type = "BLOB" },
};

static const struct db_table_descriptor cxl_general_media_event_tab = {
	.name = "cxl_general_media_event",
	.fields = cxl_general_media_event_fields,
	.num_fields = ARRAY_SIZE(cxl_general_media_event_fields),
};

int ras_store_cxl_general_media_event(struct ras_events *ras,
				      struct ras_cxl_general_media_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_general_media_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_general_media_event store: %p\n",
	    priv->stmt_cxl_general_media_event);

	ras_store_cxl_common_hdr(priv->stmt_cxl_general_media_event, &ev->hdr);
	sqlite3_bind_int64(priv->stmt_cxl_general_media_event, 13, ev->dpa);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 14, ev->dpa_flags);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 15, ev->descriptor);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 16, ev->type);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 17, ev->transaction_type);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 18, ev->channel);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 19, ev->rank);
	sqlite3_bind_int(priv->stmt_cxl_general_media_event, 20, ev->device);
	sqlite3_bind_blob(priv->stmt_cxl_general_media_event, 21, ev->comp_id,
			  CXL_EVENT_GEN_MED_COMP_ID_SIZE, NULL);

	rc = sqlite3_step(priv->stmt_cxl_general_media_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do stmt_cxl_general_media_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_general_media_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset stmt_cxl_general_media_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle cxl:cxl_dram_event
 */
static const struct db_fields cxl_dram_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "log_type", .type = "TEXT" },
	{ .name = "hdr_uuid", .type = "TEXT" },
	{ .name = "hdr_flags", .type = "INTEGER" },
	{ .name = "hdr_handle", .type = "INTEGER" },
	{ .name = "hdr_related_handle", .type = "INTEGER" },
	{ .name = "hdr_ts", .type = "TEXT" },
	{ .name = "hdr_length", .type = "INTEGER" },
	{ .name = "hdr_maint_op_class", .type = "INTEGER" },
	{ .name = "dpa", .type = "INTEGER" },
	{ .name = "dpa_flags", .type = "INTEGER" },
	{ .name = "descriptor", .type = "INTEGER" },
	{ .name = "type", .type = "INTEGER" },
	{ .name = "transaction_type", .type = "INTEGER" },
	{ .name = "channel", .type = "INTEGER" },
	{ .name = "rank", .type = "INTEGER" },
	{ .name = "nibble_mask", .type = "INTEGER" },
	{ .name = "bank_group", .type = "INTEGER" },
	{ .name = "bank", .type = "INTEGER" },
	{ .name = "row", .type = "INTEGER" },
	{ .name = "column", .type = "INTEGER" },
	{ .name = "cor_mask", .type = "BLOB" },
};

static const struct db_table_descriptor cxl_dram_event_tab = {
	.name = "cxl_dram_event",
	.fields = cxl_dram_event_fields,
	.num_fields = ARRAY_SIZE(cxl_dram_event_fields),
};

int ras_store_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_dram_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_dram_event store: %p\n", priv->stmt_cxl_dram_event);

	ras_store_cxl_common_hdr(priv->stmt_cxl_dram_event, &ev->hdr);
	sqlite3_bind_int64(priv->stmt_cxl_dram_event, 13, ev->dpa);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 14, ev->dpa_flags);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 15, ev->descriptor);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 16, ev->type);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 17, ev->transaction_type);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 18, ev->channel);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 19, ev->rank);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 20, ev->nibble_mask);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 21, ev->bank_group);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 22, ev->bank);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 23, ev->row);
	sqlite3_bind_int(priv->stmt_cxl_dram_event, 24, ev->column);
	sqlite3_bind_blob(priv->stmt_cxl_dram_event, 25, ev->cor_mask,
			  CXL_EVENT_DER_CORRECTION_MASK_SIZE, NULL);

	rc = sqlite3_step(priv->stmt_cxl_dram_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do stmt_cxl_dram_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_dram_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset stmt_cxl_dram_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}

/*
 * Table and functions to handle cxl:cxl_memory_module_event
 */
static const struct db_fields cxl_memory_module_event_fields[] = {
	{ .name = "id", .type = "INTEGER PRIMARY KEY" },
	{ .name = "timestamp", .type = "TEXT" },
	{ .name = "memdev", .type = "TEXT" },
	{ .name = "host", .type = "TEXT" },
	{ .name = "serial", .type = "INTEGER" },
	{ .name = "log_type", .type = "TEXT" },
	{ .name = "hdr_uuid", .type = "TEXT" },
	{ .name = "hdr_flags", .type = "INTEGER" },
	{ .name = "hdr_handle", .type = "INTEGER" },
	{ .name = "hdr_related_handle", .type = "INTEGER" },
	{ .name = "hdr_ts", .type = "TEXT" },
	{ .name = "hdr_length", .type = "INTEGER" },
	{ .name = "hdr_maint_op_class", .type = "INTEGER" },
	{ .name = "event_type", .type = "INTEGER" },
	{ .name = "health_status", .type = "INTEGER" },
	{ .name = "media_status", .type = "INTEGER" },
	{ .name = "life_used", .type = "INTEGER" },
	{ .name = "dirty_shutdown_cnt", .type = "INTEGER" },
	{ .name = "cor_vol_err_cnt", .type = "INTEGER" },
	{ .name = "cor_per_err_cnt", .type = "INTEGER" },
	{ .name = "device_temp", .type = "INTEGER" },
	{ .name = "add_status", .type = "INTEGER" },
};

static const struct db_table_descriptor cxl_memory_module_event_tab = {
	.name = "cxl_memory_module_event",
	.fields = cxl_memory_module_event_fields,
	.num_fields = ARRAY_SIZE(cxl_memory_module_event_fields),
};

int ras_store_cxl_memory_module_event(struct ras_events *ras,
				      struct ras_cxl_memory_module_event *ev)
{
	int rc;
	struct sqlite3_priv *priv = ras->db_priv;

	if (!priv || !priv->stmt_cxl_memory_module_event)
		return 0;
	log(TERM, LOG_INFO, "cxl_memory_module_event store: %p\n",
	    priv->stmt_cxl_memory_module_event);

	ras_store_cxl_common_hdr(priv->stmt_cxl_memory_module_event, &ev->hdr);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 13, ev->event_type);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 14, ev->health_status);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 15, ev->media_status);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 16, ev->life_used);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 17, ev->dirty_shutdown_cnt);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 18, ev->cor_vol_err_cnt);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 19, ev->cor_per_err_cnt);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 20, ev->device_temp);
	sqlite3_bind_int(priv->stmt_cxl_memory_module_event, 21, ev->add_status);

	rc = sqlite3_step(priv->stmt_cxl_memory_module_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed to do stmt_cxl_memory_module_event step on sqlite: error = %d\n", rc);
	rc = sqlite3_reset(priv->stmt_cxl_memory_module_event);
	if (rc != SQLITE_OK && rc != SQLITE_DONE)
		log(TERM, LOG_ERR, "Failed reset stmt_cxl_memory_module_event on sqlite: error = %d\n", rc);
	log(TERM, LOG_INFO, "register inserted at db\n");

	return rc;
}
#endif

/*
 * Generic code
 */

static int __ras_mc_prepare_stmt(struct sqlite3_priv *priv,
				 sqlite3_stmt **stmt,
				 const struct db_table_descriptor *db_tab)
{
	int i, rc;
	char sql[1024], *p = sql, *end = sql + sizeof(sql);
	const struct db_fields *field;

	p += snprintf(p, end - p, "INSERT INTO %s (", db_tab->name);

	for (i = 0; i < db_tab->num_fields; i++) {
		field = &db_tab->fields[i];
		p += snprintf(p, end - p, "%s", field->name);

		if (i < db_tab->num_fields - 1)
			p += snprintf(p, end - p, ", ");
	}

	p += snprintf(p, end - p, ") VALUES ( NULL, ");

	for (i =
1; i < db_tab->num_fields; i++) { if (i < db_tab->num_fields - 1) strcat(sql, "?, "); else strcat(sql, "?)"); } #ifdef DEBUG_SQL log(TERM, LOG_INFO, "SQL: %s\n", sql); #endif rc = sqlite3_prepare_v2(priv->db, sql, -1, stmt, NULL); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "Failed to prepare insert db at table %s (db %s): error = %s\n", db_tab->name, SQLITE_RAS_DB, sqlite3_errmsg(priv->db)); *stmt = NULL; /* clear the caller's pointer, not the local copy */ } else { log(TERM, LOG_INFO, "Recording %s events\n", db_tab->name); } return rc; } static int ras_mc_create_table(struct sqlite3_priv *priv, const struct db_table_descriptor *db_tab) { const struct db_fields *field; char sql[1024], *p = sql, *end = sql + sizeof(sql); int i, rc; p += snprintf(p, end - p, "CREATE TABLE IF NOT EXISTS %s (", db_tab->name); for (i = 0; i < db_tab->num_fields; i++) { field = &db_tab->fields[i]; p += snprintf(p, end - p, "%s %s", field->name, field->type); if (i < db_tab->num_fields - 1) p += snprintf(p, end - p, ", "); } p += snprintf(p, end - p, ")"); #ifdef DEBUG_SQL log(TERM, LOG_INFO, "SQL: %s\n", sql); #endif rc = sqlite3_exec(priv->db, sql, NULL, NULL, NULL); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "Failed to create table %s on %s: error = %d\n", db_tab->name, SQLITE_RAS_DB, rc); } return rc; } static int ras_mc_alter_table(struct sqlite3_priv *priv, sqlite3_stmt **stmt, const struct db_table_descriptor *db_tab) { char sql[1024], *p = sql, *end = sql + sizeof(sql); const struct db_fields *field; int col_count; int i, j, rc, found; snprintf(p, end - p, "SELECT * FROM %s", db_tab->name); rc = sqlite3_prepare_v2(priv->db, sql, -1, stmt, NULL); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "Failed to query fields from the table %s on %s: error = %d\n", db_tab->name, SQLITE_RAS_DB, rc); return rc; } col_count = sqlite3_column_count(*stmt); for (i = 0; i < db_tab->num_fields; i++) { field = &db_tab->fields[i]; found = 0; for (j = 0; j < col_count; j++) { if (!strcmp(field->name, sqlite3_column_name(*stmt, j))) { found = 1; break; } } if 
(!found) { /* add new field */ p += snprintf(p, end - p, "ALTER TABLE %s ADD ", db_tab->name); p += snprintf(p, end - p, "%s %s", field->name, field->type); #ifdef DEBUG_SQL log(TERM, LOG_INFO, "SQL: %s\n", sql); #endif rc = sqlite3_exec(priv->db, sql, NULL, NULL, NULL); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "Failed to add new field %s to the table %s on %s: error = %d\n", field->name, db_tab->name, SQLITE_RAS_DB, rc); sqlite3_finalize(*stmt); *stmt = NULL; return rc; } p = sql; memset(sql, 0, sizeof(sql)); } } /* the SELECT was only used to read column names; finalize it so it does not leak */ sqlite3_finalize(*stmt); *stmt = NULL; return rc; } static int ras_mc_prepare_stmt(struct sqlite3_priv *priv, sqlite3_stmt **stmt, const struct db_table_descriptor *db_tab) { int rc; rc = __ras_mc_prepare_stmt(priv, stmt, db_tab); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "Failed to prepare insert db at table %s (db %s): error = %s\n", db_tab->name, SQLITE_RAS_DB, sqlite3_errmsg(priv->db)); log(TERM, LOG_INFO, "Trying to alter db at table %s (db %s)\n", db_tab->name, SQLITE_RAS_DB); rc = ras_mc_alter_table(priv, stmt, db_tab); if (rc != SQLITE_OK && rc != SQLITE_DONE) { log(TERM, LOG_ERR, "Failed to alter db at table %s (db %s): error = %s\n", db_tab->name, SQLITE_RAS_DB, sqlite3_errmsg(priv->db)); *stmt = NULL; return rc; } rc = __ras_mc_prepare_stmt(priv, stmt, db_tab); } return rc; } int ras_mc_add_vendor_table(struct ras_events *ras, sqlite3_stmt **stmt, const struct db_table_descriptor *db_tab) { int rc; struct sqlite3_priv *priv = ras->db_priv; if (!priv) return -1; rc = ras_mc_create_table(priv, db_tab); if (rc == SQLITE_OK) rc = ras_mc_prepare_stmt(priv, stmt, db_tab); return rc; } int ras_mc_finalize_vendor_table(sqlite3_stmt *stmt) { int rc; rc = sqlite3_finalize(stmt); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "Failed to finalize sqlite: error = %d\n", rc); return rc; } int ras_mc_event_opendb(unsigned int cpu, struct ras_events *ras) { int rc; sqlite3 *db; struct sqlite3_priv *priv; printf("Calling %s()\n", __func__); ras->db_ref_count++; if (ras->db_ref_count > 1) return 0; ras->db_priv = NULL; priv = calloc(1, 
sizeof(*priv)); if (!priv) return -1; struct stat st = {0}; if (stat(RASSTATEDIR, &st) == -1) { if (errno != ENOENT) { log(TERM, LOG_ERR, "Failed to read state directory " RASSTATEDIR); goto error; } if (mkdir(RASSTATEDIR, 0700) == -1) { log(TERM, LOG_ERR, "Failed to create state directory " RASSTATEDIR); goto error; } } rc = sqlite3_initialize(); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "cpu %u: Failed to initialize sqlite: error = %d\n", cpu, rc); goto error; } do { rc = sqlite3_open_v2(SQLITE_RAS_DB, &db, SQLITE_OPEN_FULLMUTEX | SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE, NULL); if (rc == SQLITE_BUSY) usleep(10000); } while (rc == SQLITE_BUSY); if (rc != SQLITE_OK) { log(TERM, LOG_ERR, "cpu %u: Failed to connect to %s: error = %d\n", cpu, SQLITE_RAS_DB, rc); goto error; } priv->db = db; rc = ras_mc_create_table(priv, &mc_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_mc_event, &mc_event_tab); if (rc != SQLITE_OK) goto error; } #ifdef HAVE_AER rc = ras_mc_create_table(priv, &aer_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_aer_event, &aer_event_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_EXTLOG rc = ras_mc_create_table(priv, &extlog_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_extlog_record, &extlog_event_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_MCE rc = ras_mc_create_table(priv, &mce_record_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_mce_record, &mce_record_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_NON_STANDARD rc = ras_mc_create_table(priv, &non_standard_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_non_standard_record, &non_standard_event_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_ARM rc = ras_mc_create_table(priv, &arm_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_arm_record, &arm_event_tab); if (rc != 
SQLITE_OK) goto error; } #endif #ifdef HAVE_DEVLINK rc = ras_mc_create_table(priv, &devlink_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_devlink_event, &devlink_event_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_DISKERROR rc = ras_mc_create_table(priv, &diskerror_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_diskerror_event, &diskerror_event_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_MEMORY_FAILURE rc = ras_mc_create_table(priv, &mf_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_mf_event, &mf_event_tab); if (rc != SQLITE_OK) goto error; } #endif #ifdef HAVE_CXL rc = ras_mc_create_table(priv, &cxl_poison_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_poison_event, &cxl_poison_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_aer_ue_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ue_event, &cxl_aer_ue_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_aer_ce_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ce_event, &cxl_aer_ce_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_overflow_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_overflow_event, &cxl_overflow_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_generic_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_generic_event, &cxl_generic_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_general_media_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_general_media_event, &cxl_general_media_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_dram_event_tab); if (rc == SQLITE_OK) { rc = 
ras_mc_prepare_stmt(priv, &priv->stmt_cxl_dram_event, &cxl_dram_event_tab); if (rc != SQLITE_OK) goto error; } rc = ras_mc_create_table(priv, &cxl_memory_module_event_tab); if (rc == SQLITE_OK) { rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_memory_module_event, &cxl_memory_module_event_tab); if (rc != SQLITE_OK) goto error; } #endif ras->db_priv = priv; return 0; error: /* release the connection opened above, if any, before dropping priv */ if (priv->db) sqlite3_close_v2(priv->db); free(priv); return -1; } int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras) { int rc; sqlite3 *db; struct sqlite3_priv *priv = ras->db_priv; printf("Calling %s()\n", __func__); if (ras->db_ref_count > 0) ras->db_ref_count--; else return -1; if (ras->db_ref_count > 0) return 0; if (!priv) return -1; db = priv->db; if (!db) return -1; if (priv->stmt_mc_event) { rc = sqlite3_finalize(priv->stmt_mc_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize mc_event sqlite: error = %d\n", cpu, rc); } #ifdef HAVE_AER if (priv->stmt_aer_event) { rc = sqlite3_finalize(priv->stmt_aer_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize aer_event sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_EXTLOG if (priv->stmt_extlog_record) { rc = sqlite3_finalize(priv->stmt_extlog_record); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize extlog_record sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_MCE if (priv->stmt_mce_record) { rc = sqlite3_finalize(priv->stmt_mce_record); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize mce_record sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_NON_STANDARD if (priv->stmt_non_standard_record) { rc = sqlite3_finalize(priv->stmt_non_standard_record); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize non_standard_record sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_ARM if (priv->stmt_arm_record) { rc = sqlite3_finalize(priv->stmt_arm_record); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize arm_record sqlite: error = %d\n", cpu, 
rc); } #endif #ifdef HAVE_DEVLINK if (priv->stmt_devlink_event) { rc = sqlite3_finalize(priv->stmt_devlink_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize devlink_event sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_DISKERROR if (priv->stmt_diskerror_event) { rc = sqlite3_finalize(priv->stmt_diskerror_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize diskerror_event sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_MEMORY_FAILURE if (priv->stmt_mf_event) { rc = sqlite3_finalize(priv->stmt_mf_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize mf_event sqlite: error = %d\n", cpu, rc); } #endif #ifdef HAVE_CXL if (priv->stmt_cxl_poison_event) { rc = sqlite3_finalize(priv->stmt_cxl_poison_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n", cpu, rc); } if (priv->stmt_cxl_aer_ue_event) { rc = sqlite3_finalize(priv->stmt_cxl_aer_ue_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n", cpu, rc); } if (priv->stmt_cxl_aer_ce_event) { rc = sqlite3_finalize(priv->stmt_cxl_aer_ce_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_aer_ce_event sqlite: error = %d\n", cpu, rc); } if (priv->stmt_cxl_overflow_event) { rc = sqlite3_finalize(priv->stmt_cxl_overflow_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_overflow_event sqlite: error = %d\n", cpu, rc); } if (priv->stmt_cxl_generic_event) { rc = sqlite3_finalize(priv->stmt_cxl_generic_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_generic_event sqlite: error = %d\n", cpu, rc); } if (priv->stmt_cxl_general_media_event) { rc = sqlite3_finalize(priv->stmt_cxl_general_media_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_general_media_event sqlite: error = %d\n", cpu, rc); } if 
(priv->stmt_cxl_dram_event) { rc = sqlite3_finalize(priv->stmt_cxl_dram_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize cxl_dram_event sqlite: error = %d\n", cpu, rc); } if (priv->stmt_cxl_memory_module_event) { rc = sqlite3_finalize(priv->stmt_cxl_memory_module_event); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to finalize stmt_cxl_memory_module_event sqlite: error = %d\n", cpu, rc); } #endif rc = sqlite3_close_v2(db); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to close sqlite: error = %d\n", cpu, rc); rc = sqlite3_shutdown(); if (rc != SQLITE_OK) log(TERM, LOG_ERR, "cpu %u: Failed to shutdown sqlite: error = %d\n", cpu, rc); free(priv); ras->db_priv = NULL; return 0; } 07070100000062000081A400000000000000000000000165C04BE400002A17000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-record.h/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #ifndef __RAS_RECORD_H #define __RAS_RECORD_H #include <stdint.h> #include <stdbool.h> #include "config.h" #define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x))) #define BIT(nr) (1UL << (nr)) #define BIT_ULL(nr) (1ULL << (nr)) extern long user_hz; struct ras_events; struct ras_mc_event { char timestamp[64]; int error_count; const char *error_type, *msg, *label; unsigned char mc_index; signed char top_layer, middle_layer, lower_layer; unsigned long long address, grain, syndrome; const char *driver_detail; }; struct ras_mc_offline_event { unsigned int family, model; bool smca; uint8_t bank; uint64_t ipid; uint64_t synd; uint64_t status; }; struct ras_aer_event { char timestamp[64]; const char *error_type; const char *dev_name; uint8_t tlp_header_valid; uint32_t *tlp_header; const char *msg; }; struct ras_extlog_event { char timestamp[64]; int32_t error_seq; int8_t etype; int8_t severity; unsigned long long address; int8_t pa_mask_lsb; const char *fru_id; const char *fru_text; const char *cper_data; unsigned short cper_data_length; }; struct ras_non_standard_event { char timestamp[64]; const char *sec_type, *fru_id, *fru_text; const char *severity; const uint8_t *error; uint32_t length; }; struct ras_arm_event { char timestamp[64]; int32_t error_count; int8_t affinity; int64_t mpidr; int64_t midr; int32_t running_state; int32_t psci_state; const uint8_t *pei_error; uint32_t pei_len; const uint8_t *ctx_error; uint32_t ctx_len; const uint8_t *vsei_error; uint32_t oem_len; }; struct devlink_event { char timestamp[64]; const char *bus_name; const char *dev_name; const char *driver_name; const char *reporter_name; char *msg; }; struct diskerror_event { char timestamp[64]; char *dev; unsigned long long sector; unsigned int nr_sector; const char *error; const char *rwbs; 
const char *cmd; }; struct ras_mf_event { char timestamp[64]; char pfn[30]; const char *page_type; const char *action_result; }; struct ras_cxl_poison_event { char timestamp[64]; const char *memdev; const char *host; uint64_t serial; const char *trace_type; const char *region; const char *uuid; uint64_t hpa; uint64_t dpa; uint32_t dpa_length; const char *source; uint8_t flags; char overflow_ts[64]; }; #define SZ_512 0x200 #define CXL_HEADERLOG_SIZE SZ_512 #define CXL_HEADERLOG_SIZE_U32 (SZ_512 / sizeof(uint32_t)) #define CXL_EVENT_RECORD_DATA_LENGTH 0x50 #define CXL_EVENT_GEN_MED_COMP_ID_SIZE 0x10 #define CXL_EVENT_DER_CORRECTION_MASK_SIZE 0x20 struct ras_cxl_aer_ue_event { char timestamp[64]; const char *memdev; const char *host; uint64_t serial; uint32_t error_status; uint32_t first_error; uint32_t *header_log; }; struct ras_cxl_aer_ce_event { char timestamp[64]; const char *memdev; const char *host; uint64_t serial; uint32_t error_status; }; struct ras_cxl_overflow_event { char timestamp[64]; const char *memdev; const char *host; uint64_t serial; const char *log_type; char first_ts[64]; char last_ts[64]; uint16_t count; }; struct ras_cxl_event_common_hdr { char timestamp[64]; const char *memdev; const char *host; uint64_t serial; const char *log_type; const char *hdr_uuid; uint32_t hdr_flags; uint16_t hdr_handle; uint16_t hdr_related_handle; char hdr_timestamp[64]; uint8_t hdr_length; uint8_t hdr_maint_op_class; }; struct ras_cxl_generic_event { struct ras_cxl_event_common_hdr hdr; uint8_t *data; }; struct ras_cxl_general_media_event { struct ras_cxl_event_common_hdr hdr; uint64_t dpa; uint8_t dpa_flags; uint8_t descriptor; uint8_t type; uint8_t transaction_type; uint8_t channel; uint8_t rank; uint32_t device; uint8_t *comp_id; uint16_t validity_flags; }; struct ras_cxl_dram_event { struct ras_cxl_event_common_hdr hdr; uint64_t dpa; uint8_t dpa_flags; uint8_t descriptor; uint8_t type; uint8_t transaction_type; uint8_t channel; uint8_t rank; uint32_t nibble_mask; 
uint8_t bank_group; uint8_t bank; uint32_t row; uint16_t column; uint8_t *cor_mask; uint16_t validity_flags; }; struct ras_cxl_memory_module_event { struct ras_cxl_event_common_hdr hdr; uint8_t event_type; uint8_t health_status; uint8_t media_status; uint8_t life_used; uint32_t dirty_shutdown_cnt; uint32_t cor_vol_err_cnt; uint32_t cor_per_err_cnt; int16_t device_temp; uint8_t add_status; }; struct ras_mc_event; struct ras_aer_event; struct ras_extlog_event; struct ras_non_standard_event; struct ras_arm_event; struct mce_event; struct devlink_event; struct diskerror_event; struct ras_mf_event; struct ras_cxl_poison_event; struct ras_cxl_aer_ue_event; struct ras_cxl_aer_ce_event; struct ras_cxl_overflow_event; struct ras_cxl_generic_event; struct ras_cxl_general_media_event; struct ras_cxl_dram_event; struct ras_cxl_memory_module_event; #ifdef HAVE_SQLITE3 #include <sqlite3.h> struct sqlite3_priv { sqlite3 *db; sqlite3_stmt *stmt_mc_event; #ifdef HAVE_AER sqlite3_stmt *stmt_aer_event; #endif #ifdef HAVE_MCE sqlite3_stmt *stmt_mce_record; #endif #ifdef HAVE_EXTLOG sqlite3_stmt *stmt_extlog_record; #endif #ifdef HAVE_NON_STANDARD sqlite3_stmt *stmt_non_standard_record; #endif #ifdef HAVE_ARM sqlite3_stmt *stmt_arm_record; #endif #ifdef HAVE_DEVLINK sqlite3_stmt *stmt_devlink_event; #endif #ifdef HAVE_DISKERROR sqlite3_stmt *stmt_diskerror_event; #endif #ifdef HAVE_MEMORY_FAILURE sqlite3_stmt *stmt_mf_event; #endif #ifdef HAVE_CXL sqlite3_stmt *stmt_cxl_poison_event; sqlite3_stmt *stmt_cxl_aer_ue_event; sqlite3_stmt *stmt_cxl_aer_ce_event; sqlite3_stmt *stmt_cxl_overflow_event; sqlite3_stmt *stmt_cxl_generic_event; sqlite3_stmt *stmt_cxl_general_media_event; sqlite3_stmt *stmt_cxl_dram_event; sqlite3_stmt *stmt_cxl_memory_module_event; #endif }; struct db_fields { char *name; char *type; }; struct db_table_descriptor { char *name; const struct db_fields *fields; size_t num_fields; }; int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras); int 
ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras); int ras_mc_add_vendor_table(struct ras_events *ras, sqlite3_stmt **stmt, const struct db_table_descriptor *db_tab); int ras_mc_finalize_vendor_table(sqlite3_stmt *stmt); int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev); int ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev); int ras_store_mce_record(struct ras_events *ras, struct mce_event *ev); int ras_store_extlog_mem_record(struct ras_events *ras, struct ras_extlog_event *ev); int ras_store_non_standard_record(struct ras_events *ras, struct ras_non_standard_event *ev); int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev); int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev); int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev); int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev); int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev); int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev); int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev); int ras_store_cxl_overflow_event(struct ras_events *ras, struct ras_cxl_overflow_event *ev); int ras_store_cxl_generic_event(struct ras_events *ras, struct ras_cxl_generic_event *ev); int ras_store_cxl_general_media_event(struct ras_events *ras, struct ras_cxl_general_media_event *ev); int ras_store_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *ev); int ras_store_cxl_memory_module_event(struct ras_events *ras, struct ras_cxl_memory_module_event *ev); #else static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; }; static inline int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras) { return 0; }; static inline int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev) { return 0; }; static inline int 
ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev) { return 0; }; static inline int ras_store_mce_record(struct ras_events *ras, struct mce_event *ev) { return 0; }; static inline int ras_store_extlog_mem_record(struct ras_events *ras, struct ras_extlog_event *ev) { return 0; }; static inline int ras_store_non_standard_record(struct ras_events *ras, struct ras_non_standard_event *ev) { return 0; }; static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev) { return 0; }; static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; }; static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; }; static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; }; static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; }; static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; }; static inline int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; }; static inline int ras_store_cxl_overflow_event(struct ras_events *ras, struct ras_cxl_overflow_event *ev) { return 0; }; static inline int ras_store_cxl_generic_event(struct ras_events *ras, struct ras_cxl_generic_event *ev) { return 0; }; static inline int ras_store_cxl_general_media_event(struct ras_events *ras, struct ras_cxl_general_media_event *ev) { return 0; }; static inline int ras_store_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *ev) { return 0; }; static inline int ras_store_cxl_memory_module_event(struct ras_events *ras, struct ras_cxl_memory_module_event *ev) { return 0; }; #endif #endif 07070100000063000081A400000000000000000000000165C04BE400007590000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-report.c/* * Copyright (c) 
2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/utsname.h> #include <sys/socket.h> #include <sys/un.h> #include "ras-report.h" static int setup_report_socket(void) { int sockfd = -1; int rc = -1; struct sockaddr_un addr; sockfd = socket(AF_UNIX, SOCK_STREAM, 0); if (sockfd < 0) { return -1; } memset(&addr, 0, sizeof(struct sockaddr_un)); addr.sun_family = AF_UNIX; strncpy(addr.sun_path, ABRT_SOCKET, sizeof(addr.sun_path)); addr.sun_path[sizeof(addr.sun_path) - 1] = '\0'; rc = connect(sockfd, (struct sockaddr *)&addr, sizeof(struct sockaddr_un)); if (rc < 0) { close(sockfd); return -1; } return sockfd; } static int commit_report_basic(int sockfd) { char buf[INPUT_BUFFER_SIZE]; struct utsname un; int rc = -1; if (sockfd < 0) { return rc; } memset(buf, 0, INPUT_BUFFER_SIZE); memset(&un, 0, sizeof(struct utsname)); rc = uname(&un); if (rc < 0) { return rc; } /* * ABRT server protocol */ sprintf(buf, "PUT / HTTP/1.1\r\n\r\n"); rc = write(sockfd, buf, strlen(buf)); /* write() returns -1 on error: test the sign before comparing with the unsigned strlen() */ if (rc < 0 || (size_t)rc < strlen(buf)) { return -1; } sprintf(buf, "PID=%d", (int)getpid()); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < 0 || (size_t)rc < strlen(buf) + 1) { return -1; } sprintf(buf, "EXECUTABLE=/boot/vmlinuz-%s", un.release); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < 0 || (size_t)rc < strlen(buf) + 1) { return -1; } sprintf(buf, "TYPE=%s", "ras"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < 0 || (size_t)rc < strlen(buf) + 1) { return -1; } return 0; } static int set_mc_event_backtrace(char *buf, struct 
ras_mc_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; /* print the index and layers as integers: %c would emit a raw byte, and an index of 0 would truncate bt_buf at the strcat() below */ sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "error_count=%d\n" \ "error_type=%s\n" \ "msg=%s\n" \ "label=%s\n" \ "mc_index=%d\n" \ "top_layer=%d\n" \ "middle_layer=%d\n" \ "lower_layer=%d\n" \ "address=%llu\n" \ "grain=%llu\n" \ "syndrome=%llu\n" \ "driver_detail=%s\n", \ ev->timestamp, \ ev->error_count, \ ev->error_type, \ ev->msg, \ ev->label, \ ev->mc_index, \ ev->top_layer, \ ev->middle_layer, \ ev->lower_layer, \ ev->address, \ ev->grain, \ ev->syndrome, \ ev->driver_detail); strcat(buf, bt_buf); return 0; } static int set_mce_event_backtrace(char *buf, struct mce_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "bank_name=%s\n" \ "error_msg=%s\n" \ "mcgstatus_msg=%s\n" \ "mcistatus_msg=%s\n" \ "mcastatus_msg=%s\n" \ "user_action=%s\n" \ "mc_location=%s\n" \ "mcgcap=%lu\n" \ "mcgstatus=%lu\n" \ "status=%lu\n" \ "addr=%lu\n" \ "misc=%lu\n" \ "ip=%lu\n" \ "tsc=%lu\n" \ "walltime=%lu\n" \ "cpu=%u\n" \ "cpuid=%u\n" \ "apicid=%u\n" \ "socketid=%u\n" \ "cs=%d\n" \ "bank=%d\n" \ "cpuvendor=%d\n", \ ev->timestamp, \ ev->bank_name, \ ev->error_msg, \ ev->mcgstatus_msg, \ ev->mcistatus_msg, \ ev->mcastatus_msg, \ ev->user_action, \ ev->mc_location, \ ev->mcgcap, \ ev->mcgstatus, \ ev->status, \ ev->addr, \ ev->misc, \ ev->ip, \ ev->tsc, \ ev->walltime, \ ev->cpu, \ ev->cpuid, \ ev->apicid, \ ev->socketid, \ ev->cs, \ ev->bank, \ ev->cpuvendor); strcat(buf, bt_buf); return 0; } static int set_aer_event_backtrace(char *buf, struct ras_aer_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "error_type=%s\n" \ "dev_name=%s\n" \ "msg=%s\n", \ ev->timestamp, \ ev->error_type, \ ev->dev_name, \ ev->msg); strcat(buf, bt_buf); return 0; } static int set_non_standard_event_backtrace(char *buf, struct ras_non_standard_event *ev) { char 
bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "severity=%s\n" \ "length=%d\n", \ ev->timestamp, \ ev->severity, \ ev->length); strcat(buf, bt_buf); return 0; } static int set_arm_event_backtrace(char *buf, struct ras_arm_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "error_count=%d\n" \ "affinity=%d\n" \ "mpidr=0x%lx\n" \ "midr=0x%lx\n" \ "running_state=%d\n" \ "psci_state=%d\n", \ ev->timestamp, \ ev->error_count, \ ev->affinity, \ ev->mpidr, \ ev->midr, \ ev->running_state, \ ev->psci_state); strcat(buf, bt_buf); return 0; } static int set_devlink_event_backtrace(char *buf, struct devlink_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "bus_name=%s\n" \ "dev_name=%s\n" \ "driver_name=%s\n" \ "reporter_name=%s\n" \ "msg=%s\n", \ ev->timestamp, \ ev->bus_name, \ ev->dev_name, \ ev->driver_name, \ ev->reporter_name, \ ev->msg); strcat(buf, bt_buf); return 0; } static int set_diskerror_event_backtrace(char *buf, struct diskerror_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "dev=%s\n" \ "sector=%llu\n" \ "nr_sector=%u\n" \ "error=%s\n" \ "rwbs=%s\n" \ "cmd=%s\n", \ ev->timestamp, \ ev->dev, \ ev->sector, \ ev->nr_sector, \ ev->error, \ ev->rwbs, \ ev->cmd); strcat(buf, bt_buf); return 0; } static int set_mf_event_backtrace(char *buf, struct ras_mf_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "pfn=%s\n" \ "page_type=%s\n" \ "action_result=%s\n", \ ev->timestamp, \ ev->pfn, \ ev->page_type, \ ev->action_result); strcat(buf, bt_buf); return 0; } static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; 
sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "trace_type=%s\n" \ "region=%s\n" \ "region_uuid=%s\n" \ "hpa=0x%lx\n" \ "dpa=0x%lx\n" \ "dpa_length=0x%x\n" \ "source=%s\n" \ "flags=%u\n" \ "overflow_timestamp=%s\n", \ ev->timestamp, \ ev->memdev, \ ev->host, \ ev->serial, \ ev->trace_type, \ ev->region, \ ev->uuid, \ ev->hpa, \ ev->dpa, \ ev->dpa_length, \ ev->source, \ ev->flags, \ ev->overflow_ts); strcat(buf, bt_buf); return 0; } static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "error_status=%u\n" \ "first_error=%u\n", \ ev->timestamp, \ ev->memdev, \ ev->host, \ ev->serial, \ ev->error_status, \ ev->first_error); strcat(buf, bt_buf); return 0; } static int set_cxl_aer_ce_event_backtrace(char *buf, struct ras_cxl_aer_ce_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "error_status=%u\n", \ ev->timestamp, \ ev->memdev, \ ev->host, \ ev->serial, \ ev->error_status); strcat(buf, bt_buf); return 0; } static int set_cxl_overflow_event_backtrace(char *buf, struct ras_cxl_overflow_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "log_type=%s\n" \ "count=%u\n" \ "first_ts=%s\n" \ "last_ts=%s\n", \ ev->timestamp, \ ev->memdev, \ ev->host, \ ev->serial, \ ev->log_type, \ ev->count, \ ev->first_ts, \ ev->last_ts); strcat(buf, bt_buf); return 0; } static int set_cxl_generic_event_backtrace(char *buf, struct ras_cxl_generic_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ 
"serial=0x%lx\n" \ "log_type=%s\n" \ "hdr_uuid=%s\n" \ "hdr_flags=0x%x\n" \ "hdr_handle=0x%x\n" \ "hdr_related_handle=0x%x\n" \ "hdr_timestamp=%s\n" \ "hdr_length=%u\n" \ "hdr_maint_op_class=%u\n", \ ev->hdr.timestamp, \ ev->hdr.memdev, \ ev->hdr.host, \ ev->hdr.serial, \ ev->hdr.log_type, \ ev->hdr.hdr_uuid, \ ev->hdr.hdr_flags, \ ev->hdr.hdr_handle, \ ev->hdr.hdr_related_handle, \ ev->hdr.hdr_timestamp, \ ev->hdr.hdr_length, \ ev->hdr.hdr_maint_op_class); strcat(buf, bt_buf); return 0; } static int set_cxl_general_media_event_backtrace(char *buf, struct ras_cxl_general_media_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "log_type=%s\n" \ "hdr_uuid=%s\n" \ "hdr_flags=0x%x\n" \ "hdr_handle=0x%x\n" \ "hdr_related_handle=0x%x\n" \ "hdr_timestamp=%s\n" \ "hdr_length=%u\n" \ "hdr_maint_op_class=%u\n" \ "dpa=0x%lx\n" \ "dpa_flags=%u\n" \ "descriptor=%u\n" \ "type=%u\n" \ "transaction_type=%u\n" \ "channel=%u\n" \ "rank=%u\n" \ "device=0x%x\n", \ ev->hdr.timestamp, \ ev->hdr.memdev, \ ev->hdr.host, \ ev->hdr.serial, \ ev->hdr.log_type, \ ev->hdr.hdr_uuid, \ ev->hdr.hdr_flags, \ ev->hdr.hdr_handle, \ ev->hdr.hdr_related_handle, \ ev->hdr.hdr_timestamp, \ ev->hdr.hdr_length, \ ev->hdr.hdr_maint_op_class, \ ev->dpa, \ ev->dpa_flags, \ ev->descriptor, \ ev->type, \ ev->transaction_type, \ ev->channel, \ ev->rank, \ ev->device); strcat(buf, bt_buf); return 0; } static int set_cxl_dram_event_backtrace(char *buf, struct ras_cxl_dram_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "log_type=%s\n" \ "hdr_uuid=%s\n" \ "hdr_flags=0x%x\n" \ "hdr_handle=0x%x\n" \ "hdr_related_handle=0x%x\n" \ "hdr_timestamp=%s\n" \ "hdr_length=%u\n" \ "hdr_maint_op_class=%u\n" \ "dpa=0x%lx\n" \ "dpa_flags=%u\n" \ "descriptor=%u\n" \ "type=%u\n" \ 
"transaction_type=%u\n" \ "channel=%u\n" \ "rank=%u\n" \ "nibble_mask=%u\n" \ "bank_group=%u\n" \ "bank=%u\n" \ "row=%u\n" \ "column=%u\n", \ ev->hdr.timestamp, \ ev->hdr.memdev, \ ev->hdr.host, \ ev->hdr.serial, \ ev->hdr.log_type, \ ev->hdr.hdr_uuid, \ ev->hdr.hdr_flags, \ ev->hdr.hdr_handle, \ ev->hdr.hdr_related_handle, \ ev->hdr.hdr_timestamp, \ ev->hdr.hdr_length, \ ev->hdr.hdr_maint_op_class, \ ev->dpa, \ ev->dpa_flags, \ ev->descriptor, \ ev->type, \ ev->transaction_type, \ ev->channel, \ ev->rank, \ ev->nibble_mask, \ ev->bank_group, \ ev->bank, \ ev->row, \ ev->column); strcat(buf, bt_buf); return 0; } static int set_cxl_memory_module_event_backtrace(char *buf, struct ras_cxl_memory_module_event *ev) { char bt_buf[MAX_BACKTRACE_SIZE]; if (!buf || !ev) return -1; sprintf(bt_buf, "BACKTRACE=" \ "timestamp=%s\n" \ "memdev=%s\n" \ "host=%s\n" \ "serial=0x%lx\n" \ "log_type=%s\n" \ "hdr_uuid=%s\n" \ "hdr_flags=0x%x\n" \ "hdr_handle=0x%x\n" \ "hdr_related_handle=0x%x\n" \ "hdr_timestamp=%s\n" \ "hdr_length=%u\n" \ "hdr_maint_op_class=%u\n" \ "event_type=%u\n" \ "health_status=%u\n" \ "media_status=%u\n" \ "life_used=%u\n" \ "dirty_shutdown_cnt=%u\n" \ "cor_vol_err_cnt=%u\n" \ "cor_per_err_cnt=%u\n" \ "device_temp=%d\n" \ "add_status=%u\n", \ ev->hdr.timestamp, \ ev->hdr.memdev, \ ev->hdr.host, \ ev->hdr.serial, \ ev->hdr.log_type, \ ev->hdr.hdr_uuid, \ ev->hdr.hdr_flags, \ ev->hdr.hdr_handle, \ ev->hdr.hdr_related_handle, \ ev->hdr.hdr_timestamp, \ ev->hdr.hdr_length, \ ev->hdr.hdr_maint_op_class, \ ev->event_type, \ ev->health_status, \ ev->media_status, \ ev->life_used, \ ev->dirty_shutdown_cnt, \ ev->cor_vol_err_cnt, \ ev->cor_per_err_cnt, \ ev->device_temp, \ ev->add_status); strcat(buf, bt_buf); return 0; } static int commit_report_backtrace(int sockfd, int type, void *ev) { char buf[MAX_BACKTRACE_SIZE]; char *pbuf = buf; int rc = -1; int buf_len = 0; if (sockfd < 0 || !ev) { return -1; } memset(buf, 0, MAX_BACKTRACE_SIZE); switch (type) { case MC_EVENT: 
rc = set_mc_event_backtrace(buf, (struct ras_mc_event *)ev); break; case AER_EVENT: rc = set_aer_event_backtrace(buf, (struct ras_aer_event *)ev); break; case MCE_EVENT: rc = set_mce_event_backtrace(buf, (struct mce_event *)ev); break; case NON_STANDARD_EVENT: rc = set_non_standard_event_backtrace(buf, (struct ras_non_standard_event *)ev); break; case ARM_EVENT: rc = set_arm_event_backtrace(buf, (struct ras_arm_event *)ev); break; case DEVLINK_EVENT: rc = set_devlink_event_backtrace(buf, (struct devlink_event *)ev); break; case DISKERROR_EVENT: rc = set_diskerror_event_backtrace(buf, (struct diskerror_event *)ev); break; case MF_EVENT: rc = set_mf_event_backtrace(buf, (struct ras_mf_event *)ev); break; case CXL_POISON_EVENT: rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev); break; case CXL_AER_UE_EVENT: rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev); break; case CXL_AER_CE_EVENT: rc = set_cxl_aer_ce_event_backtrace(buf, (struct ras_cxl_aer_ce_event *)ev); break; case CXL_OVERFLOW_EVENT: rc = set_cxl_overflow_event_backtrace(buf, (struct ras_cxl_overflow_event *)ev); break; case CXL_GENERIC_EVENT: rc = set_cxl_generic_event_backtrace(buf, (struct ras_cxl_generic_event *)ev); break; case CXL_GENERAL_MEDIA_EVENT: rc = set_cxl_general_media_event_backtrace(buf, (struct ras_cxl_general_media_event *)ev); break; case CXL_DRAM_EVENT: rc = set_cxl_dram_event_backtrace(buf, (struct ras_cxl_dram_event *)ev); break; case CXL_MEMORY_MODULE_EVENT: rc = set_cxl_memory_module_event_backtrace(buf, (struct ras_cxl_memory_module_event *)ev); break; default: return -1; } if (rc < 0) { return -1; } buf_len = strlen(buf); for (; buf_len > INPUT_BUFFER_SIZE - 1; buf_len -= (INPUT_BUFFER_SIZE - 1)) { rc = write(sockfd, pbuf, INPUT_BUFFER_SIZE - 1); if (rc < INPUT_BUFFER_SIZE - 1) { return -1; } pbuf = pbuf + INPUT_BUFFER_SIZE - 1; } /* the last chunk also carries the terminating NUL */ rc = write(sockfd, pbuf, buf_len + 1); if (rc < buf_len + 1) { return -1; } return 0; } int
ras_report_mc_event(struct ras_events *ras, struct ras_mc_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = -1; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return -1; } rc = commit_report_basic(sockfd); if (rc < 0) { goto mc_fail; } rc = commit_report_backtrace(sockfd, MC_EVENT, ev); if (rc < 0) { goto mc_fail; } sprintf(buf, "ANALYZER=%s", "rasdaemon-mc"); rc = write(sockfd, buf, strlen(buf) + 1); /* compare as int: otherwise a -1 from write() is promoted to size_t and the error is missed */ if (rc < (int)strlen(buf) + 1) { goto mc_fail; } sprintf(buf, "REASON=%s", "EDAC driver report problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto mc_fail; } done = 1; mc_fail: if (sockfd >= 0) { close(sockfd); } if (done) { return 0; } else { return -1; } } int ras_report_aer_event(struct ras_events *ras, struct ras_aer_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return -1; } rc = commit_report_basic(sockfd); if (rc < 0) { goto aer_fail; } rc = commit_report_backtrace(sockfd, AER_EVENT, ev); if (rc < 0) { goto aer_fail; } sprintf(buf, "ANALYZER=%s", "rasdaemon-aer"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto aer_fail; } sprintf(buf, "REASON=%s", "PCIe AER driver report problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto aer_fail; } done = 1; aer_fail: if (sockfd >= 0) { close(sockfd); } if (done) { return 0; } else { return -1; } } int ras_report_non_standard_event(struct ras_events *ras, struct ras_non_standard_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return rc; } rc = commit_report_basic(sockfd); if (rc < 0) { goto non_standard_fail; } rc = commit_report_backtrace(sockfd, NON_STANDARD_EVENT, ev); if (rc < 0) { goto non_standard_fail; } sprintf(buf, "ANALYZER=%s", 
"rasdaemon-non-standard"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { rc = -1; goto non_standard_fail; } sprintf(buf, "REASON=%s", "Unknown CPER section problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { rc = -1; goto non_standard_fail; } rc = 0; non_standard_fail: if (sockfd >= 0) { close(sockfd); } return rc; } int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return rc; } rc = commit_report_basic(sockfd); if (rc < 0) { goto arm_fail; } rc = commit_report_backtrace(sockfd, ARM_EVENT, ev); if (rc < 0) { goto arm_fail; } sprintf(buf, "ANALYZER=%s", "rasdaemon-arm"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { rc = -1; goto arm_fail; } sprintf(buf, "REASON=%s", "ARM CPU report problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { rc = -1; goto arm_fail; } rc = 0; arm_fail: if (sockfd >= 0) { close(sockfd); } return rc; } int ras_report_mce_event(struct ras_events *ras, struct mce_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return -1; } rc = commit_report_basic(sockfd); if (rc < 0) { goto mce_fail; } rc = commit_report_backtrace(sockfd, MCE_EVENT, ev); if (rc < 0) { goto mce_fail; } sprintf(buf, "ANALYZER=%s", "rasdaemon-mce"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto mce_fail; } sprintf(buf, "REASON=%s", "Machine Check driver report problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto mce_fail; } done = 1; mce_fail: if (sockfd >= 0) { close(sockfd); } if (done) { return 0; } else { return -1; } } int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; 
memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return -1; } rc = commit_report_basic(sockfd); if (rc < 0) { goto devlink_fail; } rc = commit_report_backtrace(sockfd, DEVLINK_EVENT, ev); if (rc < 0) { goto devlink_fail; } sprintf(buf, "ANALYZER=%s", "rasdaemon-devlink"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto devlink_fail; } sprintf(buf, "REASON=%s", "devlink health report problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto devlink_fail; } done = 1; devlink_fail: if (sockfd >= 0) { close(sockfd); } if (done) { return 0; } else { return -1; } } int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) { return -1; } rc = commit_report_basic(sockfd); if (rc < 0) { goto diskerror_fail; } rc = commit_report_backtrace(sockfd, DISKERROR_EVENT, ev); if (rc < 0) { goto diskerror_fail; } sprintf(buf, "ANALYZER=%s", "rasdaemon-diskerror"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto diskerror_fail; } sprintf(buf, "REASON=%s", "disk I/O error"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) { goto diskerror_fail; } done = 1; diskerror_fail: if (sockfd >= 0) { close(sockfd); } if (done) { return 0; } else { return -1; } } int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto mf_fail; rc = commit_report_backtrace(sockfd, MF_EVENT, ev); if (rc < 0) goto mf_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-memory_failure"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto mf_fail; sprintf(buf, "REASON=%s", 
"memory failure problem"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto mf_fail; done = 1; mf_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_poison_fail; rc = commit_report_backtrace(sockfd, CXL_POISON_EVENT, ev); if (rc < 0) goto cxl_poison_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-poison"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_poison_fail; sprintf(buf, "REASON=%s", "CXL poison"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_poison_fail; done = 1; cxl_poison_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_aer_ue_fail; rc = commit_report_backtrace(sockfd, CXL_AER_UE_EVENT, ev); if (rc < 0) goto cxl_aer_ue_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-uncorrectable-error"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_aer_ue_fail; sprintf(buf, "REASON=%s", "CXL AER uncorrectable error"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_aer_ue_fail; done = 1; cxl_aer_ue_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = 
setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_aer_ce_fail; rc = commit_report_backtrace(sockfd, CXL_AER_CE_EVENT, ev); if (rc < 0) goto cxl_aer_ce_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-correctable-error"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_aer_ce_fail; sprintf(buf, "REASON=%s", "CXL AER correctable error"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_aer_ce_fail; done = 1; cxl_aer_ce_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_overflow_event(struct ras_events *ras, struct ras_cxl_overflow_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_overflow_fail; rc = commit_report_backtrace(sockfd, CXL_OVERFLOW_EVENT, ev); if (rc < 0) goto cxl_overflow_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-overflow"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_overflow_fail; sprintf(buf, "REASON=%s", "CXL overflow"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_overflow_fail; done = 1; cxl_overflow_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_generic_event(struct ras_events *ras, struct ras_cxl_generic_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_generic_fail; rc = commit_report_backtrace(sockfd, CXL_GENERIC_EVENT, ev); if (rc < 0) goto cxl_generic_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl_generic_event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_generic_fail; sprintf(buf, 
"REASON=%s", "CXL Generic Event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_generic_fail; done = 1; cxl_generic_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_general_media_event(struct ras_events *ras, struct ras_cxl_general_media_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_general_media_fail; rc = commit_report_backtrace(sockfd, CXL_GENERAL_MEDIA_EVENT, ev); if (rc < 0) goto cxl_general_media_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl_general_media_event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_general_media_fail; sprintf(buf, "REASON=%s", "CXL General Media Event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_general_media_fail; done = 1; cxl_general_media_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *ev) { char buf[MAX_MESSAGE_SIZE]; int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_dram_fail; rc = commit_report_backtrace(sockfd, CXL_DRAM_EVENT, ev); if (rc < 0) goto cxl_dram_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl_dram_event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_dram_fail; sprintf(buf, "REASON=%s", "CXL DRAM Event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_dram_fail; done = 1; cxl_dram_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } int ras_report_cxl_memory_module_event(struct ras_events *ras, struct ras_cxl_memory_module_event *ev) { char buf[MAX_MESSAGE_SIZE]; 
int sockfd = 0; int done = 0; int rc = -1; memset(buf, 0, sizeof(buf)); sockfd = setup_report_socket(); if (sockfd < 0) return -1; rc = commit_report_basic(sockfd); if (rc < 0) goto cxl_memory_module_fail; rc = commit_report_backtrace(sockfd, CXL_MEMORY_MODULE_EVENT, ev); if (rc < 0) goto cxl_memory_module_fail; sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl_memory_module_event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_memory_module_fail; sprintf(buf, "REASON=%s", "CXL Memory Module Event"); rc = write(sockfd, buf, strlen(buf) + 1); if (rc < (int)strlen(buf) + 1) goto cxl_memory_module_fail; done = 1; cxl_memory_module_fail: if (sockfd >= 0) close(sockfd); if (done) return 0; else return -1; } 07070100000064000081A400000000000000000000000165C04BE4000010F7000000000000000000000000000000000000002C00000000rasdaemon-0.8.0.49.git+f9cb13b/ras-report.h/* * Copyright (c) 2016, The Linux Foundation. All rights reserved. * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 and * only version 2 as published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. */ #ifndef __RAS_REPORT_H #define __RAS_REPORT_H #include "ras-record.h" #include "ras-events.h" #include "ras-mc-handler.h" #include "ras-mce-handler.h" #include "ras-aer-handler.h" /* Maximal length of backtrace. */ #define MAX_BACKTRACE_SIZE (1024*1024) /* Amount of data received from one client for a message before reporting error. */ #define MAX_MESSAGE_SIZE (4*MAX_BACKTRACE_SIZE) /* Maximal number of characters read from socket at once. 
*/ #define INPUT_BUFFER_SIZE (8*1024) /* ABRT socket file */ #define ABRT_SOCKET "/var/run/abrt/abrt.socket" #ifdef HAVE_ABRT_REPORT int ras_report_mc_event(struct ras_events *ras, struct ras_mc_event *ev); int ras_report_aer_event(struct ras_events *ras, struct ras_aer_event *ev); int ras_report_mce_event(struct ras_events *ras, struct mce_event *ev); int ras_report_non_standard_event(struct ras_events *ras, struct ras_non_standard_event *ev); int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev); int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev); int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev); int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev); int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev); int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev); int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev); int ras_report_cxl_overflow_event(struct ras_events *ras, struct ras_cxl_overflow_event *ev); int ras_report_cxl_generic_event(struct ras_events *ras, struct ras_cxl_generic_event *ev); int ras_report_cxl_general_media_event(struct ras_events *ras, struct ras_cxl_general_media_event *ev); int ras_report_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *ev); int ras_report_cxl_memory_module_event(struct ras_events *ras, struct ras_cxl_memory_module_event *ev); #else static inline int ras_report_mc_event(struct ras_events *ras, struct ras_mc_event *ev) { return 0; }; static inline int ras_report_aer_event(struct ras_events *ras, struct ras_aer_event *ev) { return 0; }; static inline int ras_report_mce_event(struct ras_events *ras, struct mce_event *ev) { return 0; }; static inline int ras_report_non_standard_event(struct ras_events *ras, struct ras_non_standard_event *ev) { return 0; }; static inline int ras_report_arm_event(struct 
ras_events *ras, struct ras_arm_event *ev) { return 0; }; static inline int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; }; static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; }; static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; }; static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; }; static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; }; static inline int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; }; static inline int ras_report_cxl_overflow_event(struct ras_events *ras, struct ras_cxl_overflow_event *ev) { return 0; }; static inline int ras_report_cxl_generic_event(struct ras_events *ras, struct ras_cxl_generic_event *ev) { return 0; }; static inline int ras_report_cxl_general_media_event(struct ras_events *ras, struct ras_cxl_general_media_event *ev) { return 0; }; static inline int ras_report_cxl_dram_event(struct ras_events *ras, struct ras_cxl_dram_event *ev) { return 0; }; static inline int ras_report_cxl_memory_module_event(struct ras_events *ras, struct ras_cxl_memory_module_event *ev) { return 0; }; #endif #endif 07070100000065000081A400000000000000000000000165C04BE40000130A000000000000000000000000000000000000002B00000000rasdaemon-0.8.0.49.git+f9cb13b/rasdaemon.c/* * Copyright (C) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ #include <argp.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include "ras-record.h" #include "ras-logger.h" #include "ras-events.h" /* * Arguments(argp) handling logic and main */ #define TOOL_NAME "rasdaemon" #define TOOL_DESCRIPTION "RAS daemon to log the RAS events." #define ARGS_DOC "<options>" #define DISABLE "DISABLE" char *choices_disable; const char *argp_program_version = TOOL_NAME " " VERSION; const char *argp_program_bug_address = "Mauro Carvalho Chehab <mchehab@kernel.org>"; struct arguments { int record_events; int enable_ras; int foreground; int offline; }; enum OFFLINE_ARG_KEYS { SMCA = 0x100, MODEL, FAMILY, BANK_NUM, IPID_REG, STATUS_REG, SYNDROME_REG }; struct ras_mc_offline_event event; static error_t parse_opt(int k, char *arg, struct argp_state *state) { struct arguments *args = state->input; switch (k) { case 'e': args->enable_ras++; break; case 'd': args->enable_ras--; break; #ifdef HAVE_SQLITE3 case 'r': args->record_events++; break; #endif case 'f': args->foreground++; break; #ifdef HAVE_MCE case 'p': if (state->argc < 4) argp_state_help(state, stdout, ARGP_HELP_LONG | ARGP_HELP_EXIT_ERR); args->offline++; break; #endif default: return ARGP_ERR_UNKNOWN; } return 0; } #ifdef HAVE_MCE static error_t parse_opt_offline(int key, char *arg, struct argp_state *state) { switch (key) { case SMCA: event.smca = true; break; case MODEL: event.model = strtoul(state->argv[state->next], NULL, 0); break; case FAMILY: event.family = strtoul(state->argv[state->next], NULL, 
0); break; case BANK_NUM: event.bank = atoi(state->argv[state->next]); break; case IPID_REG: event.ipid = strtoull(state->argv[state->next], NULL, 0); break; case STATUS_REG: event.status = strtoull(state->argv[state->next], NULL, 0); break; case SYNDROME_REG: event.synd = strtoull(state->argv[state->next], NULL, 0); break; default: return ARGP_ERR_UNKNOWN; } return 0; } #endif long user_hz; int main(int argc, char *argv[]) { struct arguments args; int idx = -1; choices_disable = getenv(DISABLE); #ifdef HAVE_MCE const struct argp_option offline_options[] = { {"smca", SMCA, 0, 0, "AMD SMCA Error Decoding"}, {"model", MODEL, 0, 0, "CPU Model"}, {"family", FAMILY, 0, 0, "CPU Family"}, {"bank", BANK_NUM, 0, 0, "Bank Number"}, {"ipid", IPID_REG, 0, 0, "IPID Register (for SMCA systems only)"}, {"status", STATUS_REG, 0, 0, "Status Register"}, {"synd", SYNDROME_REG, 0, 0, "Syndrome Register"}, {0, 0, 0, 0, 0, 0}, }; struct argp offline_argp = { .options = offline_options, .parser = parse_opt_offline, .doc = TOOL_DESCRIPTION, .args_doc = ARGS_DOC, }; struct argp_child offline_parser[] = { {&offline_argp, 0, "Post-Processing Options:", 0}, {0, 0, 0, 0}, }; #endif const struct argp_option options[] = { {"enable", 'e', 0, 0, "enable RAS events and exit", 0}, {"disable", 'd', 0, 0, "disable RAS events and exit", 0}, #ifdef HAVE_SQLITE3 {"record", 'r', 0, 0, "record events via sqlite3", 0}, #endif {"foreground", 'f', 0, 0, "run foreground, not daemonize"}, #ifdef HAVE_MCE {"post-processing", 'p', 0, 0, "Post-processing MCE's with raw register values"}, #endif { 0, 0, 0, 0, 0, 0 } }; const struct argp argp = { .options = options, .parser = parse_opt, .doc = TOOL_DESCRIPTION, .args_doc = ARGS_DOC, #ifdef HAVE_MCE .children = offline_parser, #endif }; memset(&args, 0, sizeof(args)); user_hz = sysconf(_SC_CLK_TCK); argp_parse(&argp, argc, argv, 0, &idx, &args); if (idx < 0) { argp_help(&argp, stderr, ARGP_HELP_STD_HELP, TOOL_NAME); return -1; } if (args.enable_ras) { int enable; 
enable = (args.enable_ras > 0) ? 1 : 0; toggle_ras_mc_event(enable); return 0; } #ifdef HAVE_MCE if (args.offline) { ras_offline_mce_event(&event); return 0; } #endif openlog(TOOL_NAME, 0, LOG_DAEMON); if (!args.foreground) if (daemon(0, 0)) exit(EXIT_FAILURE); handle_ras_events(args.record_events); return 0; } 07070100000066000081A400000000000000000000000165C04BE400002163000000000000000000000000000000000000002800000000rasdaemon-0.8.0.49.git+f9cb13b/rbtree.c/* Red Black Trees (C) 1999 Andrea Arcangeli <andrea@suse.de> (C) 2002 David Woodhouse <dwmw2@infradead.org> Taken from the Linux 2.6.30 source with some minor modifications. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

  linux/lib/rbtree.c
*/

#include "rbtree.h"

static void __rb_rotate_left(struct rb_node *node, struct rb_root *root)
{
	struct rb_node *right = node->rb_right;
	struct rb_node *parent = rb_parent(node);

	node->rb_right = right->rb_left;
	if (node->rb_right)
		rb_set_parent(right->rb_left, node);
	right->rb_left = node;

	rb_set_parent(right, parent);

	if (parent) {
		if (node == parent->rb_left)
			parent->rb_left = right;
		else
			parent->rb_right = right;
	} else
		root->rb_node = right;
	rb_set_parent(node, right);
}

static void __rb_rotate_right(struct rb_node *node, struct rb_root *root)
{
	struct rb_node *left = node->rb_left;
	struct rb_node *parent = rb_parent(node);

	node->rb_left = left->rb_right;
	if (node->rb_left)
		rb_set_parent(left->rb_right, node);
	left->rb_right = node;

	rb_set_parent(left, parent);

	if (parent) {
		if (node == parent->rb_right)
			parent->rb_right = left;
		else
			parent->rb_left = left;
	} else
		root->rb_node = left;
	rb_set_parent(node, left);
}

void rb_insert_color(struct rb_node *node, struct rb_root *root)
{
	struct rb_node *parent, *gparent;

	while ((parent = rb_parent(node)) && rb_is_red(parent)) {
		gparent = rb_parent(parent);

		if (parent == gparent->rb_left) {
			{
				register struct rb_node *uncle = gparent->rb_right;

				if (uncle && rb_is_red(uncle)) {
					rb_set_black(uncle);
					rb_set_black(parent);
					rb_set_red(gparent);
					node = gparent;
					continue;
				}
			}

			if (parent->rb_right == node) {
				struct rb_node *tmp;

				__rb_rotate_left(parent, root);
				tmp = parent;
				parent = node;
				node = tmp;
			}

			rb_set_black(parent);
			rb_set_red(gparent);
			__rb_rotate_right(gparent, root);
		} else {
			{
				struct rb_node *uncle = gparent->rb_left;

				if (uncle && rb_is_red(uncle)) {
					rb_set_black(uncle);
					rb_set_black(parent);
					rb_set_red(gparent);
					node = gparent;
					continue;
				}
			}

			if (parent->rb_left == node) {
				struct rb_node *tmp;

				__rb_rotate_right(parent, root);
				tmp = parent;
				parent = node;
				node = tmp;
			}

			rb_set_black(parent);
			rb_set_red(gparent);
			__rb_rotate_left(gparent, root);
		}
	}

	rb_set_black(root->rb_node);
}

static void __rb_erase_color(struct rb_node *node, struct rb_node *parent,
			     struct rb_root *root)
{
	struct rb_node *other;

	while ((!node || rb_is_black(node)) && node != root->rb_node) {
		if (parent->rb_left == node) {
			other = parent->rb_right;
			if (rb_is_red(other)) {
				rb_set_black(other);
				rb_set_red(parent);
				__rb_rotate_left(parent, root);
				other = parent->rb_right;
			}
			if ((!other->rb_left || rb_is_black(other->rb_left)) &&
			    (!other->rb_right || rb_is_black(other->rb_right))) {
				rb_set_red(other);
				node = parent;
				parent = rb_parent(node);
			} else {
				if (!other->rb_right || rb_is_black(other->rb_right)) {
					rb_set_black(other->rb_left);
					rb_set_red(other);
					__rb_rotate_right(other, root);
					other = parent->rb_right;
				}
				rb_set_color(other, rb_color(parent));
				rb_set_black(parent);
				rb_set_black(other->rb_right);
				__rb_rotate_left(parent, root);
				node = root->rb_node;
				break;
			}
		} else {
			other = parent->rb_left;
			if (rb_is_red(other)) {
				rb_set_black(other);
				rb_set_red(parent);
				__rb_rotate_right(parent, root);
				other = parent->rb_left;
			}
			if ((!other->rb_left || rb_is_black(other->rb_left)) &&
			    (!other->rb_right || rb_is_black(other->rb_right))) {
				rb_set_red(other);
				node = parent;
				parent = rb_parent(node);
			} else {
				if (!other->rb_left || rb_is_black(other->rb_left)) {
					rb_set_black(other->rb_right);
					rb_set_red(other);
					__rb_rotate_left(other, root);
					other = parent->rb_left;
				}
				rb_set_color(other, rb_color(parent));
				rb_set_black(parent);
				rb_set_black(other->rb_left);
				__rb_rotate_right(parent, root);
				node = root->rb_node;
				break;
			}
		}
	}
	if (node)
		rb_set_black(node);
}

void rb_erase(struct rb_node *node, struct rb_root *root)
{
	struct rb_node *child, *parent;
	int color;

	if (!node->rb_left)
		child = node->rb_right;
	else if (!node->rb_right)
		child = node->rb_left;
	else {
		struct rb_node *old = node, *left;

		node = node->rb_right;
		while ((left = node->rb_left) != NULL)
			node = left;
		child = node->rb_right;
		parent = rb_parent(node);
		color = rb_color(node);

		if (child)
			rb_set_parent(child, parent);
		if (parent == old) {
			parent->rb_right = child;
			parent = node;
		} else
			parent->rb_left = child;

		node->rb_parent_color = old->rb_parent_color;
		node->rb_right = old->rb_right;
		node->rb_left = old->rb_left;

		if (rb_parent(old)) {
			if (rb_parent(old)->rb_left == old)
				rb_parent(old)->rb_left = node;
			else
				rb_parent(old)->rb_right = node;
		} else
			root->rb_node = node;

		rb_set_parent(old->rb_left, node);
		if (old->rb_right)
			rb_set_parent(old->rb_right, node);
		goto color;
	}

	parent = rb_parent(node);
	color = rb_color(node);

	if (child)
		rb_set_parent(child, parent);
	if (parent) {
		if (parent->rb_left == node)
			parent->rb_left = child;
		else
			parent->rb_right = child;
	} else
		root->rb_node = child;

 color:
	if (color == RB_BLACK)
		__rb_erase_color(child, parent, root);
}

/*
 * This function returns the first node (in sort order) of the tree.
 */
struct rb_node *rb_first(const struct rb_root *root)
{
	struct rb_node *n;

	n = root->rb_node;
	if (!n)
		return NULL;
	while (n->rb_left)
		n = n->rb_left;
	return n;
}

struct rb_node *rb_last(const struct rb_root *root)
{
	struct rb_node *n;

	n = root->rb_node;
	if (!n)
		return NULL;
	while (n->rb_right)
		n = n->rb_right;
	return n;
}

struct rb_node *rb_next(const struct rb_node *node)
{
	struct rb_node *parent;

	if (rb_parent(node) == node)
		return NULL;

	/* If we have a right-hand child, go down and then left as far
	   as we can. */
	if (node->rb_right) {
		node = node->rb_right;
		while (node->rb_left)
			node = node->rb_left;
		return (struct rb_node *)node;
	}

	/* No right-hand children. Everything down and left is
	   smaller than us, so any 'next' node must be in the general
	   direction of our parent. Go up the tree; any time the
	   ancestor is a right-hand child of its parent, keep going
	   up. First time it's a left-hand child of its parent, said
	   parent is our 'next' node. */
	while ((parent = rb_parent(node)) && node == parent->rb_right)
		node = parent;

	return parent;
}

struct rb_node *rb_prev(const struct rb_node *node)
{
	struct rb_node *parent;

	if (rb_parent(node) == node)
		return NULL;

	/* If we have a left-hand child, go down and then right as far
	   as we can. */
	if (node->rb_left) {
		node = node->rb_left;
		while (node->rb_right)
			node = node->rb_right;
		return (struct rb_node *)node;
	}

	/* No left-hand children. Go up till we find an ancestor which
	   is a right-hand child of its parent */
	while ((parent = rb_parent(node)) && node == parent->rb_left)
		node = parent;

	return parent;
}

void rb_replace_node(struct rb_node *victim, struct rb_node *new,
		     struct rb_root *root)
{
	struct rb_node *parent = rb_parent(victim);

	/* Set the surrounding nodes to point to the replacement */
	if (parent) {
		if (victim == parent->rb_left)
			parent->rb_left = new;
		else
			parent->rb_right = new;
	} else {
		root->rb_node = new;
	}
	if (victim->rb_left)
		rb_set_parent(victim->rb_left, new);
	if (victim->rb_right)
		rb_set_parent(victim->rb_right, new);

	/* Copy the pointers/colour from the victim to the replacement */
	*new = *victim;
}
07070100000067000081A400000000000000000000000165C04BE4000013EB000000000000000000000000000000000000002800000000rasdaemon-0.8.0.49.git+f9cb13b/rbtree.h
/*
  Red Black Trees
  (C) 1999 Andrea Arcangeli <andrea@suse.de>

  Taken from the Linux 2.6.30 source.

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

  linux/include/linux/rbtree.h

  To use rbtrees you'll have to implement your own insert and search cores.
  This avoids callbacks and the dramatic performance loss they would cause.
  It's not the cleanest way, but in C (unlike C++) it is how to get both
  performance and genericity...

  Some examples of insert and search follow. The search is a plain,
  normal search over an ordered tree. The insert, instead, must be
  implemented in two steps: first, the code must insert the element in
  order as a red leaf in the tree; then the support library function
  rb_insert_color() must be called. That function does the non-trivial
  work of rebalancing the rbtree if necessary.

-----------------------------------------------------------------------
static inline struct page * rb_search_page_cache(struct inode * inode,
						 unsigned long offset)
{
	struct rb_node * n = inode->i_rb_page_cache.rb_node;
	struct page * page;

	while (n)
	{
		page = rb_entry(n, struct page, rb_page_cache);

		if (offset < page->offset)
			n = n->rb_left;
		else if (offset > page->offset)
			n = n->rb_right;
		else
			return page;
	}
	return NULL;
}

static inline struct page * __rb_insert_page_cache(struct inode * inode,
						   unsigned long offset,
						   struct rb_node * node)
{
	struct rb_node ** p = &inode->i_rb_page_cache.rb_node;
	struct rb_node * parent = NULL;
	struct page * page;

	while (*p)
	{
		parent = *p;
		page = rb_entry(parent, struct page, rb_page_cache);

		if (offset < page->offset)
			p = &(*p)->rb_left;
		else if (offset > page->offset)
			p = &(*p)->rb_right;
		else
			return page;
	}

	rb_link_node(node, parent, p);

	return NULL;
}

static inline struct page * rb_insert_page_cache(struct inode * inode,
						 unsigned long offset,
						 struct rb_node * node)
{
	struct page * ret;
	if ((ret = __rb_insert_page_cache(inode, offset, node)))
		goto out;
	rb_insert_color(node, &inode->i_rb_page_cache);
 out:
	return ret;
}
-----------------------------------------------------------------------
*/

#ifndef _LINUX_RBTREE_H
#define _LINUX_RBTREE_H

#include <stddef.h>

#define container_of(ptr, type, member) ({ \
	const typeof( ((type *)0)->member ) *__mptr = (ptr); \
	(type *)( (char *)__mptr - offsetof(type,member) );})

struct rb_node
{
	unsigned long rb_parent_color;
#define RB_RED 0
#define RB_BLACK 1
	struct rb_node *rb_right;
	struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
	/* The alignment might seem pointless, but allegedly CRIS needs it */

struct rb_root
{
	struct rb_node *rb_node;
};

#define rb_parent(r)   ((struct rb_node *)((r)->rb_parent_color & ~3))
#define rb_color(r)    ((r)->rb_parent_color & 1)
#define rb_is_red(r)   (!rb_color(r))
#define rb_is_black(r) rb_color(r)
#define rb_set_red(r)  do { (r)->rb_parent_color &= ~1; } while (0)
#define rb_set_black(r) do { (r)->rb_parent_color |= 1; } while (0)

static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p)
{
	rb->rb_parent_color = (rb->rb_parent_color & 3) | (unsigned long)p;
}
static inline void rb_set_color(struct rb_node *rb, int color)
{
	rb->rb_parent_color = (rb->rb_parent_color & ~1) | color;
}

#define RB_ROOT (struct rb_root) { NULL, }
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)
#define RB_EMPTY_NODE(node) (rb_parent(node) == node)
#define RB_CLEAR_NODE(node) (rb_set_parent(node, node))

extern void rb_insert_color(struct rb_node *, struct rb_root *);
extern void rb_erase(struct rb_node *, struct rb_root *);

/* Find logical next and previous nodes in a tree */
extern struct rb_node *rb_next(const struct rb_node *);
extern struct rb_node *rb_prev(const struct rb_node *);
extern struct rb_node *rb_first(const struct rb_root *);
extern struct rb_node *rb_last(const struct rb_root *);

/* Fast replacement of a single node without remove/rebalance/add/rebalance */
extern void rb_replace_node(struct rb_node *victim, struct rb_node *new,
			    struct rb_root *root);

static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
				struct rb_node ** rb_link)
{
	node->rb_parent_color = (unsigned long )parent;
	node->rb_left = node->rb_right = NULL;

	*rb_link = node;
}

#endif /* _LINUX_RBTREE_H */
07070100000068000041ED00000000000000000000000265C04BE400000000000000000000000000000000000000000000002400000000rasdaemon-0.8.0.49.git+f9cb13b/util
07070100000069000081A400000000000000000000000165C04BE40000000C000000000000000000000000000000000000002F00000000rasdaemon-0.8.0.49.git+f9cb13b/util/.gitignore
ras-mc-ctl
0707010000006A000081A400000000000000000000000165C04BE400000022000000000000000000000000000000000000003000000000rasdaemon-0.8.0.49.git+f9cb13b/util/Makefile.am
dist_sbin_SCRIPTS = \
	ras-mc-ctl
0707010000006B000081ED00000000000000000000000165C04BE4000101E4000000000000000000000000000000000000003200000000rasdaemon-0.8.0.49.git+f9cb13b/util/ras-mc-ctl.in
#!/usr/bin/perl -w

#******************************************************************************
# Copyright (c) 2013 Mauro Carvalho Chehab <mchehab+redhat@kernel.org>
#
# This tool is a modification of the edac-ctl, written as part of the
# edac-utils:
#  Copyright (C) 2003-2006 The Regents of the University of California.
#  Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
#  Written by Mark Grondona <mgrondona@llnl.gov>
#  UCRL-CODE-230739.
#
# This version uses the new EDAC API (v3.0.0 and later), which adds proper
# representation for the memory controllers found on Intel designs after
# 2002. It requires Linux kernel 3.5 or later to work.
#
# This is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
#****************************************************************************/

use strict;
use File::Basename;
use File::Find;
use Getopt::Long;
use POSIX;

my $dbname = "@RASSTATEDIR@/@RAS_DB_FNAME@";
my $prefix = "@prefix@";
my $sysconfdir = "@sysconfdir@";
my $dmidecode = find_prog ("dmidecode");
my $modprobe = find_prog ("modprobe") or exit (1);

my $has_aer = 0;
my $has_arm = 0;
my $has_devlink = 0;
my $has_disk_errors = 0;
my $has_extlog = 0;
my $has_mem_failure = 0;
my $has_mce = 0;

@WITH_AER_TRUE@$has_aer = 1;
@WITH_ARM_TRUE@$has_arm = 1;
@WITH_DEVLINK_TRUE@$has_devlink = 1;
@WITH_DISKERROR_TRUE@$has_disk_errors = 1;
@WITH_EXTLOG_TRUE@$has_extlog = 1;
@WITH_MEMORY_FAILURE_TRUE@$has_mem_failure = 1;
@WITH_MCE_TRUE@$has_mce = 1;

my %conf = ();
my %bus = ();
my %dimm_size = ();
my %dimm_node = ();
my %dimm_label_file = ();
my %dimm_location = ();
my %csrow_size = ();
my %rank_size = ();
my %csrow_ranks = ();
my %dimm_ce_count = ();
my %dimm_ue_count = ();
my @layers;
my @max_pos;
my @max_csrow;
my $item_size;

my $prog = basename $0;
$conf{labeldb} = "$sysconfdir/ras/dimm_labels.db";
$conf{labeldir} = "$sysconfdir/ras/dimm_labels.d";
$conf{mbconfig} = "$sysconfdir/ras/mainboard";

my $status = 0;

my $usage = <<EOF;
Usage: $prog [OPTIONS...]
 --quiet            Quiet operation.
 --mainboard        Print mainboard vendor and model for this hardware.
 --status           Print status of EDAC drivers.
 --print-labels     Print motherboard DIMM labels to stdout.
 --guess-labels     Print DMI labels, when bank locator is available.
 --register-labels  Load motherboard DIMM labels into EDAC driver.
 --delay=N          Delay N seconds before writing DIMM labels.
 --labeldb=DB       Load label database from file DB.
 --layout           Display the memory layout.
 --summary          Presents a summary of the logged errors.
 --errors           Shows the errors stored in the error database.
 --error-count      Shows the corrected and uncorrected error counts using sysfs.
 --since=YYYY-MM-DD Only include events since the date YYYY-MM-DD.
 --vendor-errors-summary <platform-id>  Presents a summary of the vendor-specific logged errors.
 --vendor-errors <platform-id>  Shows the vendor-specific errors stored in the error database.
 --vendor-errors <platform-id> <module-name>  Shows the vendor-specific errors for a specific module stored in the error database.
 --vendor-platforms List the supported platforms with platform-ids for the vendor-specific errors.
 --help             This help message.
EOF

parse_cmdline();

if ( $conf{opt}{mainboard} || $conf{opt}{print_labels}
    || $conf{opt}{register_labels} || $conf{opt}{display_memory_layout}
    || $conf{opt}{guess_dimm_label} || $conf{opt}{error_count}) {

    get_mainboard_info();

    if ($conf{opt}{mainboard} eq "report") {
        print "$prog: mainboard: ",
              "$conf{mainboard}{vendor} model $conf{mainboard}{model}\n";
    }

    if ($conf{opt}{print_labels}) {
        print_dimm_labels ();
    }
    if ($conf{opt}{register_labels}) {
        register_dimm_labels ();
    }
    if ($conf{opt}{display_memory_layout}) {
        display_memory_layout ();
    }
    if ($conf{opt}{guess_dimm_label}) {
        guess_dimm_label ();
    }
    if ($conf{opt}{error_count}) {
        display_error_count ();
    }
}

if ($conf{opt}{status}) {
    $status = print_status ();
    exit ($status ? 0 : 1);
}

if ($conf{opt}{summary}) {
    summary ();
}

if ($conf{opt}{errors}) {
    errors ();
}

if ($conf{opt}{vendor_errors_summary}) {
    vendor_errors_summary ();
}

if ($conf{opt}{vendor_errors}) {
    vendor_errors ();
}

if ($conf{opt}{vendor_platforms}) {
    vendor_platforms ();
}

exit (0);

sub parse_cmdline
{
    $conf{opt}{mainboard} = '';
    $conf{opt}{print_labels} = 0;
    $conf{opt}{register_labels} = 0;
    $conf{opt}{status} = 0;
    $conf{opt}{quiet} = 0;
    $conf{opt}{delay} = 0;
    $conf{opt}{display_memory_layout} = 0;
    $conf{opt}{guess_dimm_label} = 0;
    $conf{opt}{summary} = 0;
    $conf{opt}{errors} = 0;
    $conf{opt}{error_count} = 0;
    $conf{opt}{vendor_errors_summary} = 0;
    $conf{opt}{vendor_errors} = 0;
    $conf{opt}{since} = '';
    $conf{opt}{vendor_platforms} = 0;

    my $rref = \$conf{opt}{report};
    my $mref = \$conf{opt}{mainboard};

    Getopt::Long::Configure ("bundling");
    my $rc = GetOptions ("mainboard:s" => sub { $$mref = $_[1]||"report" },
                         "help" => sub {usage (0)},
                         "quiet" => \$conf{opt}{quiet},
                         "print-labels" => \$conf{opt}{print_labels},
                         "guess-labels" => \$conf{opt}{guess_dimm_label},
                         "register-labels" => \$conf{opt}{register_labels},
                         "delay:s" => \$conf{opt}{delay},
                         "labeldb=s" => \$conf{labeldb},
                         "status" => \$conf{opt}{status},
                         "layout" => \$conf{opt}{display_memory_layout},
                         "summary" => \$conf{opt}{summary},
                         "errors" => \$conf{opt}{errors},
                         "error-count" => \$conf{opt}{error_count},
                         "vendor-errors-summary" => \$conf{opt}{vendor_errors_summary},
                         "vendor-errors" => \$conf{opt}{vendor_errors},
                         "since=s" => \$conf{opt}{since},
                         "vendor-platforms" => \$conf{opt}{vendor_platforms},
        );

    usage(1) if !$rc;

    usage (0) if !grep $conf{opt}{$_}, keys %{$conf{opt}};

    if ($conf{opt}{delay} && !$conf{opt}{register_labels}) {
        log_error ("Only use --delay with --register-labels\n");
        exit (1);
    }

    if ($conf{opt}{since}) {
        if ($conf{opt}{since} !~ /^20\d\d-[01]\d-[0-3]\d/) {
            log_error ("--since requires a date like yyyy-mm-dd where yyyy is the year, mm the month, and dd the day\n");
            exit (1);
        }
        $conf{opt}{since} = " where timestamp>='$conf{opt}{since}'";
    }
}

sub usage
{
    my ($rc) = @_;
    print "$usage\n";
    exit ($rc);
}

sub run_cmd
{
    my @args = @_;
    system ("@args");
    return ($?>>8);
}

sub print_status
{
    my $status = 0;
    open (MODULES, "/proc/modules")
        or die "Unable to open /proc/modules: $!\n";

    while (<MODULES>) {
        $status = 1 if /_edac/;
    }

    print "$prog: drivers ", ($status ? "are" : "not"), " loaded.\n"
        unless $conf{opt}{quiet};

    return ($status);
}

sub parse_dimm_nodes
{
    my $file = $File::Find::name;

    if (($file =~ /max_location$/)) {
        open IN, $file;
        my $location = <IN>;
        $location =~ s/\s+$//;
        close IN;
        my @temp = split(/ /, $location);

        $layers[0] = "mc";

        if (m,/mc/mc(\d+),) {
            $max_pos[0] = $1 if (!exists($max_pos[0]) || $1 > $max_pos[0]);
        } else {
            $max_pos[0] = 0 if (!exists($max_pos[0]));
        }
        for (my $i = 0; $i < scalar(@temp); $i += 2) {
            $layers[$i / 2 + 1] = $temp[$i];
            $max_pos[$i / 2 + 1] = $temp[$i + 1];
        }

        return;
    }
    if ($file =~ /size_mb$/) {
        my $mc = $file;
        $mc =~ s,.*mc(\d+).*,$1,;

        my $csrow = $file;
        $csrow =~ s,.*csrow(\d+).*,$1,;

        open IN, $file;
        my $size = <IN>;
        close IN;

        my $str_loc = join(':', $mc, $csrow);
        $csrow_size{$str_loc} = $size;

        return;
    }
    if ($file =~ /location$/) {
        my $mc = $file;
        $mc =~ s,.*mc(\d+).*,$1,;

        my $dimm = $file;
        $dimm =~ s,.*(rank|dimm)(\d+).*,$2,;

        open IN, $file;
        my $location = <IN>;
        $location =~ s/\s+$//;
        close IN;

        my @pos;

        # Get the name of the hierarchy labels
        if (!@layers) {
            my @temp = split(/ /, $location);
            $max_pos[0] = 0;
            $layers[0] = "mc";
            for (my $i = 0; $i < scalar(@temp); $i += 2) {
                $layers[$i / 2 + 1] = $temp[$i];
                $max_pos[$i / 2 + 1] = 0;
            }
        }

        my @temp = split(/ /, $location);
        for (my $i = 1; $i < scalar(@temp); $i += 2) {
            $pos[$i / 2] = $temp[$i];

            if ($pos[$i / 2] > $max_pos[$i / 2 + 1]) {
                $max_pos[$i / 2 + 1] = $pos[$i / 2];
            }
        }
        if ($mc > $max_pos[0]) {
            $max_pos[0] = $mc;
        }

        # Get DIMM size

        $file =~ s/dimm_location/size/;
        open IN, $file;
        my $size = <IN>;
        close IN;

        my $str_loc = join(':', $mc, @pos);
        $dimm_size{$str_loc} = $size;
        $dimm_node{$str_loc} = $dimm;

        $file =~ s/size/dimm_label/;
        $dimm_label_file{$str_loc} = $file;
        $dimm_location{$str_loc} = $location;

        my $count;

        $file =~s/dimm_label/dimm_ce_count/;
        if (-e $file) {
            open IN, $file;
            chomp($count = <IN>);
            close IN;
        } else {
            log_error ("dimm_ce_count not found in sysfs. Old kernel?\n");
            exit -1;
        }
        $dimm_ce_count{$str_loc} = $count;

        $file =~s/dimm_ce_count/dimm_ue_count/;
        if (-e $file) {
            open IN, $file;
            chomp($count = <IN>);
            close IN;
        } else {
            log_error ("dimm_ue_count not found in sysfs. Old kernel?\n");
            exit -1;
        }
        $dimm_ue_count{$str_loc} = $count;

        return;
    }
}

sub guess_product {
    my $pvendor = undef;
    my $pname = undef;

    if (open (VENDOR, "/sys/class/dmi/id/product_vendor")) {
        $pvendor = <VENDOR>;
        close VENDOR;
        chomp($pvendor);
    }
    if (open (NAME, "/sys/class/dmi/id/product_name")) {
        $pname = <NAME>;
        close NAME;
        chomp($pname);
    }

    return ($pvendor, $pname);
}

sub get_mainboard_info {
    my ($vendor, $model);
    my ($pvendor, $pname);

    if ($conf{opt}{mainboard} && $conf{opt}{mainboard} ne "report") {
        ($vendor, $model) = split (/[: ]/, $conf{opt}{mainboard}, 2);
    }

    if (!$vendor || !$model) {
        ($vendor, $model) = guess_vendor_model ();
    }

    $conf{mainboard}{vendor} = $vendor;
    $conf{mainboard}{model} = $model;

    ($pvendor, $pname) = guess_product ();
    # since product vendor is rare, use mainboard's vendor
    if ($pvendor) {
        $conf{mainboard}{product_vendor} = $pvendor;
    } else {
        $conf{mainboard}{product_vendor} = $vendor;
    }
    $conf{mainboard}{product_name} = $pname if $pname;
}

sub guess_vendor_model_dmidecode {
    my ($vendor, $model);
    my ($system_vendor, $system_model);
    my $line = 0;

    $< == 0 || die "Must be root to run dmidecode\n";

    open (DMI, "$dmidecode |") or die "failed to run $dmidecode: $!\n";

    $vendor = $model = "";

  LINE:
    while (<DMI>) {
        $line++;

        /^(\s*)(board|base board|system) information/i || next LINE;
        my $indent = $1;
        my $type = $2;

        while ( <DMI> ) {
            /^(\s*)/;
            $1 lt $indent && last LINE;
            $indent = $1;
            if ($type eq "system") {
                /(?:manufacturer|vendor):\s*(.*\S)\s*/i && ( $system_vendor = $1 );
                /product(?: name)?:\s*(.*\S)\s*/i && ( $system_model = $1 );
            } else {
                /(?:manufacturer|vendor):\s*(.*\S)\s*/i && ( $vendor = $1 );
                /product(?: name)?:\s*(.*\S)\s*/i && ( $model = $1 );
            }
            last LINE if ($vendor && $model);
        }
    }
    close (DMI);

    $vendor = $system_vendor if ($vendor eq "");
    $model = $system_model if ($model eq "");

    return ($vendor, $model);
}

sub guess_vendor_model_sysfs {
    #
    # Try to look up DMI information in sysfs
    #
    open (VENDOR, "/sys/class/dmi/id/board_vendor") or return undef;
    open (MODEL, "/sys/class/dmi/id/board_name") or return undef;

    my ($vendor, $model) = (<VENDOR>, <MODEL>);

    close (VENDOR);
    close (MODEL);

    return undef unless ($vendor && $model);

    chomp ($vendor, $model);

    return ($vendor, $model);
}

sub parse_mainboard_config
{
    my ($file) = @_;
    my %hash = ();
    my $line = 0;

    open (CFG, "$file") or die "Failed to read mainboard config: $file: $!\n";
    while (<CFG>) {
        $line++;
        chomp;                                          # remove newline
        s/^((?:[^'"#]*(?:(['"])[^\2]*\2)*)*)#.*/$1/;    # remove comments
        s/^\s+//;                                       # remove leading space
        s/\s+$//;                                       # remove trailing space
        next unless length;                             # skip blank lines
        if (my ($key, $val) = /^\s*([-\w]+)\s*=\s*(.*)/) {
            $hash{$key}{val} = $val;
            $hash{$key}{line} = $line;
            next;
        }
        return undef;
    }
    close (CFG) or &log_error ("close $file: $!\n");
    return \%hash;
}

sub guess_vendor_model {
    my ($vendor, $model);
    #
    # If mainboard config file exists then parse it
    # to get the vendor and model information.
    #
    if (-f $conf{mbconfig} ) {
        my $cfg = &parse_mainboard_config ($conf{mbconfig});

        # If mainboard config file specified a script, then try to
        # run the specified script or executable:
        #
        if ($cfg->{"script"}) {
            $cfg = &parse_mainboard_config ("$cfg->{script}{val} |");
            die "Failed to run mainboard script\n" if (!$cfg);
        }
        return ($cfg->{vendor}{val}, $cfg->{model}{val});
    }

    ($vendor, $model) = &guess_vendor_model_sysfs ();

    return ($vendor, $model) if ($vendor && $model);

    return (&guess_vendor_model_dmidecode ());
}

sub guess_dimm_label {
    open (DMI, "$dmidecode |") or die "failed to run $dmidecode: $!\n";

  LINE:
    while (<DMI>) {
        /^(\s*)memory device$/i || next LINE;
        my ($dimm_label, $dimm_addr);

        while (<DMI>) {
            if (/^\s*(locator|bank locator)/i) {
                my $indent = $1;
                $indent =~ tr/A-Z/a-z/;
                if ($indent eq "locator") {
                    /(?:locator):\s*(.*\S)\s*/i && ( $dimm_label = $1 );
                }
                if ($indent eq "bank locator") {
                    /(?:bank locator):\s*(.*\S)\s*/i && ( $dimm_addr = $1 );
                }
            }
            if ($dimm_label && $dimm_addr) {
                printf "memory stick '%s' is located at '%s'\n",
                    $dimm_label, $dimm_addr;
                next LINE;
            }
            next LINE if (/^\s*\n/);
        }
    }

    close (DMI);
}

sub parse_dimm_labels_file
{
    my ($lh, $num_layers, $lh_prod, $num_layers_prod, $file) = (@_);
    my $line = -1;
    my $vendor = "";
    my @models = ();
    my @products = ();
    my $num;

    open (LABELS, "$file")
        or die "Unable to open label database: $file: $!\n";

    while (<LABELS>) {
        $line++;

        next if /^#/;

        chomp;

        s/^\s+//;
        s/\s+$//;

        next unless length;

        if (/vendor\s*:\s*(.*\S)\s*/i) {
            $vendor = lc $1;
            @models = ();
            @products = ();
            $num = 0;
            next;
        }
        if (/(model|board)\s*:\s*(.*)$/i) {
            !$vendor && die "$file: line $line: MB model without vendor\n";
            @models = grep { s/\s*(.*)\s*$/$1/ } split(/[,;]+/, $2);
            @products = ();
            $num = 0;
            next;
        }
        if (/(product)\s*:\s*(.*)$/i) {
            !$vendor && die "$file: line $line: product without vendor\n";
            @models = ();
            @products = grep { s/\s*(.*)\s*$/$1/ } split(/[,;]+/, $2);
            $num = 0;
            next;
        }

        # Allow multiple labels to be specified on a single line,
        # separated by ;
        for my $str (split /;/) {
            $str =~ s/^\s*(.*)\s*$/$1/;

            next unless (my ($label, $info) = ($str =~ /^(.*)\s*:\s*(.*)$/i));

            unless ($info =~ /\d+(?:[\.\:]\d+)*/) {
                log_error ("$file: $line: Invalid syntax, ignoring: \"$_\"\n");
                next;
            }

            for my $target (split (/[, ]+/, $info)) {
                my $n;
                my ($mc, $top, $mid, $low, $extra) = ($target =~ /(\d+)(?:[\.\:](\d+)){0,1}(?:[\.\:](\d+)){0,1}(?:[\.\:](\d+)){0,1}(?:[\.\:](\d+)){0,1}/);

                if (defined($extra)) {
                    die ("Error: Only up to 3 layers are currently supported on label db \"$file\"\n");
                    return;
                } elsif (!defined($top)) {
                    die ("Error: The label db \"$file\" is defining a zero-layers machine\n");
                    return;
                } else {
                    $n = 3;
                    if (!defined($low)) {
                        $low = 0;
                        $n--;
                    }
                    if (!defined($mid)) {
                        $mid = 0;
                        $n--;
                    }
                    map { $lh->{$vendor}{lc $_}{$mc}{$top}{$mid}{$low} = $label } @models;
                    map { $lh_prod->{$vendor}{lc $_}{$mc}{$top}{$mid}{$low} = $label } @products;
                }
                if (!$num) {
                    $num = $n;
                    map { $num_layers->{$vendor}{lc $_} = $num } @models;
                    map { $num_layers_prod->{$vendor}{lc $_} = $num } @products;
                } elsif ($num != $n) {
                    die ("Error: Inconsistent number of layers at label db \"$file\"\n");
                }
            }
        }
    }

    close (LABELS) or die "Error from label db \"$file\" : $!\n";
}

sub parse_dimm_labels
{
    my %labels = ();
    my %num_layers = ();
    my %labels_prod = ();
    my %num_layers_prod = ();

    #
    # Accrue all DIMM labels from the labels.db file, as
    # well as any files under the labels dir
    #
    for my $file ($conf{labeldb}, <$conf{labeldir}/*>) {
        next unless -r $file;
        parse_dimm_labels_file (\%labels, \%num_layers,
                                \%labels_prod, \%num_layers_prod, $file);
    }

    return (\%labels, \%num_layers, \%labels_prod, \%num_layers_prod);
}

sub read_dimm_label
{
    my ($num_layers, $mc, $top, $mid, $low) = @_;
    my $sysfs = "/sys/devices/system/edac/mc";
    my $pos;

    $pos = "$mc:$top:$mid:$low" if ($num_layers == 3);
    $pos = "$mc:$top:$mid" if ($num_layers == 2);
    $pos = "$mc:$top" if ($num_layers == 1);

    if (!defined($dimm_node{$pos})) {
        my $label = "$pos missing";
        $pos = "";
        return ($label, $pos);
    }

    my $dimm = $dimm_node{$pos};

    my $dimm_label_file = $dimm_label_file{$pos};

    my $location = $dimm_location{$pos};

    return ("label missing", "$pos missing") unless -f $dimm_label_file;

    if (!open (LABEL, "$dimm_label_file")) {
        warn "Failed to open $dimm_label_file: $!\n";
        return ("Error");
    }

    chomp (my $label = <LABEL> || "");

    close (LABEL);

    $pos = "mc$mc $location";

    return ($label, $pos);
}

sub get_dimm_label_node
{
    my ($num_layers, $mc, $top, $mid, $low) = @_;
    my $sysfs = "/sys/devices/system/edac/mc";
    my $pos = "$mc:$top:$mid:$low";

    $pos = "$mc:$top:$mid:$low" if ($num_layers == 3);
    $pos = "$mc:$top:$mid" if ($num_layers == 2);
    $pos = "$mc:$top" if ($num_layers == 1);

    return "" if (!defined($dimm_node{$pos}));

    return "$dimm_label_file{$pos}";
}

sub _print_dimm_labels
{
    my ($lref, $num_layers, $vendor, $model, $fh, $format) = @_;

    for my $mc (sort keys %{$$lref{$vendor}{$model}}) {
        for my $top (sort keys %{$$lref{$vendor}{$model}{$mc}}) {
            for my $mid (sort keys %{$$lref{$vendor}{$model}{$mc}{$top}}) {
                for my $low (sort keys %{$$lref{$vendor}{$model}{$mc}{$top}{$mid}}) {
                    my $label = $$lref{$vendor}{$model}{$mc}{$top}{$mid}{$low};
                    my ($rlabel,$loc) = read_dimm_label ($$num_layers{$vendor}{$model}, $mc, $top, $mid, $low);

                    printf $fh $format, $loc, $label, $rlabel;
                }
            }
        }
    }
    print $fh "\n";
}

sub print_dimm_labels
{
    my $fh = shift || *STDOUT;
    my ($lref, $num_layers, $lref_prod, $num_layers_prod) = parse_dimm_labels ();
    my $vendor = lc $conf{mainboard}{vendor};
    my $model = lc $conf{mainboard}{model};
    my $pvendor = lc $conf{mainboard}{product_vendor};
    my $pname = lc $conf{mainboard}{product_name};
    my $format = "%-35s %-20s %-20s\n";

    if (!exists $$lref{$vendor}{$model} && !exists $$lref_prod{$pvendor}{$pname}) {
        log_error ("No dimm labels for $conf{mainboard}{vendor} " .
                   "model $conf{mainboard}{model}\n");
        return;
    }

    my $sysfs_dir = "/sys/devices/system/edac/mc";

    find({wanted => \&parse_dimm_nodes, no_chdir => 1}, $sysfs_dir);

    printf $fh $format, "LOCATION", "CONFIGURED LABEL", "SYSFS CONTENTS";

    if (exists $$lref{$vendor}{$model}) {
        _print_dimm_labels($lref, $num_layers, $vendor, $model, $fh, $format);
    } elsif (exists $$lref_prod{$pvendor}{$pname}) {
        _print_dimm_labels($lref_prod, $num_layers_prod, $pvendor, $pname, $fh, $format);
    }
}

sub write_dimm_labels
{
    my ($lref, $num_layers, $vendor, $model) = @_;

    for my $mc (sort keys %{$$lref{$vendor}{$model}}) {
        for my $top (sort keys %{$$lref{$vendor}{$model}{$mc}}) {
            for my $mid (sort keys %{$$lref{$vendor}{$model}{$mc}{$top}}) {
                for my $low (sort keys %{$$lref{$vendor}{$model}{$mc}{$top}{$mid}}) {

                    my $file = get_dimm_label_node($$num_layers{$vendor}{$model}, $mc, $top, $mid, $low);

                    # Ignore sysfs files that don't exist. Might just be
                    # unpopulated bank.
                    next unless -f $file;

                    if (!open (DL, ">$file")) {
                        warn ("Unable to open $file\n");
                        next;
                    }

                    syswrite DL, $$lref{$vendor}{$model}{$mc}{$top}{$mid}{$low};

                    close (DL);
                }
            }
        }
    }
}

sub register_dimm_labels
{
    my ($lref, $num_layers, $lref_prod, $num_layers_prod) = parse_dimm_labels ();
    my $vendor = lc $conf{mainboard}{vendor};
    my $model = lc $conf{mainboard}{model};
    my $pvendor = lc $conf{mainboard}{product_vendor};
    my $pname = lc $conf{mainboard}{product_name};
    my $sysfs = "/sys/devices/system/edac/mc";

    if (!exists $$lref{$vendor}{$model} && !exists $$lref_prod{$pvendor}{$pname}) {
        log_error ("No dimm labels for $conf{mainboard}{vendor} " .
                   "model $conf{mainboard}{model}\n");
        return 0;
    }
    my $sysfs_dir = "/sys/devices/system/edac/mc";

    find({wanted => \&parse_dimm_nodes, no_chdir => 1}, $sysfs_dir);

    select (undef, undef, undef, $conf{opt}{delay});

    if (exists $$lref{$vendor}{$model}) {
        write_dimm_labels($lref, $num_layers, $vendor, $model);
    } else {
        write_dimm_labels($lref_prod, $num_layers_prod, $pvendor, $pname);
    }

    return 1;
}

sub dimm_display_layer_rev($@);

sub dimm_display_layer_rev($@)
{
    my $layer = shift;
    my @pos = @_;

    $layer++;
    if ($layer >= scalar(@pos) - 1) {
        my $str_loc = join(':', @pos);
        my $size = $dimm_size{$str_loc};
        if (!$size) {
            $size = 0;
        }
        my $s = sprintf " %4i MB |", $size;
        $item_size = length($s);
        return $s;
    }

    my $s;
    for (my $i = 0; $i <= $max_pos[$layer]; $i++) {
        $pos[$layer] = $i;
        $s .= dimm_display_layer_rev($layer, @pos);
    }

    return $s;
}

sub dimm_display_layer(@)
{
    my @pos = @_;

    my $s;
    for (my $i = 0; $i <= $max_pos[0]; $i++) {
        $pos[0] = $i;
        $s .= dimm_display_layer_rev(0, @pos);
    }

    return $s;
}

sub dimm_display_layer_header($$)
{
    my $n_items = 1;
    my $scale;
    my $layer = shift;
    my $tot_items = shift;
    my $s;

    for (my $i = 0; $i <= $layer; $i++) {
        $n_items *= $max_pos[$i] + 1;
    }
    $scale = $tot_items / $n_items;

    my $d = 0;
    for (my $i = 0; $i < $n_items; $i++) {
        my $val = sprintf("%s%d", $layers[$layer], $d);
        $val = substr($val, 0, $scale * $item_size - 2);
        my $fillsize = $scale * $item_size - 1 - length($val);
        $s .= "|";
        $s .= " " x ($fillsize / 2);
        $s .= $val;
        $s .= " " x ($fillsize - floor($fillsize / 2));
        $d++;
        if ($d > $max_pos[$layer]) {
            $d = 0;
        }
    }
    $s .= "|";
    return $s;
}

sub dimm_display_mem()
{
    my @pos = @max_pos;
    my $sep = "";
    my $tot_items = 1;
    my $first = 1;

    for (my $i = 0; $i < scalar(@pos) - 1; $i++) {
        $pos[$i] = 0;
        $tot_items *= $max_pos[$i] + 1;
    }

    my $is_even = $max_pos[scalar(@max_pos) - 1] % 2;

    for (my $d = $max_pos[scalar(@max_pos) - 1]; $d >= 0; $d--) {
        my $len;

        my $s = sprintf("%s%d: |", $layers[scalar(@max_pos) - 1], $d);
        my $p1 = length($s) - 1;

        $pos[scalar(@pos) - 1] = $d;
        $s .= dimm_display_layer(@pos);
        $len += length($s);

        $sep = "-" x $p1;
        $sep .= "+";
        $sep .= "-" x ($len - $p1 - 2);
        $sep .= "+";

        if ($first) {
            my $sep1 = " " x $p1;
            $sep1 .= "+";
            $sep1 .= "-" x ($len - $p1 - 2);
            $sep1 .= "+";
            printf "$sep1\n";
            for (my $layer = 0; $layer < scalar(@pos) - 1; $layer++) {
                my $s = sprintf("%s%d: |", $layers[scalar(@max_pos) - 1], 0);
                my $p1 = length($s) - 1;
                my $msg = " " x $p1;
                $msg .= dimm_display_layer_header($layer, $tot_items);
                printf "$msg\n";
            }
            printf "$sep\n" if (!$is_even);
            $first = 0;
        }

        if ($is_even && (($max_pos[scalar(@max_pos) - 1] - $d) % 2 == 0)) {
            printf "$sep\n";
        }

        printf "$s\n";
    }
    printf "$sep\n";
}

sub fill_csrow_size()
{
    foreach my $str_loc (keys %rank_size) {
        my @temp = split(/:/, $str_loc);
        my $csrow = join(':', $temp[0], $temp[1]);
        if ($csrow_ranks{$csrow}) {
            $rank_size{$str_loc} = $csrow_size{$csrow} / $csrow_ranks{$csrow};
        }
    }
}

sub display_memory_layout
{
    my $sysfs_dir = "/sys/devices/system/edac/mc";

    find({wanted => \&parse_dimm_nodes, no_chdir => 1}, $sysfs_dir);

    if (!scalar(%csrow_size)) {
        log_error ("No memories found via EDAC.\n");
        exit -1;
    } elsif (!scalar(%dimm_size)) {
        fill_csrow_size;
        $layers[0] = "mc";
        $layers[1] = "csrow";
        $layers[2] = "channel";
        @max_pos = @max_csrow;
        %dimm_size = %rank_size;
    }
    dimm_display_mem();
}

sub display_error_count
{
    my $sysfs_dir = "/sys/devices/system/edac/mc";
    my $key;
    my $max_width = 0;
    my %dimm_labels = ();

    find ({wanted => \&parse_dimm_nodes, no_chdir => 1}, $sysfs_dir);

    if (!scalar(keys %dimm_node)) {
        log_error ("No DIMMs found in /sys or new sysfs EDAC interface not found.\n");
        exit -1;
    }

    foreach $key (keys %dimm_node) {
        my $label_width;

        open IN, $dimm_label_file{$key};
        chomp(my $label = <IN>);
        close IN;
        $label_width = length $label;

        if ($label_width > $max_width) {
            $max_width = $label_width;
        }
        $dimm_labels{$key} = $label;
    }
    my $string = "Label";
    $string .= " " x ($max_width - length $string);
    print($string . "\tCE\tUE\n");

    foreach $key (keys %dimm_node) {
        my $ce_count = $dimm_ce_count{$key};
        my $ue_count = $dimm_ue_count{$key};

        print("$dimm_labels{$key}\t$ce_count\t$ue_count\n");
    }
}

sub find_prog
{
    my ($file) = @_;
    for my $dir ("/sbin", "/usr/sbin", split ':', $ENV{PATH}) {
        return "$dir/$file" if -x "$dir/$file";
    }
    # log_error ("Failed to find $file in PATH\n");
    return "";
}

sub get_extlog_type
{
    my @types;

    if ($_[0] < 0 || $_[0] > 15) {
        return "unknown-type";
    }

    @types = ("unknown",
              "no error",
              "single-bit ECC",
              "multi-bit ECC",
              "single-symbol chipkill ECC",
              "multi-symbol chipkill ECC",
              "master abort",
              "target abort",
              "parity error",
              "watchdog timeout",
              "invalid address",
              "mirror Broken",
              "memory sparing",
              "scrub corrected error",
              "scrub uncorrected error",
              "physical memory map-out event",
              "unknown-type");
    return $types[$_[0]];
}

sub get_extlog_severity
{
    my @sev;

    if ($_[0] < 0 || $_[0] > 3) {
        return "unknown-severity";
    }

    @sev = ("recoverable",
            "fatal",
            "corrected",
            "informational",
            "unknown-severity");
    return $sev[$_[0]];
}

use constant {
    CPER_MEM_VALID_NODE => 0x0008,
    CPER_MEM_VALID_CARD => 0x0010,
    CPER_MEM_VALID_MODULE => 0x0020,
    CPER_MEM_VALID_BANK => 0x0040,
    CPER_MEM_VALID_DEVICE => 0x0080,
    CPER_MEM_VALID_ROW => 0x0100,
    CPER_MEM_VALID_COLUMN => 0x0200,
    CPER_MEM_VALID_BIT_POSITION => 0x0400,
    CPER_MEM_VALID_REQUESTOR_ID => 0x0800,
    CPER_MEM_VALID_RESPONDER_ID => 0x1000,
    CPER_MEM_VALID_TARGET_ID => 0x2000,
    CPER_MEM_VALID_ERROR_TYPE => 0x4000,
    CPER_MEM_VALID_RANK_NUMBER => 0x8000,
    CPER_MEM_VALID_CARD_HANDLE => 0x10000,
    CPER_MEM_VALID_MODULE_HANDLE => 0x20000,
};

sub get_cper_data_text
{
    my $cper_data = $_[0];
    my ($validation_bits, $node, $card, $module, $bank, $device, $row, $column, $bit_pos, $requestor_id, $responder_id, $target_id, $rank, $mem_array_handle, $mem_dev_handle) = unpack 'QSSSSSSSSQQQSSS', $cper_data;
    my @out;

    if ($validation_bits & CPER_MEM_VALID_NODE) {
        push @out, (sprintf "node=%d", $node);
    }
    if ($validation_bits & CPER_MEM_VALID_CARD) {
        push @out, (sprintf "card=%d", $card);
    }
    if ($validation_bits & CPER_MEM_VALID_MODULE) {
        push @out, (sprintf "module=%d", $module);
    }
    if ($validation_bits & CPER_MEM_VALID_BANK) {
        push @out, (sprintf "bank=%d", $bank);
    }
    if ($validation_bits & CPER_MEM_VALID_DEVICE) {
        push @out, (sprintf "device=%d", $device);
    }
    if ($validation_bits & CPER_MEM_VALID_ROW) {
        push @out, (sprintf "row=%d", $row);
    }
    if ($validation_bits & CPER_MEM_VALID_COLUMN) {
        push @out, (sprintf "column=%d", $column);
    }
    if ($validation_bits & CPER_MEM_VALID_BIT_POSITION) {
        push @out, (sprintf "bit_position=%d", $bit_pos);
    }
    if ($validation_bits & CPER_MEM_VALID_REQUESTOR_ID) {
        push @out, (sprintf "0x%08x", $requestor_id);
    }
    if ($validation_bits & CPER_MEM_VALID_RESPONDER_ID) {
        push @out, (sprintf "0x%08x", $responder_id);
    }
    if ($validation_bits & CPER_MEM_VALID_TARGET_ID) {
        push @out, (sprintf "0x%08x", $target_id);
    }
    if ($validation_bits & CPER_MEM_VALID_RANK_NUMBER) {
        push @out, (sprintf "rank=%d", $rank);
    }
    if ($validation_bits & CPER_MEM_VALID_CARD_HANDLE) {
        push @out, (sprintf "mem_array_handle=%d", $mem_array_handle);
    }
    if ($validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
        push @out, (sprintf "mem_dev_handle=%d", $mem_dev_handle);
    }

    return join (", ", @out);
}

sub get_uuid_le
{
    my $out = "";
    my @bytes = unpack "C*", $_[0];
    my @le16_table = (3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15);

    for (my $i = 0; $i < 16; $i++) {
        $out .= sprintf "%.2x", $bytes[$le16_table[$i]];
        if ($i == 3 or $i == 5 or $i == 7 or $i == 9) {
            $out .= "-";
        }
    }
    return $out;
}

sub summary
{
    require DBI;
    my ($query, $query_handle, $out);
    my ($err_type, $label, $mc, $top, $mid, $low, $count, $msg, $action_result);
    my ($etype, $severity, $etype_string, $severity_string);
    my ($dev_name, $dev);
    my ($mpidr);

    my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});

    # Memory controller mc_event errors
    $query = "select err_type, label, mc, top_layer,middle_layer,lower_layer, count(*) from mc_event$conf{opt}{since} group by
err_type, label, mc, top_layer, middle_layer, lower_layer"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($err_type, $label, $mc, $top, $mid, $low, $count)); $out = ""; while($query_handle->fetch()) { $out .= "\t$err_type on DIMM Label(s): '$label' location: $mc:$top:$mid:$low errors: $count\n"; } if ($out ne "") { print "Memory controller events summary:\n$out\n"; } else { print "No Memory errors.\n\n"; } $query_handle->finish; # PCIe AER aer_event errors if ($has_aer == 1) { $query = "select err_type, err_msg, count(*) from aer_event$conf{opt}{since} group by err_type, err_msg"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($err_type, $msg, $count)); $out = ""; while($query_handle->fetch()) { $out .= "\t$count $err_type errors: $msg\n"; } if ($out ne "") { print "PCIe AER events summary:\n$out\n"; } else { print "No PCIe AER errors.\n\n"; } $query_handle->finish; } # ARM processor arm_event errors if ($has_arm == 1) { $query = "select mpidr, count(*) from arm_event$conf{opt}{since} group by mpidr"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($mpidr, $count)); $out = ""; while($query_handle->fetch()) { $out .= sprintf "\tCPU(mpidr=0x%x) has %d errors\n", $mpidr, $count; } if ($out ne "") { print "ARM processor events summary:\n$out\n"; } else { print "No ARM processor errors.\n\n"; } $query_handle->finish; } # extlog errors if ($has_extlog == 1) { $query = "select etype, severity, count(*) from extlog_event$conf{opt}{since} group by etype, severity"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($etype, $severity, $count)); $out = ""; while($query_handle->fetch()) { $etype_string = get_extlog_type($etype); $severity_string = get_extlog_severity($severity); $out .= "\t$count $etype_string $severity_string errors\n"; } if ($out ne "") { print "Extlog records 
summary:\n$out"; } else { print "No Extlog errors.\n\n"; } $query_handle->finish; } # devlink errors if ($has_devlink == 1) { $query = "select dev_name, count(*) from devlink_event$conf{opt}{since} group by dev_name"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($dev_name, $count)); $out = ""; while($query_handle->fetch()) { $out .= "\t$dev_name has $count errors\n"; } if ($out ne "") { print "Devlink records summary:\n$out"; } else { print "No devlink errors.\n"; } $query_handle->finish; } # Disk errors if ($has_disk_errors == 1) { $query = "select dev, count(*) from disk_errors$conf{opt}{since} group by dev"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($dev, $count)); $out = ""; while($query_handle->fetch()) { $out .= "\t$dev has $count errors\n"; } if ($out ne "") { print "Disk errors summary:\n$out"; } else { print "No disk errors.\n"; } $query_handle->finish; } # Memory failure errors if ($has_mem_failure == 1) { $query = "select action_result, count(*) from memory_failure_event$conf{opt}{since} group by action_result"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($action_result, $count)); $out = ""; while($query_handle->fetch()) { $out .= "\t$action_result errors: $count\n"; } if ($out ne "") { print "Memory failure events summary:\n$out\n"; } else { print "No Memory failure errors.\n\n"; } $query_handle->finish; } # MCE mce_record errors if ($has_mce == 1) { $query = "select error_msg, count(*) from mce_record$conf{opt}{since} group by error_msg"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($msg, $count)); $out = ""; while($query_handle->fetch()) { $out .= "\t$count $msg errors\n"; } if ($out ne "") { print "MCE records summary:\n$out"; } else { print "No MCE errors.\n"; } $query_handle->finish; } undef($dbh); } sub errors { require DBI; my ($query, 
$query_handle, $id, $time, $devname, $count, $type, $msg, $label, $mc, $top, $mid, $low, $addr, $grain, $syndrome, $detail, $out); my ($mcgcap,$mcgstatus, $status, $misc, $ip, $tsc, $walltime, $cpu, $cpuid, $apicid, $socketid, $cs, $bank, $cpuvendor, $bank_name, $mcgstatus_msg, $mcistatus_msg, $user_action, $mc_location); my ($timestamp, $etype, $severity, $etype_string, $severity_string, $fru_id, $fru_text, $cper_data); my ($bus_name, $dev_name, $driver_name, $reporter_name); my ($dev, $sector, $nr_sector, $error, $rwbs, $cmd); my ($error_count, $affinity, $mpidr, $r_state, $psci_state); my ($pfn, $page_type, $action_result); my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); # Memory controller mc_event errors $query = "select id, timestamp, err_count, err_type, err_msg, label, mc, top_layer,middle_layer,lower_layer, address, grain, syndrome, driver_detail from mc_event$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); if (!$query_handle) { log_error ("mc_event table missing from $dbname. 
Run 'rasdaemon --record'.\n"); exit -1 } $query_handle->execute(); $query_handle->bind_columns(\($id, $time, $count, $type, $msg, $label, $mc, $top, $mid, $low, $addr, $grain, $syndrome, $detail)); $out = ""; while($query_handle->fetch()) { $out .= "$id $time $count $type error(s): $msg at $label location: $mc:$top:$mid:$low, addr $addr, grain $grain, syndrome $syndrome $detail\n"; } if ($out ne "") { print "Memory controller events:\n$out\n"; } else { print "No Memory errors.\n\n"; } $query_handle->finish; # PCIe AER aer_event errors if ($has_aer == 1) { $query = "select id, timestamp, dev_name, err_type, err_msg from aer_event$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $time, $devname, $type, $msg)); $out = ""; while($query_handle->fetch()) { $out .= "$id $time $devname $type error: $msg\n"; } if ($out ne "") { print "PCIe AER events:\n$out\n"; } else { print "No PCIe AER errors.\n\n"; } $query_handle->finish; } # ARM processor arm_event errors if ($has_arm == 1) { $query = "select id, timestamp, error_count, affinity, mpidr, running_state, psci_state from arm_event$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $error_count, $affinity, $mpidr, $r_state, $psci_state)); $out = ""; while($query_handle->fetch()) { $out .= "$id $timestamp error: "; $out .= "error_count=$error_count, " if ($error_count); $out .= "affinity_level=$affinity, "; $out .= sprintf "mpidr=0x%x, ", $mpidr; $out .= sprintf "running_state=0x%x, ", $r_state; $out .= sprintf "psci_state=0x%x", $psci_state; $out .= "\n"; } if ($out ne "") { print "ARM processor events:\n$out\n"; } else { print "No ARM processor errors.\n\n"; } $query_handle->finish; } # Extlog errors if ($has_extlog == 1) { $query = "select id, timestamp, etype, severity, address, fru_id, fru_text, cper_data from extlog_event$conf{opt}{since} order by 
id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $etype, $severity, $addr, $fru_id, $fru_text, $cper_data)); $out = ""; while($query_handle->fetch()) { $etype_string = get_extlog_type($etype); $severity_string = get_extlog_severity($severity); $out .= "$id $timestamp error: "; $out .= "type=$etype_string, "; $out .= "severity=$severity_string, "; $out .= sprintf "address=0x%08x, ", $addr; $out .= sprintf "fru_id=%s, ", get_uuid_le($fru_id); $out .= "fru_text='$fru_text', "; $out .= get_cper_data_text($cper_data) if ($cper_data); $out .= "\n"; } if ($out ne "") { print "Extlog events:\n$out\n"; } else { print "No Extlog errors.\n\n"; } $query_handle->finish; } # devlink errors if ($has_devlink == 1) { $query = "select id, timestamp, bus_name, dev_name, driver_name, reporter_name, msg from devlink_event$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $bus_name, $dev_name, $driver_name, $reporter_name, $msg)); $out = ""; while($query_handle->fetch()) { $out .= "$id $timestamp error: "; $out .= "bus_name=$bus_name, "; $out .= "dev_name=$dev_name, "; $out .= "driver_name=$driver_name, "; $out .= "reporter_name=$reporter_name, "; $out .= "message='$msg', "; $out .= "\n"; } if ($out ne "") { print "Devlink events:\n$out\n"; } else { print "No devlink errors.\n\n"; } $query_handle->finish; } # Disk errors if ($has_disk_errors == 1) { $query = "select id, timestamp, dev, sector, nr_sector, error, rwbs, cmd from disk_errors$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $dev, $sector, $nr_sector, $error, $rwbs, $cmd)); $out = ""; while($query_handle->fetch()) { $out .= "$id $timestamp error: "; $out .= "dev=$dev, "; $out .= "sector=$sector, "; $out .= "nr_sector=$nr_sector, "; $out .= "error='$error', "; $out .= 
"rwbs='$rwbs', "; $out .= "cmd='$cmd', "; $out .= "\n"; } if ($out ne "") { print "Disk errors:\n$out\n"; } else { print "No disk errors.\n\n"; } $query_handle->finish; } # Memory failure errors if ($has_mem_failure == 1) { $query = "select id, timestamp, pfn, page_type, action_result from memory_failure_event$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $pfn, $page_type, $action_result)); $out = ""; while($query_handle->fetch()) { $out .= "$id $timestamp error: "; $out .= "pfn=$pfn, page_type=$page_type, action_result=$action_result\n"; } if ($out ne "") { print "Memory failure events:\n$out\n"; } else { print "No Memory failure errors.\n\n"; } $query_handle->finish; } # MCE mce_record errors if ($has_mce == 1) { $query = "select id, timestamp, mcgcap, mcgstatus, status, addr, misc, ip, tsc, walltime, cpu, cpuid, apicid, socketid, cs, bank, cpuvendor, bank_name, error_msg, mcgstatus_msg, mcistatus_msg, user_action, mc_location from mce_record$conf{opt}{since} order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $time, $mcgcap,$mcgstatus, $status, $addr, $misc, $ip, $tsc, $walltime, $cpu, $cpuid, $apicid, $socketid, $cs, $bank, $cpuvendor, $bank_name, $msg, $mcgstatus_msg, $mcistatus_msg, $user_action, $mc_location)); $out = ""; while($query_handle->fetch()) { $out .= "$id $time error: $msg"; $out .= ", CPU $cpuvendor" if ($cpuvendor); $out .= ", bank $bank_name" if ($bank_name); $out .= ", mcg $mcgstatus_msg" if ($mcgstatus_msg); $out .= ", mci $mcistatus_msg" if ($mcistatus_msg); $out .= ", $mc_location" if ($mc_location); $out .= ", $user_action" if ($user_action); $out .= sprintf ", mcgcap=0x%08x", $mcgcap if ($mcgcap); $out .= sprintf ", mcgstatus=0x%08x", $mcgstatus if ($mcgstatus); $out .= sprintf ", status=0x%08x", $status if ($status); $out .= sprintf ", addr=0x%08x", $addr if ($addr); $out .= 
sprintf ", misc=0x%08x", $misc if ($misc); $out .= sprintf ", ip=0x%08x", $ip if ($ip); $out .= sprintf ", tsc=0x%08x", $tsc if ($tsc); $out .= sprintf ", walltime=0x%08x", $walltime if ($walltime); $out .= sprintf ", cpu=0x%08x", $cpu if ($cpu); $out .= sprintf ", cpuid=0x%08x", $cpuid if ($cpuid); $out .= sprintf ", apicid=0x%08x", $apicid if ($apicid); $out .= sprintf ", socketid=0x%08x", $socketid if ($socketid); $out .= sprintf ", cs=0x%08x", $cs if ($cs); $out .= sprintf ", bank=0x%08x", $bank if ($bank); $out .= "\n"; } if ($out ne "") { print "MCE events:\n$out\n"; } else { print "No MCE errors.\n\n"; } $query_handle->finish; } undef($dbh); } # Definitions of the vendor platform IDs. use constant { HISILICON_KUNPENG_9XX => "KunPeng9xx", THEAD_YITIAN_7XX => "YiTian7XX", JM_CORSICA_DPU1XX => "CorsicaDpu1xx", }; sub vendor_errors_summary { require DBI; my ($num_args, $platform_id, $found_platform); my ($query, $query_handle, $count, $out); my ($module_id, $sub_module_id, $err_severity, $err_sev, $subsystem); my ($address); $num_args = $#ARGV + 1; $platform_id = 0; $found_platform = 0; if ($num_args ne 0) { $platform_id = $ARGV[0]; } else { usage(1); return; } my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); # HiSilicon KunPeng9xx errors if ($platform_id eq HISILICON_KUNPENG_9XX) { $found_platform = 1; $query = "select err_severity, module_id, count(*) from hip08_oem_type1_event_v2$conf{opt}{since} group by err_severity, module_id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($err_severity, $module_id, $count)); $out = ""; $err_sev = ""; while($query_handle->fetch()) { if ($err_severity ne $err_sev) { $out .= "$err_severity errors:\n"; $err_sev = $err_severity; } $out .= "\t$module_id: $count\n"; } if ($out ne "") { print "HiSilicon KunPeng9xx OEM type1 error events summary:\n$out\n"; } $query_handle->finish; $query = "select err_severity, module_id, count(*) from 
hip08_oem_type2_event_v2$conf{opt}{since} group by err_severity, module_id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($err_severity, $module_id, $count)); $out = ""; $err_sev = ""; while($query_handle->fetch()) { if ($err_severity ne $err_sev) { $out .= "$err_severity errors:\n"; $err_sev = $err_severity; } $out .= "\t$module_id: $count\n"; } if ($out ne "") { print "HiSilicon KunPeng9xx OEM type2 error events summary:\n$out\n"; } $query_handle->finish; $query = "select err_severity, sub_module_id, count(*) from hip08_pcie_local_event_v2$conf{opt}{since} group by err_severity, sub_module_id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($err_severity, $sub_module_id, $count)); $out = ""; $err_sev = ""; while($query_handle->fetch()) { if ($err_severity ne $err_sev) { $out .= "$err_severity errors:\n"; $err_sev = $err_severity; } $out .= "\t$sub_module_id: $count\n"; } if ($out ne "") { print "HiSilicon KunPeng9xx PCIe controller error events summary:\n$out\n"; } $query_handle->finish; $query = "select err_severity, module_id, count(*) from hisi_common_section_v2$conf{opt}{since} group by err_severity, module_id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($err_severity, $module_id, $count)); $out = ""; $err_sev = ""; while($query_handle->fetch()) { if ($err_severity ne $err_sev) { $out .= "$err_severity errors:\n"; $err_sev = $err_severity; } $out .= "\t$module_id: $count\n"; } if ($out ne "") { print "HiSilicon KunPeng9xx common error events summary:\n$out\n"; } $query_handle->finish; } # THead Yitian710 DDR errors if ($platform_id eq THEAD_YITIAN_7XX) { $found_platform = 1; $query = "select address, count(*) from yitian_ddr_reg_dump_event"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($address, $count)); $out = ""; while($query_handle->fetch()) { $out .= 
"\terrors: $count"; } if ($out ne "") { print "THead YiTian710 DDR error dump events summary:\n$out\n"; } else { print "No THead YiTian710 DDR error dump errors.\n\n"; } $query_handle->finish; } # JaguarMicro CorsicaDpu1xx errors if ($platform_id eq JM_CORSICA_DPU1XX) { $found_platform = 1; $query = "select err_severity, subsystem, count(*) from jm_payload0_event$conf{opt}{since} group by err_severity, subsystem"; $query_handle = $dbh->prepare($query); if ($query_handle) { $query_handle->execute(); $query_handle->bind_columns(\($err_severity, $subsystem, $count)); $out = ""; $err_sev = ""; while($query_handle->fetch()) { if ($err_severity ne $err_sev) { $out .= "$err_severity errors:\n"; $err_sev = $err_severity; } $out .= "\t$subsystem: $count\n"; } if ($out ne "") { print "JaguarMicro CorsicaDpu1xx OEM type0 error events summary:\n$out\n"; } $query_handle->finish; } } if ($platform_id && !($found_platform)) { print "Platform ID $platform_id is not valid\n"; } undef($dbh); } sub vendor_errors { require DBI; my ($num_args, $platform_id, $found_platform, $module, $found_module, $module_name, $sub_module); my ($query, $query_handle, $id, $timestamp, $out); my ($version, $soc_id, $socket_id, $totem_id, $nimbus_id, $sub_system_id, $core_id, $port_id); my ($module_id, $sub_module_id, $err_severity, $err_type, $pcie_info, $regs, $subsystem, $dev, $dev_id); my ($address, $regs_dump); $num_args = $#ARGV + 1; $platform_id = 0; $found_platform = 0; $module = 0; $found_module = 0; if ($num_args ne 0) { $platform_id = $ARGV[0]; if ($num_args gt 1) { $module = $ARGV[1]; } } else { usage(1); return; } my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); # HiSilicon KunPeng9xx errors if ($platform_id eq HISILICON_KUNPENG_9XX) { $found_platform = 1; $query = "select id, timestamp, version, soc_id, socket_id, nimbus_id, module_id, sub_module_id, err_severity, regs_dump from hip08_oem_type1_event_v2$conf{opt}{since} order by id, module_id, err_severity"; $query_handle = 
$dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $nimbus_id, $module_id, $sub_module_id, $err_severity, $regs)); $out = ""; while($query_handle->fetch()) { if ($module eq 0 || ($module_id && uc($module) eq uc($module_id))) { $out .= "$id. $timestamp Error Info: "; $out .= "version=$version, "; $out .= "soc_id=$soc_id, " if (defined $soc_id && length $soc_id); $out .= "socket_id=$socket_id, " if (defined $socket_id && length $socket_id); $out .= "nimbus_id=$nimbus_id, " if (defined $nimbus_id && length $nimbus_id); $out .= "module_id=$module_id, " if (defined $module_id && length $module_id); $out .= "sub_module_id=$sub_module_id, " if (defined $sub_module_id && length $sub_module_id); $out .= "err_severity=$err_severity, " if (defined $err_severity && length $err_severity); $out .= "Error Registers: $regs " if (defined $regs && length $regs); $out .= "\n\n"; $found_module = 1; } } if ($out ne "") { print "HiSilicon KunPeng9xx OEM type1 error events:\n$out\n"; } $query_handle->finish; $query = "select id, timestamp, version, soc_id, socket_id, nimbus_id, module_id, sub_module_id, err_severity, regs_dump from hip08_oem_type2_event_v2$conf{opt}{since} order by id, module_id, err_severity"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $nimbus_id, $module_id, $sub_module_id, $err_severity, $regs)); $out = ""; while($query_handle->fetch()) { if ($module eq 0 || ($module_id && uc($module) eq uc($module_id))) { $out .= "$id. 
$timestamp Error Info: "; $out .= "version=$version, "; $out .= "soc_id=$soc_id, " if (defined $soc_id && length $soc_id); $out .= "socket_id=$socket_id, " if (defined $socket_id && length $socket_id); $out .= "nimbus_id=$nimbus_id, " if (defined $nimbus_id && length $nimbus_id); $out .= "module_id=$module_id, " if (defined $module_id && length $module_id); $out .= "sub_module_id=$sub_module_id, " if (defined $sub_module_id && length $sub_module_id); $out .= "err_severity=$err_severity, " if (defined $err_severity && length $err_severity); $out .= "Error Registers: $regs " if (defined $regs && length $regs); $out .= "\n\n"; $found_module = 1; } } if ($out ne "") { print "HiSilicon KunPeng9xx OEM type2 error events:\n$out\n"; } $query_handle->finish; $query = "select id, timestamp, version, soc_id, socket_id, nimbus_id, sub_module_id, core_id, port_id, err_severity, err_type, regs_dump from hip08_pcie_local_event_v2$conf{opt}{since} order by id, sub_module_id, err_severity"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $nimbus_id, $sub_module_id, $core_id, $port_id, $err_severity, $err_type, $regs)); $out = ""; while($query_handle->fetch()) { if ($module eq 0 || ($sub_module_id && uc($module) eq uc($sub_module_id))) { $out .= "$id. 
$timestamp Error Info: "; $out .= "version=$version, "; $out .= "soc_id=$soc_id, " if (defined $soc_id && length $soc_id); $out .= "socket_id=$socket_id, " if (defined $socket_id && length $socket_id); $out .= "nimbus_id=$nimbus_id, " if (defined $nimbus_id && length $nimbus_id); $out .= "sub_module_id=$sub_module_id, " if (defined $sub_module_id && length $sub_module_id); $out .= "core_id=$core_id, " if (defined $core_id && length $core_id); $out .= "port_id=$port_id, " if (defined $port_id && length $port_id); $out .= "err_severity=$err_severity, " if (defined $err_severity && length $err_severity); $out .= "err_type=$err_type, " if (defined $err_type && length $err_type); $out .= "Error Registers: $regs " if (defined $regs && length $regs); $out .= "\n\n"; $found_module = 1; } } if ($out ne "") { print "HiSilicon KunPeng9xx PCIe controller error events:\n$out\n"; } $query_handle->finish; $query = "select id, timestamp, version, soc_id, socket_id, totem_id, nimbus_id, sub_system_id, module_id, sub_module_id, core_id, port_id, err_type, pcie_info, err_severity, regs_dump from hisi_common_section_v2$conf{opt}{since} order by id, module_id, err_severity"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $totem_id, $nimbus_id, $sub_system_id, $module_id, $sub_module_id, $core_id, $port_id, $err_type, $pcie_info, $err_severity, $regs)); $out = ""; while($query_handle->fetch()) { if ($module eq 0 || ($module_id && uc($module) eq uc($module_id))) { $out .= "$id. 
$timestamp Error Info: "; $out .= "version=$version, "; $out .= "soc_id=$soc_id, " if (defined $soc_id && length $soc_id); $out .= "socket_id=$socket_id, " if (defined $socket_id && length $socket_id); $out .= "totem_id=$totem_id, " if (defined $totem_id && length $totem_id); $out .= "nimbus_id=$nimbus_id, " if (defined $nimbus_id && length $nimbus_id); $out .= "sub_system_id=$sub_system_id, " if (defined $sub_system_id && length $sub_system_id); $out .= "module_id=$module_id, " if (defined $module_id && length $module_id); $out .= "sub_module_id=$sub_module_id, " if (defined $sub_module_id && length $sub_module_id); $out .= "core_id=$core_id, " if (defined $core_id && length $core_id ); $out .= "port_id=$port_id, " if (defined $port_id && length $port_id); $out .= "err_type=$err_type, " if (defined $err_type && length $err_type); $out .= "pcie_info=$pcie_info, " if (defined $pcie_info && length $pcie_info); $out .= "err_severity=$err_severity, " if (defined $err_severity && length $err_severity); $out .= "Error Registers: $regs" if (defined $regs && length $regs); $out .= "\n\n"; $found_module = 1; } } if ($out ne "") { print "HiSilicon KunPeng9xx common error events:\n$out\n"; } $query_handle->finish; } # THead Yitian7xx ddr errors if ($platform_id eq THEAD_YITIAN_7XX) { $found_platform = 1; $query = "select id, timestamp, address, regs_dump from yitian_ddr_reg_dump_event order by id"; $query_handle = $dbh->prepare($query); $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $address, $regs_dump)); $out = ""; while($query_handle->fetch()) { $out .= "$id. 
$timestamp "; $out .= "Error Address: $address "; $out .= "Error Registers Dump: $regs_dump" if ($regs_dump); $out .= "\n\n"; } if ($out ne "") { print "THead Yitian710 DDRC error events:\n$out\n"; } else { print "No THead Yitian710 DDRC error events.\n"; } $query_handle->finish; } # JaguarMicro CorsicaDpu1xx errors if ($platform_id eq JM_CORSICA_DPU1XX) { $found_platform = 1; $query = "select id, timestamp, version, soc_id, subsystem, module, module_id, sub_module, submodule_id, dev, dev_id, err_type, err_severity, regs_dump from jm_payload0_event$conf{opt}{since} order by id, module_id, err_severity"; $query_handle = $dbh->prepare($query); if ($query_handle) { $query_handle->execute(); $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $subsystem, $module_name, $module_id, $sub_module, $sub_module_id, $dev, $dev_id, $err_type, $err_severity, $regs)); $out = ""; while($query_handle->fetch()) { if ($module eq 0 || ($module_id && uc($module) eq uc($module_id))) { $out .= "$id. 
$timestamp Error Info: "; $out .= "version=$version, "; $out .= "soc_id=$soc_id, " if (defined $soc_id && length $soc_id); $out .= "subsystem=$subsystem, " if (defined $subsystem && length $subsystem); $out .= "module=$module_name, " if (defined $module_name && length $module_name); $out .= "module_id=$module_id, " if (defined $module_id && length $module_id); $out .= "sub_module=$sub_module, " if (defined $sub_module && length $sub_module); $out .= "submodule_id=$sub_module_id, " if (defined $sub_module_id && length $sub_module_id); $out .= "dev=$dev, " if (defined $dev && length $dev); $out .= "dev_id=$dev_id, " if (defined $dev_id && length $dev_id); $out .= "err_type=$err_type, " if (defined $err_type && length $err_type); $out .= "err_severity=$err_severity, " if (defined $err_severity && length $err_severity); $out .= "Error Registers: $regs " if (defined $regs && length $regs); $out .= "\n\n"; $found_module = 1; } } if ($out ne "") { print "JaguarMicro Corsica DPU1xx OEM type0 error events:\n$out\n"; } $query_handle->finish; } } if ($platform_id && !($found_platform)) { print "Platform ID $platform_id is not valid\n"; } elsif ($module && !($found_module)) { print "No error record for the module $module\n"; } undef($dbh); } sub vendor_platforms { print "\nSupported platforms for the vendor-specific errors:\n"; print "\tHiSilicon KunPeng9xx, platform-id=\"", HISILICON_KUNPENG_9XX, "\"\n"; print "\tTHead Yitian7xx, platform-id=\"", THEAD_YITIAN_7XX, "\"\n"; print "\tJaguarMicro CorsicaDpu1xx, platform-id=\"", JM_CORSICA_DPU1XX, "\"\n"; print "\n"; } sub log_msg { print STDERR "$prog: ", @_ unless $conf{opt}{quiet}; } sub log_error { log_msg ("Error: @_"); } # vi: ts=4 sw=4 expandtab 07070100000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000B00000000TRAILER!!!1446 blocks
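`get_cper_data_text` turns the extlog `cper_data` blob into text. It unpacks the blob with the template `'QSSSSSSSSQQQSSS'` and emits only the fields whose `CPER_MEM_VALID_*` bit is set in the leading mask. The standalone sketch below shows that round trip on a synthetic record. It assumes the blob uses the same packed, native-endian layout and that Perl was built with 64-bit integer support, which `pack`/`unpack` need for `Q`. The helper `decode_cper_mem_compact` and the sample values are hypothetical and are not part of ras-mc-ctl.

```perl
use strict;
use warnings;

# Same unpack template as get_cper_data_text: a u64 validation mask,
# eight u16 location fields, three u64 IDs and three u16 rank/handle
# fields, packed with no padding in native byte order.
# Assumes a Perl built with 64-bit integer support (needed for 'Q').
my $TEMPLATE = 'QSSSSSSSSQQQSSS';

my @FIELDS = qw(node card module bank device row column bit_pos
                requestor_id responder_id target_id
                rank mem_array_handle mem_dev_handle);

# Validation bit per field, copied from the CPER_MEM_VALID_* constants
my %VALID_BIT = (
    node             => 0x0008,  card         => 0x0010,
    module           => 0x0020,  bank         => 0x0040,
    device           => 0x0080,  row          => 0x0100,
    column           => 0x0200,  bit_pos      => 0x0400,
    requestor_id     => 0x0800,  responder_id => 0x1000,
    target_id        => 0x2000,  rank         => 0x8000,
    mem_array_handle => 0x10000, mem_dev_handle => 0x20000,
);

# Hypothetical helper: decode one record into "name=value" pairs,
# keeping only the fields whose validation bit is set.
sub decode_cper_mem_compact {
    my ($blob) = @_;
    my ($bits, @values) = unpack $TEMPLATE, $blob;
    my %rec;
    @rec{@FIELDS} = @values;
    return join ", ", map { "$_=$rec{$_}" } grep { $bits & $VALID_BIT{$_} } @FIELDS;
}

# Sample record in which only node, bank and row are marked valid
my $sample_bits = $VALID_BIT{node} | $VALID_BIT{bank} | $VALID_BIT{row};
my $blob = pack $TEMPLATE, $sample_bits,
    1, 2, 3, 4, 5, 6, 7, 8,     # node .. bit_pos
    0, 0, 0,                    # requestor_id, responder_id, target_id
    9, 10, 11;                  # rank, mem_array_handle, mem_dev_handle

my $decoded = decode_cper_mem_compact($blob);
print "$decoded\n";    # prints: node=1, bank=4, row=6
```

The sample record is built with `pack` and decoded with the same template, so the byte layout is the only thing the two sides must agree on. The script prints `requestor_id`, `responder_id` and `target_id` as bare hex values. The sketch deliberately simplifies this by labelling every field and printing it in decimal. Real records come from the `cper_data` column of the `extlog_event` table.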