Added Airbrake notifications to StudyJsonRecord model
Setup code to check study statistics each time a load is done
Setup code to only use the Beta API when doing a load
Setup code to compare our data for the Beta API study statistics endpoint for data validation purposes
Add messages to data_verification to know what is happening
Fixed Util::Updater tests
Setup code to create flat files for beta
Setup code to create snapshots for beta
Added an example connections.yml file
Added a script that compares tables between schemas, limited to tables that have only one occurrence of each nct_id
Adds the ability to process the beta api in parallel to speed processing
Currently using 32 threads to speed processing
Quick estimates show a processing time of around 5 hours for the entire database
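The parallel processing described above can be illustrated with a short thread-pool sketch. This is an illustration in Python, not the AACT code itself (AACT is a Ruby on Rails application); `process_study` is a hypothetical stand-in for the work of fetching and saving one study, and only the thread count of 32 comes from the notes.

```python
from concurrent.futures import ThreadPoolExecutor

WORKER_COUNT = 32  # thread count quoted in the release note


def process_study(nct_id):
    """Hypothetical stand-in for fetching and saving one study record."""
    return f"processed {nct_id}"


def process_all(nct_ids, workers=WORKER_COUNT):
    """Process studies concurrently and return results in input order."""
    # Threads suit this kind of workload when each task mostly waits on I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with nct_ids.
        return list(pool.map(process_study, nct_ids))


print(process_all(["NCT00000102", "NCT00000104"]))
```

A pool keeps the number of simultaneous requests bounded, which matters when the remote API may throttle heavy clients.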
Created a migration that renames ctgov_beta_group_code to ctgov_group_code
Updated the README so the setup steps are clearer
Added the ability for the StudyJsonRecord.data_verification method to write the results of the comparison to a file
Added a method that loads a small number of studies for development
Renamed ctgov_beta_group_code to ctgov_group_code
Added mesh_type column for browse_conditions and browse_interventions tables
Edited code to save more MeSH types to the database
Changed factory_girl to the factory_bot gem
Added Airbrake for tracking errors
Added database explanation and visualization to the README
Edited how the downcase mesh terms are created so they would be updated regularly
Edited the way the loads are run to better switch between schemas
Added environmental variables for the ctgov_beta schema and database
We renamed "ctgov.categories" to "ctgov.search_results" to better reflect that the table is a collection of search results saved from queries in the "ctgov.study_searches" table. We also added a categories view so any queries you have written for categories should still work. Please reach out if you have any issues.
The saved queries in the "ctgov.study_searches" table are used to search ClinicalTrials.gov.
We upgraded Rails to version 6 and updated all dependencies.
The source code was also updated to reflect this and security was strengthened.
ClinicalTrials.gov has a beta API they will be changing over to in the future. We built alternate code to process information the way that API delivers it. That API also provides additional information, which we are capturing as well.
The code that uses the regular API is still in use. The regular and the beta code run in parallel in AACT but store data in different databases.
The column ctgov_group_code is named ctgov_beta_group_code. Please note the codes themselves are different between the Beta and regular API on ClinicalTrials.gov.
Empty Design and Enrollment objects are not created if there is no data for them.
There are more agency_types. However, the number of agencies is the same between ctgov and ctgov_beta.
Reported event totals are handled differently in Beta on ClinicalTrials.gov. There is no total all cause mortality.
There is no download date for StudyJsonRecords, because none is provided by the ClinicalTrials.gov Beta API.
The Beta API provides Reported_event_totals with additional data for the following columns: restriction_type, other_details and restrictive_agreement.
When no data is provided for a column the value is saved as null/nil (rather than an empty string).
There are some capitalization differences and minor verbiage differences.
Typically, two to four thousand studies are updated each day in ClinicalTrials.gov. On Nov 20, 2019, the ClinicalTrials.gov RSS Feed implemented a feature to restrict the number of studies returned per RSS request; the maximum number of studies returned is now 1,000. Therefore, since the full load was run on Dec 1, 2019, AACT has been importing info for only 1,000 modified studies per night. (All new studies have been correctly imported into AACT; this problem only applies to studies that have changed.)
The AACT process that imports studies via the RSS Feed has been modified to send repeated requests to the RSS Feed until we have collected all studies that have been modified during the past specified number of days.
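The repeated-request approach can be sketched as a loop that keeps asking for the next page until it has reached studies older than the cutoff. This is a Python illustration under stated assumptions, not the AACT implementation: `fetch_page` is a hypothetical callable that returns up to 1,000 `(nct_id, last_changed)` tuples, newest first, starting at a given offset. The real feed's request parameters are not described in these notes.

```python
from datetime import date, timedelta

PAGE_LIMIT = 1000  # maximum studies returned per request, per the note above


def collect_modified(fetch_page, days_back, today):
    """Request successive pages until every study changed since the cutoff is seen.

    fetch_page(offset) is a hypothetical callable returning up to PAGE_LIMIT
    (nct_id, last_changed) tuples, ordered newest first.
    """
    cutoff = today - timedelta(days=days_back)
    collected = {}
    offset = 0
    while True:
        page = fetch_page(offset)
        for nct_id, changed in page:
            if changed >= cutoff:
                collected[nct_id] = changed
        # Stop on a short page, or once this page reaches past the cutoff.
        if len(page) < PAGE_LIMIT or page[-1][1] < cutoff:
            break
        offset += len(page)
    return sorted(collected)
```

Keying the results by NCT ID means a study that appears on two pages (because the feed shifted between requests) is only counted once.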
We have continued to refactor code to make it easier for others to replicate the Ruby on Rails application that retrieves data from ClinicalTrials.gov to populate the AACT database. We replaced hard-coded references for database names & file locations with environment variables. We also removed some obsolete & unused code.
To make it easier for others to replicate the AACT component that creates a relational database of ClinicalTrials.gov data, we've refactored code, simplified environment variables & modified the README file in the AACT git repository.
When we upgraded to PostgreSQL 11.1, the format of the database log files changed slightly. As a result, the process that parses the log files failed to gather/summarize information. This has been fixed.
Someone set up unproductive bots that were consuming AACT database resources and dramatically slowing response time for others. We've implemented a feature that allows us to block IP addresses when we detect this type of activity.
We increased the maximum number of connections allowed to the live database due to a recent, significant increase in the number of people accessing it. We continue to investigate other ways to support this open database model as activity increases.
We have renamed the AACT Projects feature to better describe the type of information it provides. We modified some navigation flows on the Data Share page to hopefully make it less confusing.
Since April 2019, when we upgraded to PostgreSQL 11.1, the process that creates a static copy of the database has been unreliable. The static database copy serves two purposes: 1) it's used to refresh the live version of the AACT database that is available to the public and 2) it is made available for download so that others may use it to create their own personal copy of the database. The pg_restore command used to refresh the public database has frequently failed since the 11.1 upgrade. Until now, we've been using the 9.6 version of the pg_dump command to create a copy of the database each night. To solve the problem with the nightly restore to the public database, we started using the 11.1 version of pg_dump to create the static database copy.
The table that lists & defines all AACT database tables/columns suddenly stopped appearing on the Data Dictionary page. We upgraded the jQuery plugin jsgrid so this data definition table would again appear on the page.
The 3 AACT applications (AACT, AACT-Admin & AACT-Proj) have been upgraded to ruby v2.4.5.
For technical people who would like to create their own instance of the AACT core application (the part that pulls data from ClinicalTrials.gov and populates a relational database), we have updated documentation in the ReadMe file that appears on the AACT main page in github. This documentation describes how to clone AACT from github and create a local, working copy. The documentation has been improved, but it is a work-in-progress. We will add more information in the next release of AACT.
AACT now includes the AACT-based research project: Characteristics of Clinical Trials Registered in ClinicalTrials.gov, 2007-2010. Datasets for this project are presented as tables in the proj_tag_study_characteristics schema. The tagged_terms table lists 2010 MeSH & free text terms that were determined to be associated with one (or more) of three clinical specialties: Mental Health, Oncology & Cardiovascular. (Each term is tagged with each related clinical specialty.) The new schema also includes a table of analyzed studies for each of the clinical specialties: mental_health_studies, oncology_studies & cardiovascular_studies. More information & downloadable datasets can be found on the Projects page.
To help us present more detailed information about each project, we've added a per-project page that may include the following information:
Click here to see the individual page for the newly added project: Characteristics of Clinical Trials Registered in ClinicalTrials.gov, 2007-2010.
The PostgreSQL public database was upgraded from version 9.6.6 to 11.1.
To make the database more robust and easier to understand, foreign key constraints have been added for all table relationships and unique constraints have been added for the NCT ID in tables that should only have one row per study:
We use pgModeler to produce the schema diagrams that appear on the AACT website. Because the previous version didn't work with PostgreSQL 11.1, we upgraded this tool to version 0.9.2-alpha1 & refreshed the schema diagrams. As a result, schema diagrams (such as the one for the main ctgov schema) now display additional information about primary keys, indexes, constraints, etc.
The mesh_terms table now contains the 2019 set of MeSH terms. (We used the 11/1/18 version of mtrees2019.bin retrieved from NLM's FTP server: ftp://nlmpubs.nlm.nih.gov/online/mesh/MESH_FILES/meshtrees/) The 2018 MeSH terms are now available in the mesh_archive schema (table name: Y2018_mesh_terms).
The values in calculated_values.minimum_age_unit & maximum_age_unit all had a leading space, so 'Years' was presented as '_Years' & 'Months' as '_Months'. This has been fixed so that these values will no longer include leading spaces.
To address security vulnerabilities detected by github, we upgraded the sprockets, loofah & rubyzip gems. Several other unused gems were removed.
On February 6, 2019, the National Library of Medicine (NLM) started making information about the documents provided with the study available through their API. AACT now includes this information in a table named provided_documents. The NLM briefly describes this information as follows:
"Data providers can submit documents including the Study Protocol, Informed Consent Form (ICF), and Statistical Analysis Plan (SAP), possibly all in the same pdf document. These documents are archived, made available through the ClinicalTrials.gov site, and are now described in the Public XML."
The CDEK Standard Orgs project documentation that appears on the Projects page did not note that the set of organizations included in the AACT database are those identified as the sponsor, overall official or responsible party of interventional drug trials. We have modified this documentation to better describe this project.
We modified summary information on the Data Dictionary page to better describe the schemas recently added to the AACT database.
We added a table that defines a set of views in the AACT database that had previously been undocumented, all of which have the prefix 'all_'. Information about these views is now available on the Data Definitions page. These 'all_ views' provide concatenated value strings for various one-to-many study relationships. The values are delimited with a bar character (|). For example, the study NCT00000146 has 3 rows in the browse_conditions table, so one row for this study is in the all_browse_conditions view that provides this value: 'Multiple Sclerosis|Neuritis|Optic Neuritis'. These views are useful for those who need to export a spreadsheet of studies where each row represents one study, and the row includes one-to-many data values. More information about these views can be found near the bottom of the Data Definitions page.
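The concatenation these views perform can be shown with a small Python sketch. It is only an illustration of the idea (the views themselves are defined in the database); the function name is hypothetical, and the sample rows reuse the NCT00000146 example from the paragraph above.

```python
from collections import defaultdict


def concatenate_names(rows, delimiter="|"):
    """Collapse (nct_id, name) rows into one delimited string per study."""
    grouped = defaultdict(list)
    for nct_id, name in rows:
        grouped[nct_id].append(name)
    return {nct_id: delimiter.join(names) for nct_id, names in grouped.items()}


rows = [
    ("NCT00000146", "Multiple Sclerosis"),
    ("NCT00000146", "Neuritis"),
    ("NCT00000146", "Optic Neuritis"),
]
print(concatenate_names(rows)["NCT00000146"])
# prints: Multiple Sclerosis|Neuritis|Optic Neuritis
```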
To be consistent, each of these 'all_' views now has a column called 'names' which presents the concatenated list of the values from the table represented by that view. In short, the column containing this concatenated list of values is now named 'names' in every one of these views. Previously, some of these views used other names for these columns; for example, the all_conditions view previously used the column name 'conditions' instead of 'names'. If people referenced these undocumented views in the past, they need to be aware that some of these column names have changed.
We'd like to see what percent of studies plan to share individual patient data (IPD), and how it might change over time, so we added the studies.plan_to_share_ipd attribute to our 'enumerations list'. After each nightly update, we recalculate the ratio/distribution of values found in each attribute in the 'enumerations list' and display this info in the Enumerations column of the Data Dictionary table. (You need to scroll to the right to see this column.) Twice a month we save the enumerations data so we can monitor how the values in these attributes might change over time. For more information about the Enumerations feature, see v3.0.2 release notes.
Some database users have a single quote in their last name. When restoring the Users table, the restore function will become confused and give up unless these single quotes are escaped. We have updated the instructions to remind the administrator to escape all single quotes that appear in user names before running the restore command.
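As an illustration of the escaping step, the sketch below doubles each single quote, which is the standard SQL way to embed a quote inside a string literal. It is written in Python as an example only. The notes do not say which escaping form the AACT instructions use, so treat the doubling convention here as an assumption.

```python
def escape_single_quotes(value):
    """Double each single quote so the value is safe inside a SQL string literal."""
    return value.replace("'", "''")


print(escape_single_quotes("O'Brien"))
# prints: O''Brien
```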
The AACT database now includes a set of supplemental schemas that present datasets collected & curated during previous AACT-based research. By including these data within the AACT database, the public can benefit from work that has been performed by other investigators. Since the information is directly accessible, it may be incorporated into queries on current clinical trials. It also serves to make previous research more transparent and help AACT users better understand assertions made by the previous investigators.
Database schemas are used to differentiate project-related data from ClinicalTrials.gov data. Data from ClinicalTrials.gov remain available in the ctgov schema and each project has a database schema in which the datasets for that project are available. All project schemas are prefixed with 'proj_'. With the release of AACT 4.1.0, all users of the live AACT database have immediate access to this information.
Datasets from the following three AACT-based research projects have been made available in this release:
proj_results_reporting: Anderson ML, Chiswell K, Peterson ED, Tasneem A, Topping J, Califf RM. Compliance with results reporting at ClinicalTrials.gov. New England Journal of Medicine. 2015 Mar 12;372(11):1031-9.
proj_cdek_standard_orgs: Griesenauer R, Schillebeeck C, Kinch MS. CDEK: Clinical Drug Experience Knowledgebase. bioRxiv The Preprint Server for Biology. 2018 November 19
proj_tag_nephrology: Inrig JK, Califf RM, Tasneem A, Vegunta RK, Molina C, Stanifer JW, Chiswell K, Patel UD. The landscape of clinical trials in nephrology: a systematic review of ClinicalTrials.gov. American Journal of Kidney Diseases. 2014 May 1;63(5):771-80.
Projects are described on the AACT website Projects Page. Definitions for each project's tables & columns are also defined in the Data Dictionary.
This feature will continue to be developed and your feedback is appreciated. Please email the AACT team with questions and suggestions.
This feature has been implemented as a separate Ruby on Rails application. AACT is now comprised of 3 applications: 1) AACT Core, 2) AACT Admin & 3) AACT Projects. All code for these three components is publicly available in github. Note: Implementing this feature required some changes to the way ClinicalTrials.gov data is loaded into AACT. Details about these changes are available upon request.
The National Library of Medicine (NLM) updates the MeSH thesaurus each year. To facilitate access to the set of terms used by previous research projects, a new schema named mesh_archive has been added to the live AACT database. Tables in this schema are named yYYYY_mesh_terms where YYYY identifies the version of that set of terms. For example, the 2010 set of MeSH terms is available in mesh_archive.y2010_mesh_terms.
The Calculated_Values table has 3 new columns that provide the number of primary, secondary & other outcome measures:
These are integer columns. Values are calculated by summing the number of rows in the design_outcomes table per study where outcome_type is primary/secondary/other.
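The calculation described above amounts to counting design_outcomes rows per study and outcome type. The Python sketch below shows that counting on in-memory rows; it is illustrative only, and the function name and the dictionary keys are hypothetical rather than the actual AACT column names.

```python
from collections import Counter

OUTCOME_TYPES = ("primary", "secondary", "other")


def outcome_counts(design_outcomes):
    """Count rows per study and outcome type.

    design_outcomes is an iterable of (nct_id, outcome_type) pairs.
    """
    counts = Counter(
        (nct_id, outcome_type.lower()) for nct_id, outcome_type in design_outcomes
    )
    studies = {nct_id for nct_id, _ in counts}
    # Missing combinations count as 0 because Counter returns 0 for absent keys.
    return {
        nct_id: {kind: counts[(nct_id, kind)] for kind in OUTCOME_TYPES}
        for nct_id in studies
    }
```

For example, a study with one primary and two secondary rows yields `{"primary": 1, "secondary": 2, "other": 0}`.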
The 'Row Count' & 'DB Section' columns have been removed from the data dictionary because this information is displayed further down on the same page in the section that defines AACT tables. (The information is table-specific, not column specific, so belongs in the section that describes the tables.) A column has been added to the data dictionary to display the database schema name. Although the schema name is also table-specific (not column-specific), it is presented in the data dictionary because, as a searchable column, it can be used to filter on all rows associated with a certain schema/project.
The data dictionary now includes rows to describe all project-related tables and columns.
The pagination numbers at the bottom of the data dictionary table were scrunched together. This has been fixed.
For AACT Admins Only: The AACT website page which lists information about all users has been enhanced; the information is now sortable & the table includes pagination. When the option to download user information as a CSV or Excel file is selected, the download contains only information that is of potential interest; attributes containing encrypted values no longer appear in the file. (This page is only accessible to AACT administrators.)
To simplify the management of user database accounts, we have created a role named 'read_only' in the AACT database and now assign all AACT users to this role. With this change, we are able to grant/revoke privileges to/from this one role rather than having to do it for each individual database user. (The search path must be specified for each individual user however, since it is not inheritable via the associated role.)
A process now records the total number of times each username submits a call to the public database. Currently, the process only collects information about the number of times a user makes a call to the database; it does not track the actual queries. The process uses a shell script to parse the public database logs every Sunday, counts the number of times each user posted a database event and saves this information to the db_user_activities table in the aact_admin database.
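The counting step can be sketched as follows. AACT does this with a shell script; the Python version below is only an illustration. It assumes each relevant log line contains a `user=<name>` token, which depends on how PostgreSQL's `log_line_prefix` is configured. That format is an assumption and is not taken from the AACT configuration.

```python
import re
from collections import Counter

# Hypothetical pattern: assumes the log prefix includes "user=<name>".
USER_PATTERN = re.compile(r"user=(\w+)")


def count_user_events(log_lines):
    """Count how many log lines are attributed to each database user."""
    counts = Counter()
    for line in log_lines:
        match = USER_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Only the user name is extracted, which matches the stated intent of counting calls without recording the queries themselves.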
Until now, the AACT database has provided two of the six data elements related to individual participant data (IPD) sharing: 1) a yes/no value indicating whether the study planned to share this information and 2) a description of the plan. On August 24, 2018, the National Library of Medicine (NLM) added the other four IPD-related attributes to the ClinicalTrials.gov API, so they are now available in the Studies table of the AACT database.
The 'has_us_facility' value saved to the CalculatedValues table is now set to 'true' for studies that have at least one facility in the United States or a US Territory. The decision to include US Territories was based on NIH's 'Checklist for Evaluating Whether a Clinical Trial or Study is an Applicable Clinical Trial (ACT) Under 42 CFR 11.22(b) for Clinical Trials Initiated on or After January 18, 2017'. (A country is considered a US Territory if it is one of those defined by the World Atlas.)
After the full database refresh that happens on the first of each month, we delete the previous month's set of daily static database copies and pipe-delimited file sets. Until now, this process has been manual. A process has been implemented to automatically remove these files on the first of the month.
With the release of AACT 4.0.0, we divided AACT into two separate applications: AACT & AACT-ADMIN. Some unnecessary code was left over in both apps. We've gone through and cleaned up the apps to remove superfluous code.
Previously, user information was backed up as one of the final steps in the nightly data load process. Now administrative tasks are performed by the AACT-ADMIN application, so user info backups are no longer a part of the data load process. (The AACT-ADMIN application is now responsible for backing up user info and the AACT application for loading the database.) A cron job has been set up to back up user information every morning at 4am.
When user information is backed up each morning, AACT administrators receive an email message that includes the backup file attachments and instructions about how to recover info from these files. The instructions in this email have been improved.
To simplify the process to grant/revoke user access to the public database, shell scripts have been created that can be quickly run to perform these tasks. The scripts are also used by rspec tests to confirm user maintenance functionality.
A user noted a critical error in the website documentation that describes how to create a local copy of the AACT database. The command to restore the database from a dump file downloaded from the website identified the default database 'postgres' rather than the aact database. The command has been corrected:
-> pg_restore -e -v -O -x --dbname=aact --no-owner --clean --create ~/Downloads/postgres_data.dmp
AACT has been divided into two applications: one solely dedicated to populating the AACT relational database with data from ClinicalTrials.gov and the other to manage all other supporting functionality such as maintaining user accounts and hosting this website. Both applications use Ruby on Rails and PostgreSQL, and are publicly available on github:
Users will not be directly affected by this change; it simply makes it easier to support the system and positions AACT to be more easily replicated by other organizations/people.
To comply with Article 17 of the General Data Protection Regulation (aka 'The Right to be Forgotten'), we have verified that AACT does not save any information about a user who has chosen to be removed from AACT.
Tables added to AACT in version 3.1.2 (Documents & Pending_Results) are now defined in the Table Definition table on the Data Dictionary page of the AACT website.
PostgreSQL recognizes mixed-case objects and requires double quotes when managing such objects. To avoid confusion and complexity, we now prevent the creation of mixed-case database usernames.
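A check of this kind can be as simple as comparing the requested name with its lowercase form, since PostgreSQL folds unquoted identifiers to lowercase. The Python sketch below is illustrative only and is not the validation AACT actually uses; the function name is hypothetical.

```python
def is_valid_username(username):
    """Accept only names that do not depend on case, so no double quotes are needed."""
    return username == username.lower()


print(is_valid_username("jsmith"))  # prints: True
print(is_valid_username("JSmith"))  # prints: False
```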
Added a page for technical documentation. (Accessible to AACT administrators only)
Added a page of instructions to stand up an instance of AACT on a Windows 10 machine.
On May 9, 2018, the National Library of Medicine (NLM) added data about 'pending results' to the ClinicalTrials.gov API. A Pending_Results table has been added to the AACT database to present this new information.
NLM provides result submission date(s) for studies that have results awaiting quality control (QC) review. The results themselves are not publicly posted until the review is complete. The dates for three types of events related to results submission are reported in the Pending_Results table:
The NLM reports that the following updates occur to this information when a study passes the quality control review:
The ClinicalTrials.gov API provides information about & links to documents related to a study. NLM provides the following information about these data:
The full study protocol and statistical analysis plan must be uploaded as part of results information submission, for studies with a Primary Completion Date on or after January 18, 2017. The protocol and statistical analysis plan may be optionally uploaded before results information submission and updated with new versions, as needed. Informed consent forms may optionally be uploaded at any time.
AACT now saves this information to the Documents table. Please refer to NLM Results Data Element Definitions and the AACT Data Dictionary for more detailed information about study documents.
On May 3, 2018, NLM posted this comment to their API schema documentation:
As promised in 08/30/2017 entry above, old redundant date names have been retired and their tags removed. Please update systems to stop using the date on the left in favor of the date on the right.
|obsolete tag|replacement tag|
|<firstreceived_date>|<study_first_submitted>|
|<firstreceived_results_date>|<results_first_submitted>|
|<firstreceived_results_disposition_date>|<disposition_first_submitted>|
|<lastchanged_date>|<last_update_submitted>|
All these date attributes are stored in the Studies table. On January 22, 2018, the obsolete date tags/columns were identified as deprecated and new columns were added that mimic the new labels defined by NLM. The columns are:
|Deprecated Column||Replacement Column|
With this release, the deprecated columns have been removed.
ClinicalTrials.gov has made changes to the API (adding new tags; removing deprecated tags), so we needed to update the studies used by automated test scripts; the tests need to use data that accurately represents the current structure of the ClinicalTrials.gov API. The latest versions of all test studies were downloaded and test scripts were updated to address all changes.
To ensure we're able to recover user account information if necessary, we have added a step to the nightly update process that extracts all data from user-related tables and user account information and emails this to AACT Administrators along with instructions about how to run the scripts to restore the information.
A page to display all registered users has been added. It is only accessible to AACT administrators.
The documentation that explains how to use SAS to connect to AACT needed to be tweaked. The sample script was missing the line that identifies the user's password. We also fixed some awkward-looking fonts.
All data retrieved from ClinicalTrials.gov is saved into a schema named 'ctgov'. Before, when standing up a new instance of the AACT database, we needed to manually create the ctgov schema, grant privileges to the database administrator and define 'ctgov' as the default schema. We have now modified the database initialization process so that the ctgov schema is automatically created so that the tables, views and indexes are saved there without requiring any extra manual steps.
If a user forgot their password and clicked the link to receive an email to reset it, the process raised an error after they entered their password and confirmation password. This bug has been fixed.
Prior to Version 3.1.0, the AACT database did not own any data; all information in AACT was retrieved from ClinicalTrials.gov. The database could be (and frequently was) wiped out and recreated from this data source.
With the introduction of a user registration feature, AACT is now the system of record for user account information and must therefore ensure copies of user-related information are backed up and can be restored if necessary. We've set up a daily pg_dump process to create copies of the admin database (which contains a table of Users), and a pg_dumpall --globals-only process to save the database accounts (username/password/access rights) created in the publicly accessible AACT database.
As noted, the only reason to back up the public AACT database is to ensure we have restorable copies of user accounts. Since the actual content of the database can be recovered from ClinicalTrials.gov, only account usernames, encrypted passwords and ACL information are backed up.
With this release, users of the live AACT database will need to register and receive an individual user account to access the database. Individual accounts will replace the single common login-name/password (aact/aact) that has been used until now. To register and get a database account, please visit the AACT website and click Sign-Up in the upper right corner of any page.
The registration process is automated, using standard methods to verify the email address you provide. This should take about 5 minutes. If you have questions or encounter problems, please send email with the word 'registration' in the subject line to firstname.lastname@example.org.
While your login-name & password will change from aact/aact to the login-name/password you define, all other connection information (hostname, database name, and port number) will remain the same.
The previous login-name/password (aact/aact) will remain active for several weeks while people become aware of this new requirement and have the chance to create and test their new database account.
User registration will allow us to contact people about scheduled downtimes and other events. It also helps us monitor and manage database activity.
You can download static copies of the database and the pipe-delimited flat file sets without creating an account; if you only use these resources, you need not register unless you wish to receive email notifications.
In preparation for future enhancements that will provide supplemental information to enhance/annotate ClinicalTrials.gov data, all current AACT tables (i.e., tables containing only data retrieved from ClinicalTrials.gov) have been moved to a schema named 'ctgov'. All database user accounts will define 'ctgov' as the default schema, so SQL queries need not specify this.
Queries created to run against the previous version of AACT that do not explicitly prefix table names with 'public.' should continue to run without needing any change. If however, your queries have prepended 'public.' to the table names, you will need to either remove these prefixes or change them to 'ctgov.'
Note: This change has no impact on users of the pipe-delimited flat file extracts.
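For anyone with many saved queries that use the 'public.' prefix, the change can be applied mechanically. The Python sketch below is a hedged illustration: the function name is hypothetical, and the simple pattern would also rewrite 'public.' inside string literals or comments, so review the output before relying on it.

```python
import re

# Match "public." only when "public" starts at a word boundary.
PUBLIC_PREFIX = re.compile(r"\bpublic\.")


def retarget_schema(sql, schema="ctgov"):
    """Replace the 'public.' schema prefix with the given schema name."""
    return PUBLIC_PREFIX.sub(f"{schema}.", sql)


print(retarget_schema("SELECT count(*) FROM public.studies"))
# prints: SELECT count(*) FROM ctgov.studies
```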
Until now, downloadable copies of the AACT database (a static pg_dump copy and a set of 40 pipe-delimited flat files) have been created once a month and made available on the download page of the AACT website. Several people have expressed interest in getting these downloadable resources more frequently. As of this release, a static copy of the database and a set of pipe-delimited files are created & published to the download page after each nightly load.
To prevent the accumulation of hundreds of copies of the database through the year, these daily copies will be available for download only until the end of the month. Downloadable copies made on the first of the month will continue to be archived and made permanently available via the website. Both daily and monthly downloadable files can be retrieved from the download page of the AACT website.
(Prior to January, 2018, downloadable copies were created monthly, but not on the first. Going forward, these should be consistently created and dated on the first of each month.)
A page displaying the schedule for updating the database has been added to the AACT website; it is accessible from a display card at the bottom of the 'Learn More' section of the site.
Some columns contain a limited number of possible values; several such columns are enumerated in the Data Dictionary, displaying the total number of rows with each value and the percent distribution. For example: on February 28, 2018 the enumeration summary for Designs.primary_purpose was:
We are now saving enumeration information to an administrative table so that trends can be identified with the passage of time. This information will also help us verify the accuracy of updates by comparing current percent distributions to previous distributions. If values change dramatically, an alert is sent to AACT administrators.
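The comparison of current and previous distributions can be sketched as below. This is an illustrative Python example, not the AACT code. The function names are hypothetical, and the 10-percentage-point threshold is an assumption, because the notes do not define what counts as a dramatic change.

```python
from collections import Counter


def percent_distribution(values):
    """Return the percentage of rows holding each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: round(100.0 * n / total, 2) for value, n in counts.items()}


def dramatic_changes(previous, current, threshold=10.0):
    """List values whose share moved by more than `threshold` percentage points."""
    keys = set(previous) | set(current)
    return sorted(
        (k for k in keys if abs(current.get(k, 0.0) - previous.get(k, 0.0)) > threshold),
        key=str,
    )
```

A value that appears in only one of the two snapshots is treated as having a share of 0 in the other, so newly introduced or vanished values are flagged when their share exceeds the threshold.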
We have improved the process that updates the AACT database by making the following changes:
Each night we refresh a 'background' copy of the AACT database and then use pg_restore to copy it to the publicly accessible database. In the past, if people were logged into the public database, the pg_restore command hung and the refresh failed. To prevent this, all database sessions are now terminated before the update process runs the pg_restore command. This typically occurs around 1am EST.
To prevent users from logging in while the refresh is under way, the process locks the public database before starting. Until now, if the refresh terminated unexpectedly, the database remained locked and inaccessible. This has been fixed. Now we automatically detect when the process fails, unlock the public database, and send an email notification to AACT administrators to report the failure.
A validation test has been added to prevent the public database from being refreshed if the number of studies in the updated database appears to have decreased.
The email notification that is automatically sent to AACT administrators after every database refresh now provides the list of NCT IDs that were added or updated. If the refresh failed, this is now noted in the subject line.
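Taken together, the refresh changes listed above might be sketched roughly as follows. This is an illustration only: the method name, the RefreshAborted error, and the callables passed in through `steps` are hypothetical stand-ins, not AACT's actual code.

```ruby
# Hypothetical error raised when the validation test fails.
class RefreshAborted < StandardError; end

# `steps` is a hash of callables (:lock, :terminate_sessions, :restore,
# :unlock, :notify) standing in for the real operations.
def refresh_public_database(previous_count:, updated_count:, steps:)
  locked = false

  # Validation: do not refresh if the number of studies appears to have decreased.
  if updated_count < previous_count
    raise RefreshAborted, "study count dropped from #{previous_count} to #{updated_count}"
  end

  steps.fetch(:lock).call               # keep users from logging in
  locked = true
  steps.fetch(:terminate_sessions).call # open sessions would make pg_restore hang
  steps.fetch(:restore).call            # e.g. run pg_restore against the public database
  steps.fetch(:notify).call("AACT refresh succeeded")
rescue StandardError => e
  # Report the failure in the notification subject, then re-raise.
  steps.fetch(:notify).call("AACT refresh FAILED: #{e.message}")
  raise
ensure
  # Always unlock, so a failed refresh cannot leave the database inaccessible.
  steps.fetch(:unlock).call if locked
end
```

Putting the unlock in `ensure` is the design choice that matters here: the database is unlocked whether the restore succeeds or raises.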
Every table has an NCT_ID column that serves as the foreign key to the Studies table. These columns need to be indexed so that queries run within a reasonable amount of time. Until now, these indexes were missing.
The database server's 60 GB of disk space was inadequate, with usage exceeding 90%. We have upgraded the server's resources as follows:
SSD Disk: 60 GB increased to 200 GB
Memory: 16 GB increased to 32 GB
CPUs: 6 vCPUs increased to 16 vCPUs
This 'Release Notes' page has been enhanced to include past release notes and facilitate documentation of future updates.
If a date value includes only the month & year (no day), we save that value as a string in a column with the suffix 'month_year'. The value is also saved as a date-type value in a column with a _date suffix. (Example: Studies.start_month_year & Studies.start_date) We had been setting the day to the first day of the month in these date-type conversions. A user noted that the last day of the month is a preferred value. They noted: “these dates (start, completion, primary_completion) define when registration & results are due. A missing day value that defaults to the 1st of the month is the most restrictive and the last of the month is the most generous – for the purposes of compliance assessments” To be consistent, we made this change for all data elements that can provide just month/year.
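As a minimal sketch of the month/year conversion described above, the following helper parses a string such as "March 2017" and returns the last day of that month. The method name and the assumed "Month YYYY" input format are illustrative, not taken from the AACT code.

```ruby
require "date"

# Converts a month/year string (assumed format: "March 2017") to a Date
# set to the last day of that month.
def month_year_to_date(value)
  first_of_month = Date.strptime(value.strip, "%B %Y")
  # A day of -1 tells Date.new to count from the end of the month.
  Date.new(first_of_month.year, first_of_month.month, -1)
end
```

Using -1 as the day avoids a hand-written table of month lengths and handles leap years.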
While changing the data value for month_year data elements, we noticed that the date-type value for Outcomes.anticipated_posting_date was not being provided. We have added this column to the Outcomes table.
On February 9th, we decommissioned the AACT database hosted on Amazon Web Services and the AACT website hosted on Heroku.
The primary objective for this release is to move the website, database and related code to servers hosted by Duke University and DigitalOcean in order to provide users with a static IP address for the database and to reduce monthly costs for hosting platforms. Below is a more detailed list of changes.
The previous version of the AACT database was hosted on the Amazon Web Services (AWS) Relational Database Service (RDS); the AACT website and data processes were hosted on a Heroku server. As of January 22, 2018, the AACT public database will now be hosted on a DigitalOcean server and the website, supporting databases and all system software will reside on virtual Linux servers maintained by Duke University's Office of Information Technology.
Website and Data Processing Server:
Advantages of this configuration:
Static IP Address: The AACT database will have a static IP address which is needed by organizations that employ whitelists to secure their local area networks. (Some firewalls are configured to only allow data-traffic to/from certain IP-addresses.) While AWS users can set up static IP addresses for their virtual private networks, AWS does not provide a way to define a static IP address for a specific database instance. The lack of a static IP address was a significant problem for several users.
Support: The Duke University Office of Information Technology (OIT) has a team of highly qualified server administrators who use established practices and tested procedures to ensure upgrades and patches are applied and servers remain up-to-date and secure. A service agreement is in place to guarantee on-going support.
Positioned for Growth: We need to address performance issues as more people discover and query AACT. By using a third-party service like DigitalOcean, we can easily replicate the database server to distribute the load across machines. If organizations or individuals want a dedicated instance of the database because they need reliably fast response times or would like to enhance the database with custom views, triggers, procedures, etc., we can help stand up 'private' servers and have processes refresh them nightly so they remain updated along with the 'public' database.
Reduced Cost: We expect the new configuration to significantly reduce monthly overhead costs.
In the previous version of AACT, the public database was taken down each evening for about one hour to apply all the changes that had been made in ClinicalTrials.gov that day. Periodically, a full refresh of the database was conducted; this process took approximately 15 hours during which time the database was inaccessible. To minimize such downtime, the load process has been reconfigured so that a background database is updated while the publicly accessible AACT database remains available. When the process completes, the publicly accessible version of the database is restored (via pg_restore) which takes less than 5 minutes. This model also allows us to verify that the load process was successful before the public database is updated.
On August 30, 2017, the National Library of Medicine (NLM) began providing a new set of dates for each clinical trial via the ClinicalTrials.gov API. The Studies table in AACT has been adapted to include these new date-type data elements:
String-type data elements added:
NLM deprecated four date elements (displayed in the left column of the table below) and recommended that users start using the alternative date element (on the right). NLM wrote: "Some existing dates are now redundant. They will be kept for some time to provide an opportunity for users of the XML to update their systems before being removed at a later date, probably in 2018."
AACT continues to provide the deprecated data elements. They will continue to be available in AACT until NLM removes them from their API.
AACT has been upgraded to Ruby 2.4.0 & Rails 4.2.9 (Previously: Ruby 2.2.3 & Rails 22.214.171.124)
We now reboot the database before launching the full load to disconnect user connections. Previously, the full load would hang if active sessions were running, waiting for a quiet database before it would start.
A Use Case Gallery has been added to the AACT website.
References on the website to static copies of the AACT database are now called 'static database copies' instead of 'snapshots'. Using 'snapshots' to refer to static copies of the database was confusing because this term has always been used to refer to the annual set of visualizations that summarize (snapshot) the 'state of clinical trials'.
The database refresh failed when executing the final step that retrieved logging information from AWS. When it tried to read the log file error/postgresql.log.2017-03-08-20, AWS raised the error: This file contains binary data and should be downloaded instead of viewed. (Service: AmazonRDS; Status Code: 400; Error Code: InvalidParameterValue; Request ID: c3ff20fc-05a1-11e7-96d9-2dc5508b92a3) We now catch this error and skip over it.
We reviewed database activity to identify suspicious activity and created a preliminary instance of the AWS suppression list to block potential hackers.
The footer on each page of the AACT website includes: 'Read our Citation Policy here', but the actual link (https://www.ctti-clinicaltrials.org/briefing-room/citation-policy) was missing. This has been fixed.
A Public Announcement feature has been added to provide AACT administrators with the ability to dynamically publish temporary information on the AACT website. For example, when the database is temporarily down because it's being refreshed, we now notify users by posting a public announcement for the duration of the downtime.
A feature to interrogate AWS database log files has been added which saves information about database activity to an administrative table in AACT. We are now better able to monitor database use.
All administrative tables have been moved out of the public AACT database and into a separate database (aact_admin) which is accessible to AACT administrators only. Admin tables are:
CalculatedValue.has_us_facility was incorrectly set to false during incremental/nightly loads. This has been fixed.
The nightly incremental load was not finding all the added & changed trials from the ClinicalTrials.gov RSS feed. We now send 2 RSS calls to ClinicalTrials.gov to get them all. Also, if a call to the ClinicalTrials.gov API times out, it now tries 5 times before giving up.
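The retry behavior described above could be sketched along these lines. The helper name and its parameters are hypothetical; only the limit of 5 attempts comes from the note above.

```ruby
require "timeout"

# Runs the block, retrying when it times out, up to `attempts` tries in total.
# Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error,
# so timeouts raised by Net::HTTP calls are covered as well.
def with_retries(attempts: 5, wait_seconds: 0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue Timeout::Error
    raise if tries >= attempts # give up after the final attempt
    sleep(wait_seconds) if wait_seconds > 0
    retry
  end
end
```

A caller would wrap the API request in the block, for example `with_retries { fetch_study(nct_id) }`, where `fetch_study` is a placeholder for whatever method performs the request.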
The set of pipe-delimited files was not getting generated as expected because the process aborted when it tried to create an index on a non-existent column: Calculated_Values.sponsor_type. This problem has been fixed.
We have added a table to the Data Dictionary page to summarize all AACT database tables and provide their current row counts.
An enhancement has been made to the Data Dictionary page: the enumerations column in the table now displays the percentage for each element in the dropdown.
The Guide for Researchers now provides the effective date (January 18, 2017) for the NIH's recently published policy.
Mailgun was re-configured to belong to CTTI. It had previously been registered under StudyCo.
This release represents a significant upgrade that aims to make AACT easier to access and use. Since 2010, the AACT database has been published twice a year as a package that would be current as of a particular date: March 27 for the first annual installation, and September 27 for the second.
The package contained the content of ClinicalTrials.gov as 1) an Oracle database instance, 2) a set of SAS cport files & 3) a set of pipe-delimited files. It also included documentation in spreadsheets. Each package was made available to the public on the CTTI website. These packages remain available here. Until now, the use of AACT involved download/setup that required relatively sophisticated technical skills. AACT users also reported that the information was not current enough and the documentation difficult to use.
Until now, the code that generates the database has been proprietary and inaccessible to others who might want to replicate the process. The code used to create the AACT database and website is now publicly available on GitHub. In summary, we have rewritten AACT to make it easier to access and understand, and to encourage others to replicate and make use of any aspect of it.
The AACT database is immediately accessible in the cloud, eliminating the need for users to download and install the data.
Each month, a static copy of the AACT database is saved and made available for download. The database platform, Postgres, is a popular, free, open source platform that requires less technical know-how to set up than larger platforms such as Oracle.
The database schema has been simplified and employs consistent naming and design conventions.
Documentation has been moved from spreadsheets to this website, and provides instructions on how to access and use AACT with a variety of popular desktop applications including SAS, R, Tableau, and PostgreSQL tools.
A 'Calculated Values' table provides commonly-used, pre-computed values for each study such as total number of facilities and number of months to report results.
The public is free to download and recreate the full system or any part of it. All related code (Ruby on Rails) is available on GitHub. This includes the processes that pull data from ClinicalTrials.gov and populate the PostgreSQL database.
Providing the public with direct, queryable access to a database in the cloud is not a common model, and we have yet to determine how well it will serve hundreds or thousands of simultaneous users. However, AWS cloud services provide the most promising alternative for scalable solutions. Another notable challenge has been the time required (~15 hours) to load 220,000+ studies. With recent regulatory changes, it’s likely the amount of data in ClinicalTrials.gov will grow at a faster rate; therefore, CTTI continues to investigate ways to improve performance and reliability.
A beta version was released on October 1, 2016. Existing AACT users were asked to test the new version and their advice/suggestions were considered and implemented through the end of 2016. The official launch occurred January 31, 2017, just in time for the HHS ‘final rules’ to take effect.