Text Analysis

A guide to text mining tools and methods

TDM Studio by ProQuest

TDM Studio, by ProQuest, is a text and data mining solution for research, teaching and learning.

Proficiency in R or Python programming languages is useful but not necessary for text and data mining with TDM Studio. 

Anyone with a valid UPenn email address can access a workbench by logging in. By default, each workbench can support 1-5 users. 

  • Accessing UPenn ProQuest TDM
    1. Anyone with a valid UPenn email address can request access to a workbench. To request one, please email the Research Data & Digital Scholarship team or your subject specialist with the names and email addresses of everyone who would like to access the same workbench. You can always add new users later. By default, each workbench can support 1-5 users.
    2. Once your workbench is created, you will receive a confirmation email that includes instructions on how to get started. To access your account or reset your password, visit https://tdmstudio.proquest.com/home, enter the email address that you initially provided, and select “Forgot Your Password”. You will be sent a link that you can use to set your password.
    3. A one-page quick start guide can be found at ProQuest LibGuides Quick Start
  • Create Dataset
    1. You can create a maximum of 10 datasets, each containing up to 2,000,000 documents. You can search by specific Publication Titles or within licensed ProQuest databases, including ProQuest Historical Newspapers.
    2. Once you have clicked on “Create Dataset”, you are returned to your dashboard, where the dataset you just defined is displayed with the status “Queued”. TDM Studio processes roughly 100,000 documents an hour. Once your dataset is complete, its status changes to “Completed”.
    3. A one-page dataset guide can be found at ProQuest LibGuides Dataset Creation
  • Analyze Dataset
    1. When you're ready to use a dataset, turn on the Jupyter Notebook environment. It can take up to 10 minutes for this process to complete. We recommend starting with the Start Here.ipynb file, which will help you visually select and transfer the datasets you would like to use for analysis.
  • Python / R Scripts / Templates
    1. Once you have opened the Jupyter Notebook environment, you will find detailed user manuals, tips on using the workbench, and a collection of sample code to get you started. We recommend starting with the Start Here.ipynb file or the folder named ProQuest TDM Studio Samples.
    2. You can also upload your own scripts to the Jupyter notebook environment for data processing and analysis.
  • Export Data for further use / analysis
    1. You can export any derived data, as well as scripts, tables and visualizations. Due to copyright restrictions, you cannot export full text or any consumptive information that could be used to reconstruct the full text. The current export limit is 15 MB per week.
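
As an illustration of the kind of derived data that can be exported, the following is a minimal Python sketch that builds a word-frequency table and checks the resulting file against the weekly export limit. It assumes the document texts are already available as plain strings; how they are loaded inside the workbench is not shown here. The function names are hypothetical and are not part of TDM Studio, and reading "15 MB" as 15 × 1024 × 1024 bytes is also an assumption.

```python
import csv
import re
from collections import Counter
from pathlib import Path

# Assumption: the guide's "15 MB per week" is read as 15 * 1024 * 1024 bytes.
EXPORT_LIMIT_BYTES = 15 * 1024 * 1024


def word_counts(texts):
    """Count lowercase word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def write_counts_csv(counts, out_path, top_n=1000):
    """Write the top_n most frequent words to a CSV file and return its size in bytes."""
    out_path = Path(out_path)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common(top_n))
    return out_path.stat().st_size


def fits_export_limit(size_bytes, limit_bytes=EXPORT_LIMIT_BYTES):
    """Return True if a file of size_bytes is within the assumed export limit."""
    return size_bytes <= limit_bytes
```

A frequency table like this contains only aggregate counts, not the full text. Whether a given output is acceptable for export is still decided by ProQuest's review, so treat the size check as a convenience only.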

Getting Started

  1. Go to ProQuest TDM Studio (https://tdmstudio.proquest.com/home)

  2. Select “Create Account” in the top right corner

 

  3. After you enter your UPenn email address, the system should automatically select “University of Pennsylvania Libraries” as your institution. Create your password, read and accept ProQuest TDM Studio’s Privacy Policy and Terms and Conditions, and click “Create Account”.

  4. A confirmation email should be sent to your UPenn inbox. Continue by clicking “Verify Email and Log In” to verify your email address. Then click “Log in to TDM Studio” and enter your email address and password.

Creating a Dataset: A tutorial by ProQuest

You can create your own dataset by first generating a new Workbench Dashboard at https://tdmstudio.proquest.com/workbenchdashboard.

From the Workbench, select the “Create New Dataset” button to get started. You will then be able to choose between two options in a drop-down menu:

  • “Publication Titles” allows you to limit your search to individual publication titles such as The New York Times or The Washington Post.

  • “ProQuest Databases” allows you to limit your search to individual ProQuest databases.

 

If there is a specific title you would like to include in your dataset, please use the search box in the upper right-hand corner to filter those names.

 

ProQuest has some tips to help you select the correct content. 

  • Sometimes, there will be multiple entries for the same publication title. The “Source Type” column will help you select the right one. 

  • You can use the “Full Text” column on the far right to determine whether your selected publication contains full text or not. 

  • Make sure that the publications that you select cover the period that you want. Some publications are split between historical and current versions, so it may be necessary to select different or multiple publications depending on the time span you want to be covered. 

  • If you are selecting multiple publications of the same name (their current and historical versions), try to generate your dataset starting from the most recent publication and going back chronologically. 

  • ProQuest offers an online module, “What content is available?”, that covers content selection in more detail.

 

There is no maximum number of publications that you can include in the dataset you create, but you can monitor the number of selected publications at the bottom of the page.  Once you have selected all the publications, click the “Next: Refine Content” button to proceed to the next step.

Refining Results

Refining your results is an important step since a dataset created by ProQuest TDM Studio can only contain up to 2 million records. 

You can use Boolean operators (such as AND, OR, and NOT) in the search box to control the search results. You can also refine by date published, source type, and document type.
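
To make the Boolean approach concrete, here is a small Python sketch that assembles a query string you could paste into the search box. The helper names `quote_term` and `build_query` are hypothetical and not part of TDM Studio. The sketch assumes that multi-word phrases are wrapped in double quotes and that parentheses group the OR terms; check ProQuest's search tips for the exact syntax and operator precedence.

```python
def quote_term(term):
    """Wrap multi-word phrases in double quotes so they are searched as a unit."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_query(all_of=(), any_of=(), none_of=()):
    """Combine terms into one Boolean query string.

    all_of  -- terms joined with AND
    any_of  -- terms joined with OR, grouped in parentheses
    none_of -- terms excluded with NOT
    """
    parts = []
    if all_of:
        parts.append(" AND ".join(quote_term(t) for t in all_of))
    if any_of:
        parts.append("(" + " OR ".join(quote_term(t) for t in any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {quote_term(term)}"
    return query.strip()


print(build_query(all_of=["climate change"],
                  any_of=["flood", "drought"],
                  none_of=["sports"]))
# prints: "climate change" AND (flood OR drought) NOT sports
```

Building the string in code is mainly useful for keeping a record of the exact query behind each dataset, which helps when you need to recreate or document it later.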

ProQuest offers the module “Best practices on searching ProQuest content by using search mnemonics and search tips”, which covers refining results in more detail.

When you are satisfied with your selections, click the “Next: Review Dataset” button on the bottom right to continue.

Dataset Processing

Next, name the dataset before creating it. You can also add a description that will help you identify this dataset in your workbench once it is processed. Then click “Create Dataset” at the bottom. A pop-up window should confirm that your dataset is being created.

Please note that the dataset takes some time to process. Depending on its size, processing can take anywhere from an hour to just under a day.
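
As a rough worked example, the processing rate and dataset size limit quoted earlier in this guide (100,000 documents an hour and 2,000,000 documents per dataset) can be turned into a simple estimate. The Python sketch below assumes a constant rate and ignores any time the dataset spends in the queue, so treat the result as a ballpark figure only. The function name `estimated_hours` is hypothetical.

```python
DOCS_PER_HOUR = 100_000           # processing rate quoted in this guide (approximate)
MAX_DOCS_PER_DATASET = 2_000_000  # dataset size limit quoted in this guide


def estimated_hours(num_documents, docs_per_hour=DOCS_PER_HOUR):
    """Estimate processing time in hours, assuming a constant processing rate."""
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    if num_documents > MAX_DOCS_PER_DATASET:
        raise ValueError("dataset exceeds the 2,000,000-document limit")
    return num_documents / docs_per_hour


print(estimated_hours(250_000))    # 2.5
print(estimated_hours(2_000_000))  # 20.0
```

Under these assumptions, a dataset at the maximum size takes about 20 hours, which is consistent with the "just under a day" upper bound.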

Closing this pop-up window will bring you back to the dashboard where you can see the dataset being queued for processing. Check back in a few hours to see if your dataset is ready!

Further Resources

Analyzing and Visualizing Text with Constellate and ProQuest TDM Studio: A guide introducing text analysis as a research method and a demonstration of Constellate and ProQuest TDM Studio. The guide contains slides, a recorded workshop, and instructions for using the platforms. 
