2018 Data Science Day


Promoting Data Science across the University of Illinois

September 27, 2018
NCSA building

All day

Student Research Poster Session (see poster descriptions below)

NCSA 1040

8:00 am

Registration & Continental Breakfast

NCSA Lobby

8:30 am

Introduction: Robert J. Brunner

NCSA Auditorium

8:40 am

Keynote address: Bill Sanders, Interim Director, Discovery Partners Institute

NCSA Auditorium

9:00 am – 10:30 am

The Hesitant Data Scientist: Heidi Imker

Session 1: NCSA Auditorium

Identifying Problems at the Frontiers of Materials and Data Science: Elif Ertekin and Harley Johnson

Session 2: NCSA 1030


Break – Poster Session, students will be available to answer questions (see poster descriptions below)

10:45 am – 12:15 pm

The Hesitant Data Scientist: Heidi Imker

Session 1: NCSA Auditorium

Data Science and Genomics: Mikel Hernaez

Session 2: NCSA 1030

12:15 pm


1:00 pm

Keynote address: Provost Andreas Cangellaris

NCSA Auditorium

1:30 pm – 3:00 pm

Data Governance: Jana Diesner

NCSA Auditorium

3:00 pm

Break – Poster Session, students will be available to answer questions (see poster descriptions below)

3:15 pm – 5:00 pm

Community Data Science: Robert J. Brunner

NCSA Auditorium

5:00 pm




Title: Introducing the Sample Recycling Method: An Efficient Nested Simulation Procedure
Presenter: Haoen Cui
Stochastic modeling is commonly used by financial reporting actuaries to perform valuation and risk assessment based on Monte Carlo simulations. Nested stochastic modeling is required when the modeling components under each economic scenario are determined by other stochastic factors further in the future. Many existing techniques for speeding up nested simulations are based on reducing inner-loop calculations through curve fitting. Nonetheless, these techniques often require a large number of economic scenarios to develop sufficiently accurate functional relationships, which can itself be very costly. The proposed new technique (the sample recycling method) is based on an entirely different strategy: rather than approximating a functional relationship, it saves time by recycling a limited set of economic scenarios. In this preliminary study, we empirically explore the sample recycling method on the valuation of European and Asian options with simulated stock price paths under a geometric Brownian motion. Compared with the benchmark nested Monte Carlo approach, sample recycling demonstrates the potential to drastically reduce computing time by recycling inner sample paths, to control estimation error with appropriately chosen reference points, and to incorporate complex underlying stochastic processes via non-parametric density ratio estimation.
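
The recycling idea can be illustrated with a minimal sketch. It is not the presenter's implementation: it assumes a European call, a single reference point, and the closed-form lognormal density ratio of geometric Brownian motion in place of the non-parametric density ratio estimation mentioned in the abstract. All function names and parameter values are hypothetical. Inner samples are simulated once from a reference stock price and then reweighted to value the option at other outer-scenario prices.

```python
import math
import random


def simulate_inner_samples(s_ref, r, sigma, tau, n, seed=0):
    """Simulate n terminal prices under GBM, all starting from the reference price."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * tau
    vol = sigma * math.sqrt(tau)
    return [s_ref * math.exp(drift + vol * rng.gauss(0.0, 1.0)) for _ in range(n)]


def recycled_call_value(samples, s_ref, s_new, strike, r, sigma, tau):
    """Value a call at s_new by reweighting samples that were drawn from s_ref."""
    drift = (r - 0.5 * sigma ** 2) * tau
    var = sigma ** 2 * tau
    m_ref = math.log(s_ref) + drift   # mean of log-price given s_ref
    m_new = math.log(s_new) + drift   # mean of log-price given s_new
    total = 0.0
    for x in samples:
        y = math.log(x)
        # Density ratio f(x | s_new) / f(x | s_ref); the 1/x factors cancel.
        weight = math.exp(((y - m_ref) ** 2 - (y - m_new) ** 2) / (2.0 * var))
        total += weight * max(x - strike, 0.0)
    return math.exp(-r * tau) * total / len(samples)


def black_scholes_call(s, strike, r, sigma, tau):
    """Closed-form benchmark, used here only to check the sketch."""
    d1 = (math.log(s / strike) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return s * cdf(d1) - strike * math.exp(-r * tau) * cdf(d2)


# One set of inner samples (hypothetical parameters) reused for several outer prices.
samples = simulate_inner_samples(s_ref=100.0, r=0.03, sigma=0.2, tau=1.0, n=100_000)
for s in (95.0, 100.0, 105.0):
    estimate = recycled_call_value(samples, 100.0, s, 100.0, 0.03, 0.2, 1.0)
    print(s, round(estimate, 3), round(black_scholes_call(s, 100.0, 0.03, 0.2, 1.0), 3))
```

The weights become more variable as the outer price moves away from the reference point, which is one way to read the abstract's remark about choosing reference points appropriately.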

Title: Semantic Discovery via Singular Value Decomposition
Presenter: Kevin Grosman
A Singular Value Decomposition (SVD) is a way to express a matrix as a product of other, more useful, matrices. It allows you to create a “Low-Rank Approximation,” which eliminates less important information and highlights significant associations. One of the most common applications of SVD is Semantic Analysis, the process of relating syntactic structures to their language-independent meanings. By comparing certain words and documents, we can uncover semantic families, or families of words often used together. In this project, we looked at a less mainstream application of SVD: DNA sequences. By comparing different amino acid sequences (words) and genomes (documents) taken from flies and other organisms from GenBank, we investigate the relationships of proteins (resulting from gene expression) and genotypes/phenotypes by producing SVD-derived plots.
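
As a rough illustration of the low-rank approximation step, the sketch below uses NumPy on a small made-up count matrix. The matrix values and the function name `low_rank_approximation` are assumptions for illustration only; they are not data or code from the project.

```python
import numpy as np


def low_rank_approximation(matrix, rank):
    """Keep only the `rank` largest singular values and their singular vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Hypothetical count matrix: rows play the role of "words", columns of "documents".
counts = np.array([
    [2.0, 3.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 3.0],
    [0.0, 1.0, 3.0, 4.0],
])

approx = low_rank_approximation(counts, rank=2)
singular_values = np.linalg.svd(counts, compute_uv=False)

# The Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(counts - approx)
print(error, np.sqrt(np.sum(singular_values[2:] ** 2)))
```

Rows or columns that end up close together in the reduced space are the candidates for the "families" described above.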

Title: Tradeoffs between Safety and Time: A Routing View
Presenter: Ziwei Liu
This study proposes a data-driven combination of travel times, distance, and collision counts in urban mobility datasets, with the goal of quantifying how intertwined traffic accidents are in the road network of a city. Using a modification of the method of Lagrangian relaxation, we devise a novel routing algorithm to capture the tradeoff between travel time and accidents. We apply this to travel time and accident datasets derived from publicly available New York City taxi data. By visualizing the results of this computation in a scale-free way, we provide a comparative tool for urban traffic behavior.
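
The time-versus-accidents tradeoff can be sketched in its simplest scalarized form: run a shortest-path search on the combined edge cost time + λ × accidents and vary the multiplier λ. This is only the basic idea behind Lagrangian relaxation, not the modified method the poster describes, and the four-node graph below is entirely hypothetical.

```python
import heapq


def shortest_path(graph, source, target, lam):
    """Dijkstra on the combined cost time + lam * accidents.

    graph maps node -> list of (neighbor, travel_time, accident_count).
    """
    best = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, time, accidents in graph.get(node, []):
            new_cost = cost + time + lam * accidents
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def route_totals(graph, path):
    """Total travel time and accident count along a path."""
    total_time = 0.0
    total_accidents = 0.0
    for a, b in zip(path, path[1:]):
        time, accidents = next((t, c) for n, t, c in graph[a] if n == b)
        total_time += time
        total_accidents += accidents
    return total_time, total_accidents


# Hypothetical network: a fast route through B and a safer route through C.
graph = {
    "A": [("B", 2.0, 5.0), ("C", 4.0, 0.0)],
    "B": [("D", 2.0, 5.0)],
    "C": [("D", 4.0, 1.0)],
}

for lam in (0.0, 1.0):
    path = shortest_path(graph, "A", "D", lam)
    print(lam, path, route_totals(graph, path))
```

Sweeping λ from zero upward traces out the routes that trade extra travel time for fewer recorded collisions.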

Title: Assessing the National Transportation Library’s ROSA P using Data Visualization
Presenter: Aileen Nolan
ROSA P, the Repository & Open Science Access Portal, is the designated institutional repository for research funded by the US Department of Transportation under the USDOT Public Access Plan. It includes full-text electronic publications, datasets, and other resources that are provided freely to transportation researchers, statistical organizations, the media, and the general public. In this project, data on ROSA P usage was visualized using Tableau software, with summaries of key metrics such as aggregated user location, download counts, and item resource type. The visualization provided a prototype of a tool that could be used for internal and external library assessment.

Title: Database‐Driven Materials Selection for Semiconductor Heterojunction Design
Presenter: Ethan Shapera
Heterojunctions are at the heart of many modern semiconductor devices with tremendous societal impact: Light‐emitting diodes shape the future of energy‐efficient lighting, solar cells are promising for renewable energy, and photoelectrochemistry seeks to optimize efficiency of the water‐splitting reaction. Design of heterojunctions is difficult due to the limited number of materials for which band alignment is known, and the experimental and computational difficulties associated with obtaining this data. Band alignment based on branch‐point energies (EBP) is shown to be a good and efficient approximation that can be obtained using data from existing electronic‐structure databases. Errors associated with this approach are comparable to those of expensive first‐principles computational techniques and experiments. EBP alignment is then incorporated into a framework capable of rapidly screening existing online databases to design semiconductor heterojunctions. The method is showcased for five different prototype cases: Transport layers are successfully predicted for CdSe‐ and InP‐based LEDs, and for CH3NH3PbI3‐ and nanoparticle PbS‐based solar absorbers. In addition, Cu2O as a possible hole‐transport layer for solar cells is examined. The framework addresses the challenge of accomplishing fast materials selection for heterostructure design by tying together first‐principles calculations and existing online materials databases.

Title: Neural Network based Approach for temperature control of Heat Exchanger
Presenter: Sreenath Sundar
This poster presents two approaches for temperature control at a heat exchanger’s exit. One approach is based on an n-step ahead neural network, while the other combines a control module with a learning module.

Designing robust controllers for typical heat exchangers is very challenging, primarily because of the complexities associated with synthesizing the dynamics of industrial heat exchangers. Even if a dynamics model is synthesized, its accuracy would be called into question, as there can be a lot of unmodeled system dynamics and random noise in the input parameters. In many situations, the degree of uncertainty in the model of the system being controlled limits the utility of optimal control design. A controller can be manually tuned in the field, but it is very difficult to determine manual adjustments that result in overall improvement. These practical challenges call for learning-based approaches to temperature control that would result in optimal heat exchanger performance.

We use the existing mathematical model of an air-water system synthesized by Underwood and Crawford to test our algorithms. Underwood and Crawford developed a model of a heating coil by fitting a set of second-order non-linear equations to measurements of air and water temperatures and flow rates obtained from an actual coil. The uncertainties in the model include the air and water inlet temperatures and the air flow rate. These uncertainties are modeled as random variables sampled from a uniform distribution. The control variable is the water flow rate, which is adjusted to regulate the air-outlet temperature.

Neural networks are frequently used in the literature for learning time series. We propose a modified neural network, the n-step ahead neural network, that takes the state variables and the output at time t as input and predicts the output at time t+n. The variable n is a hyper-parameter that can be optimized based on performance constraints. The n-step ahead neural network was trained to high training accuracy on data generated from the experimental model. The trained model performed well in its objective of regulating the outlet temperature of a heat exchanger.
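
The input/target pairing described above can be sketched as follows. This shows only how training examples for an n-step ahead predictor might be assembled from a recorded time series; the function name and the sample numbers are hypothetical, and the network itself is not shown.

```python
def make_n_step_pairs(states, outputs, n):
    """Pair (state variables at t, output at t) with the output at t + n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(states) != len(outputs):
        raise ValueError("states and outputs must have the same length")
    inputs, targets = [], []
    for t in range(len(outputs) - n):
        inputs.append(list(states[t]) + [outputs[t]])
        targets.append(outputs[t + n])
    return inputs, targets


# Hypothetical series: one state variable per time step and one measured output.
states = [[0.1], [0.2], [0.3], [0.4], [0.5]]
outputs = [10.0, 11.0, 12.0, 13.0, 14.0]

inputs, targets = make_n_step_pairs(states, outputs, n=2)
print(inputs)
print(targets)
```

Larger values of n leave fewer usable pairs from the same record, which is one practical consideration when tuning this hyper-parameter.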

The second approach combines a control module and a learning module for temperature regulation. The control module is a proportional controller, while the learning module is a steady-state neural network. The steady-state neural network learns the controller output as a function of the random variables and the temperature set-point. The proportional controller then compensates for the transient behavior of the system linearized around various operating points. The algorithm works effectively for temperature regulation because the neural network captures the non-linear input-output behavior of the closed-loop system, leaving the proportional controller to compensate for the steady-state errors of a linearized plant.
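
The structure of this combined scheme, a learned steady-state term plus a proportional correction, can be sketched on a toy example. Everything here is an assumption for illustration: the first-order linear plant, the gains, and the use of a deliberately mismatched constant gain as a stand-in for the steady-state neural network. It is not the Underwood and Crawford coil model or the poster's controller.

```python
def simulate(setpoint, kp, steps=200, true_gain=2.0, model_gain=2.5, rate=0.2):
    """Run a toy first-order plant under feedforward plus proportional control.

    The feedforward term stands in for the learned steady-state map; it uses
    model_gain, which deliberately differs from the plant's true_gain.
    """
    y = 0.0
    u_feedforward = setpoint / model_gain
    for _ in range(steps):
        u = u_feedforward + kp * (setpoint - y)   # learned term + proportional term
        y = y + rate * (true_gain * u - y)        # toy plant update
    return y


# With an imperfect feedforward term alone, the output settles away from the set-point.
print(simulate(setpoint=10.0, kp=0.0))
# Adding the proportional term shrinks the remaining error.
print(simulate(setpoint=10.0, kp=1.0))
```

In this toy setting the proportional term reduces, but does not remove, the offset left by the imperfect feedforward term.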

The performance of the proposed algorithms was evaluated against that of a PI controller. Further, the robustness of the algorithms was studied in great detail under small perturbations around their tuned parameters.