Topic: Advanced High-Dimensional Data Analysis
Time: Sunday, June 26, 2016, 9:00-17:00
Venue: Shahe Campus, Building 4, Room 107
9:00 - 10:30 Professor Ping Ma, University of Georgia
Title: Linear models, nonparametric models and algorithms
The main objective of this course is to give students adequate knowledge and tools to build statistical models and design computational methods for big data arising in real life. The emphasis is on applications, together with an adequate understanding of some of the theoretical aspects.
PRIOR KNOWLEDGE:
General statistical sophistication and a solid understanding of Algorithms, Linear Algebra, and Probability Theory or equivalent.
Primary references:
1. A good general overview of the topic: Mahoney, "Randomized Algorithms for Matrices and Data," FnTML 2011. (http://arxiv.org/abs/1104.5557)
2. A recent work on linear models: Ma, P., Mahoney, M. W., and Yu, B. (2015) A statistical perspective on algorithmic leveraging, Journal of Machine Learning Research, 16(Apr):861-911. (http://www.jmlr.org/papers/volume16/ma15a/ma15a.pdf)
3. A recent work on nonparametric models: Ma, P., Huang, J. Z., and Zhang, N. (2015) Efficient computation of smoothing splines via adaptive basis sampling, Biometrika, 102(3):631-645. (http://malab.uga.edu/wp-content/uploads/2015/07/MaZhangHuang2015.pdf)
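As a rough illustration of the algorithmic leveraging idea studied in reference 2, the sketch below samples rows of a least-squares problem with probability proportional to their leverage scores and then solves the rescaled subproblem. It is a minimal sketch under stated assumptions, not the authors' code: the function names `leverage_scores` and `leveraged_least_squares`, the subsample size `r`, and the simulated data are all hypothetical choices made for this example.

```python
import numpy as np


def leverage_scores(X):
    """Return the leverage score of each row of X (diagonal of the hat matrix)."""
    # For a full-column-rank X, the hat matrix equals Q Q^T with Q from a
    # reduced QR factorization, so each score is a squared row norm of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)


def leveraged_least_squares(X, y, r, rng):
    """Solve least squares on r rows sampled by leverage (hypothetical helper)."""
    n = X.shape[0]
    scores = leverage_scores(X)
    probs = scores / scores.sum()
    # Sample r row indices with replacement, proportionally to leverage.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Rescale each sampled row by 1 / sqrt(r * p_i) before solving.
    weights = 1.0 / np.sqrt(r * probs[idx])
    beta, *_ = np.linalg.lstsq(X[idx] * weights[:, None], y[idx] * weights, rcond=None)
    return beta


# Simulated data, assumed only for this illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(2000)
print(leveraged_least_squares(X, y, r=200, rng=rng))
```

Computing exact leverage scores costs about as much as solving the full problem, so this version is useful mainly for seeing how the sampling and rescaling fit together.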
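10:40 - 12:00 Tao Hu (胡涛), Capital Normal University
Title: Interval-censored data: from univariate to bivariate
Abstract: Interval-censored data arise in many fields, including biology, medicine, sociology, and reliability engineering. With interval-censored data, the failure time of interest is never observed exactly; it is only known to fall within an interval. Such data are common in medical studies with periodic follow-up or examination, such as clinical trials. In these studies the event (for example, the onset of a disease) is only known to have occurred between two clinic visits or examinations, so investigators obtain only interval-censored data on the event time. When a survival study involves two failure times of interest that may be correlated, and only interval-censored observations are available for both, the data are called bivariate interval-censored failure time data. This talk introduces the definition of interval-censored data, basic methods for their statistical analysis, and some recent results.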
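To make the univariate setting concrete, the following sketch fits a single failure-time model when each failure time is only known to lie in an interval (left, right]. It is an illustration under assumptions that are not part of the talk: an exponential failure-time model, equally spaced examinations, and the hypothetical helper names `interval_loglik`, `fit_rate`, and `simulate_visits`. Each observation contributes the probability that the failure time falls in its interval, S(left) - S(right), to the likelihood.

```python
import math
import random


def interval_loglik(rate, intervals):
    """Log-likelihood of an exponential rate given (left, right] intervals."""
    total = 0.0
    for left, right in intervals:
        # P(left < T <= right) = exp(-rate*left) - exp(-rate*right),
        # written in a numerically stable form.
        total += -rate * left + math.log1p(-math.exp(-rate * (right - left)))
    return total


def fit_rate(intervals, lo=1e-3, hi=5.0, tol=1e-8):
    """Maximize the concave log-likelihood on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if interval_loglik(c, intervals) > interval_loglik(d, intervals):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2.0


def simulate_visits(n, rate, gap, rng):
    """Hypothetical study: every subject is examined once per `gap` time units."""
    data = []
    for _ in range(n):
        t = rng.expovariate(rate)
        left = math.floor(t / gap) * gap
        data.append((left, left + gap))  # failure seen only between two visits
    return data


rng = random.Random(0)
data = simulate_visits(n=2000, rate=0.5, gap=1.0, rng=rng)
print(fit_rate(data))
```

The sketch assumes every interval has a finite right endpoint and positive width; right-censored or exactly observed times would need their own likelihood terms.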
14:00 - 15:30 Professor Wenxuan Zhong, University of Georgia
Title: Correlation pursuit for high dimensional data
In this talk, I will introduce COP, a stepwise variable selection procedure under the sufficient dimension reduction framework, in which the response variable is influenced by the predictors through an unknown function of a few linear combinations of them. Unlike linear stepwise regression, COP does not impose a special form of relationship (such as linear) between the response variable and the predictor variables. The COP procedure selects variables that attain the maximum correlation between the transformed response and the linear combination of the variables. Various asymptotic properties of the COP procedure are established and, in particular, its variable selection performance under a diverging number of predictors and sample size is investigated. The excellent empirical performance of the COP procedure in comparison with existing methods is demonstrated by examples.
Short Bio
Dr. Wenxuan Zhong is an associate professor of statistics at the University of Georgia, where she co-directs a big data analytics lab. She is also the founding director of the Big Data Analytics Research Initiative. Dr. Zhong received a B.S. in statistics from Nankai University (China) and a Ph.D. in statistics from Purdue University. She then worked as a postdoctoral fellow in Professor Jun Liu's lab in the Department of Statistics and the FAS Center for Systems Biology at Harvard University. Her research focuses on developing novel statistical theory and methods to overcome challenges arising from bioinformatics research and the big data regime. Over the past few years, she has gradually established a diverse, extramurally funded research program that develops novel statistical methods for data collected in genomic, epigenetic, and metagenomic research and addresses the computational and theoretical challenges of big data analysis. Dr. Zhong has published multiple articles in high-impact statistics and bioinformatics journals. Her research has been highlighted by The University of Georgia Columns and by the College of Liberal Arts & Sciences at the University of Illinois at Urbana-Champaign.
15:40 - 17:00 Feng Li, Central University of Finance and Economics
Title: Bayesian Modeling Tail-Dependence of Stock Returns and News Sentiment with Copulas
Tail-dependence modeling based on copulas with flexible marginal distributions is widely used for financial time series. Most available copula approaches for estimating tail dependence are restricted to certain types of bivariate copulas because of computational complexity. We propose a general Bayesian approach for jointly modeling high-dimensional tail dependence for financial returns and related news information. Our method allows for variable selection among the keywords in news in the copula tail-dependence parameters. We apply an efficient sampling technique in the posterior inference, where the likelihood function is estimated from a random subset of the data, resulting in substantially fewer density evaluations during MCMC.
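The idea of estimating the likelihood from a random subset of the data can be sketched as follows. This is not the speaker's method: it uses a Gaussian density as a stand-in for the copula model, and the names `log_density`, `full_loglik`, and `subsampled_loglik` and the subsample size `m` are hypothetical. The estimator rescales the sum of log-densities over a simple random sample by n/m, which makes it unbiased for the full-data log-likelihood while evaluating only m densities.

```python
import math
import random


def log_density(y, mu, sigma):
    """Gaussian log-density, a stand-in for the model's per-observation term."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def full_loglik(data, mu, sigma):
    """Exact log-likelihood: one density evaluation per observation."""
    return sum(log_density(y, mu, sigma) for y in data)


def subsampled_loglik(data, mu, sigma, m, rng):
    """Estimate the full log-likelihood from m randomly chosen observations."""
    n = len(data)
    idx = rng.sample(range(n), m)  # simple random sample without replacement
    # Scale the subsample sum by n / m so its expectation equals the full sum.
    return (n / m) * sum(log_density(data[i], mu, sigma) for i in idx)


# Simulated data, assumed only for this illustration.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(full_loglik(data, 0.0, 1.0))
print(subsampled_loglik(data, 0.0, 1.0, m=500, rng=rng))
```

Inside an MCMC sampler this estimate would replace the exact log-likelihood in each acceptance step. Practical subsampling schemes typically add variance reduction and a bias correction, both of which this sketch omits.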