Human and Machine Learning

Due 5/3/2018

I'm hoping that folks in the class would like
to do a final project. I will describe a couple of options in
class, and some of your colleagues may propose projects as
well. As an alternative to a final project, I give you the
option of doing a larger homework assignment, which
is laid out below. This homework assignment is a bit more open-ended
than the previous assignments, so you have an opportunity
to be creative and explore.

If you are concerned with grades or making up for missed work, we will look most favorably on an ambitious final project (Option 1), then next most favorably on Option 2 if you go above and beyond the bare minimum.

I described two possible projects in class.
One project involves using Bayesian optimization to discover an
optimal training policy for simulated students. These lecture
notes describe the project. The other project
involves predicting human forgetting over time. Here are slides that describe
the forgetting project, and here are
a few pages of a paper draft. I also encourage you
to devise your own project, hopefully a project that you can tie
into your own research interests. We'll spend some time in class
on April 17 discussing possible projects.

For Parts 1a and 1b, you will explore Bayesian optimization
using Gaussian processes. I would like you to get first-hand
experience setting priors over hyperparameters, optimizing the
hyperparameters, iteratively running experiments, and
observing the convergence to a global optimum. You are welcome
to write your own software, but you can also use existing
software packages. Here are a few packages I know about:
MOE, Spearmint, BayesOpt, PyBO.

For Parts 2a-c, you will explore time-series models.

Write code to implement a function over
which we will optimize. This function takes an input vector
and outputs a function value. I'd suggest using a
2-dimensional input space so that you can easily
visualize the function as well as the progress of Bayesian
optimization. For example, your function might be
based on a mixture of Gaussians added to a linear function of
the inputs. Make the function complex enough that there are
some local optima and that the global optimum isn't in one
corner of the space. I'd like your function to add noise to
the returned value, e.g., zero-mean Gaussian noise. Make a
plot of the expected value of the function.
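For concreteness, here is one way such a function might look in Python: a few Gaussian bumps added to a weak linear trend, with zero-mean Gaussian noise on the returned value. All of the constants (bump locations, widths, weights, the noise SD, and the [0,10] x [0,10] input box) are arbitrary illustrative choices, not requirements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary bump centers, widths, and weights; the negative weight carves out a
# dip so the surface has local optima, and the strongest bump sits at (7, 7),
# away from the corners of the input box.
CENTERS = np.array([[2.0, 3.0], [7.0, 7.0], [4.0, 8.0]])
WIDTHS = np.array([1.0, 1.5, 0.8])
WEIGHTS = np.array([3.0, 5.0, -2.0])
NOISE_SD = 0.2

def expected_value(x):
    """Noise-free value of the objective at a 2-D point x."""
    x = np.asarray(x, dtype=float)
    bumps = WEIGHTS * np.exp(-np.sum((CENTERS - x) ** 2, axis=1) / (2 * WIDTHS ** 2))
    return bumps.sum() + 0.1 * x[0] - 0.05 * x[1]   # mild linear trend

def objective(x):
    """Noisy observation handed back to the optimizer."""
    return expected_value(x) + np.random.normal(0.0, NOISE_SD)

# Contour plot of the expected value over the input box.
grid = np.linspace(0.0, 10.0, 200)
Z = np.array([[expected_value([x1, x2]) for x1 in grid] for x2 in grid])
plt.contourf(grid, grid, Z, levels=30)
plt.colorbar(label="expected function value")
plt.xlabel("x1"); plt.ylabel("x2")
plt.title("Expected value of the noisy test function")
plt.show()
```

With these particular weights the global maximum sits near (7, 7) and a local maximum near (2, 3), neither of them in a corner of the box.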

Run Bayesian optimization with queries
being answered by a call to your function. For the Bayesian
optimization, do not build in any knowledge you have about the
solution, e.g., allow the optimization procedure to estimate
the function mean and observation noise variance.
Describe the assumptions you made in your model and
report on the outcome of experiments, e.g., report how close
the estimated optimum is to the true optimum as more
experiments are performed. To get a reliable estimate of
convergence, you'll need to run the optimization process
multiple times.
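To make the workflow concrete, here is a rough sketch of one optimization run, written around scikit-learn's GaussianProcessRegressor (my choice for illustration; it is not one of the packages listed above, and any of them would serve just as well). It reuses the objective function from the previous sketch, refits the kernel hyperparameters and the noise level at every iteration by maximizing the marginal likelihood, and picks each query by maximizing expected improvement over a set of random candidate points.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

BOUNDS = np.array([[0.0, 10.0], [0.0, 10.0]])   # same 2-D box as the test function

def expected_improvement(mu, sigma, best_y):
    """EI for maximization, given the GP posterior mean/SD and best value so far."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt_run(objective, n_iters=30, n_init=5, kernel=None,
                  acquisition=expected_improvement, seed=None):
    rng = np.random.default_rng(seed)
    if kernel is None:
        # Matern covariance plus a WhiteKernel so the observation noise is estimated.
        kernel = (ConstantKernel(1.0) * Matern(length_scale=[1.0, 1.0], nu=2.5)
                  + WhiteKernel(noise_level=0.1))
    # A handful of random initial queries.
    X = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(n_init, 2))
    y = np.array([objective(x) for x in X])
    for _ in range(n_iters):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                      n_restarts_optimizer=5)
        gp.fit(X, y)   # hyperparameters refit by marginal-likelihood maximization
        # Crude acquisition maximization: score a dense set of random candidates.
        cand = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(2000, 2))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(acquisition(mu, sigma, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    return X, y

# Repeat the whole process with different seeds to get a reliable convergence estimate.
for seed in range(5):
    X, y = bayes_opt_run(objective, seed=seed)
    print(f"run {seed}: best observed value {y.max():.3f} at x = {X[np.argmax(y)]}")
```

Plotting the best value found so far against the number of queries, averaged over seeds, and comparing the best query point to the true optimum of expected_value gives the convergence measure described above.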

Test at least two different GP covariance functions, and run the optimization process enough times that you can determine which covariance function converges most rapidly. Summarize your results in graphs and a few sentences.
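For example (reusing bayes_opt_run and the objective from the sketches above), one illustrative pair of covariance functions is a squared-exponential (RBF) kernel against a rougher Matern 3/2; these are my choices, not required ones.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel, WhiteKernel

kernels = {
    "RBF":        ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0]) + WhiteKernel(0.1),
    "Matern 3/2": ConstantKernel(1.0) * Matern(length_scale=[1.0, 1.0], nu=1.5) + WhiteKernel(0.1),
}
for name, k in kernels.items():
    best = [bayes_opt_run(objective, kernel=k, seed=s)[1].max() for s in range(10)]
    print(f"{name}: mean best value over 10 runs = {np.mean(best):.3f}")
```

Plotting best-so-far curves per kernel, rather than only the final values, is what actually shows which covariance converges faster.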

Test at least two different acquisition functions (i.e., schemes for performing active selection), and run the optimization procedure enough times that you can determine which acquisition function is more effective. Summarize your results in graphs and a few sentences.
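Along the same lines, one possible comparison is expected improvement (defined in the sketch above) against an upper confidence bound; the exploration weight kappa = 2.0 below is an arbitrary choice.

```python
import numpy as np

def ucb(mu, sigma, best_y, kappa=2.0):
    # best_y is unused; it is kept so UCB and EI share the same signature.
    return mu + kappa * sigma

for name, acq in [("EI", expected_improvement), ("UCB", ucb)]:
    best = [bayes_opt_run(objective, acquisition=acq, seed=s)[1].max() for s in range(10)]
    print(f"{name}: mean best value over 10 runs = {np.mean(best):.3f}")
```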

Generate a time series using a special case
of a linear dynamical system that is similar to an
autoregressive model in the time series literature. The hidden
state of the dynamical system will be a vector with 2
elements, which evolve over time according to:

y_{2}(t) = y_{1}(t-1) + μ_{1}

y_{1}(t) = α_{1} y_{1}(t-1) + α_{2} y_{2}(t-1) + μ_{2}

The observation will be a scalar:

z(t) = y_{1}(t) + μ_{3}

where for i={1,2,3}, μ_{i} ~ Normal(0, σ^{2}), and you should pick
{α_{1}, α_{2}, σ} to obtain interesting dynamics.

Plot a couple hundred points that are representative of this process.
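Here is a short sketch of how the process might be simulated; the parameter values α_{1}=1.6, α_{2}=-0.8, σ=0.1 are just one arbitrary choice that happens to give damped oscillatory dynamics.

```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_lds(T=200, alpha1=1.6, alpha2=-0.8, sigma=0.1, rng=None):
    rng = rng or np.random.default_rng()
    y = np.zeros((T, 2))   # hidden state [y1, y2]
    z = np.zeros(T)        # scalar observations
    for t in range(1, T):
        mu = rng.normal(0.0, sigma, size=3)
        y[t, 1] = y[t - 1, 0] + mu[0]                                   # y2(t)
        y[t, 0] = alpha1 * y[t - 1, 0] + alpha2 * y[t - 1, 1] + mu[1]   # y1(t)
        z[t] = y[t, 0] + mu[2]                                          # z(t)
    return y, z

y, z = simulate_lds()
plt.plot(z)
plt.xlabel("t"); plt.ylabel("z(t)")
plt.title("Sample path of the linear dynamical system")
plt.show()
```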

Turn your model into a switched linear dynamical system with 3 modes.
Each mode has an associated set of {α_{1}, α_{2}} parameters.

Switching is memoryless, i.e., at each time step, with some small probability β, the mode m is re-drawn from the set {1,2,3} with uniform probability.

Draw a graphical model that depicts this switched dynamical system. Show the conditional distributions on each arc, e.g., write out the transition table for P(m(t) | m(t-1)).

Generate a nice graph illustrating the switching dynamics.
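Below is one sketch of how the simulator might be extended, with three arbitrary illustrative modes and switch probability β = 0.02. Note that under the memoryless rule above, P(m(t)=j | m(t-1)=i) works out to 1 - β + β/3 when j = i and β/3 otherwise, which is the transition table your graphical model should encode.

```python
import numpy as np
import matplotlib.pyplot as plt

# Three arbitrary (alpha1, alpha2) pairs, one per mode, and a small switch probability.
MODES = [(1.6, -0.8),   # damped oscillation
         (0.95, 0.0),   # slow decay toward zero
         (-0.5, 0.3)]   # fast sign-flipping
BETA = 0.02

def simulate_switched_lds(T=500, sigma=0.1, rng=None):
    rng = rng or np.random.default_rng()
    y = np.zeros((T, 2))
    z = np.zeros(T)
    m = np.zeros(T, dtype=int)
    m[0] = rng.integers(len(MODES))
    for t in range(1, T):
        # Memoryless switching: with probability BETA, re-draw the mode uniformly.
        m[t] = rng.integers(len(MODES)) if rng.random() < BETA else m[t - 1]
        a1, a2 = MODES[m[t]]
        mu = rng.normal(0.0, sigma, size=3)
        y[t, 1] = y[t - 1, 0] + mu[0]
        y[t, 0] = a1 * y[t - 1, 0] + a2 * y[t - 1, 1] + mu[1]
        z[t] = y[t, 0] + mu[2]
    return m, y, z

m, y, z = simulate_switched_lds()
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(z); ax1.set_ylabel("z(t)")
ax2.plot(m, drawstyle="steps-post"); ax2.set_ylabel("mode"); ax2.set_xlabel("t")
plt.show()
```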

Now try to perform inference using the data
set you generated in either part 2a (easy option) or part 2b
(harder option).

If you use the data set from part 2a, then you can model it with a Kalman filter. Implement a KF and plot a portion of the time series along with the posterior predictive distribution. The posterior predictive distribution is P(z(t+1) | z(1), ..., z(t)). Since this distribution is Gaussian, plot the mean and some representation of uncertainty (e.g., lines at +/- 2 SD, or the shaded representation one sometimes sees).
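For reference, here is a minimal Kalman-filter sketch for the part-2a system, written directly from its state-space form (A = [[α1, α2], [1, 0]], H = [1, 0], Q = σ²I, R = σ²) and assuming the generating parameters are known; it returns the one-step-ahead posterior predictive mean and variance of each observation.

```python
import numpy as np

def kalman_predictive(z, alpha1=1.6, alpha2=-0.8, sigma=0.1):
    """One-step-ahead predictive mean/variance of z(t) given z(1..t-1)."""
    A = np.array([[alpha1, alpha2], [1.0, 0.0]])
    H = np.array([[1.0, 0.0]])
    Q = sigma ** 2 * np.eye(2)
    R = sigma ** 2
    mu, P = np.zeros(2), np.eye(2)        # filtered state mean and covariance
    pred_mean, pred_var = [], []
    for zt in z:
        # Predict the hidden state forward one step.
        mu_pred = A @ mu
        P_pred = A @ P @ A.T + Q
        # Posterior predictive for the next observation.
        s = H @ P_pred @ H.T + R
        pred_mean.append((H @ mu_pred).item())
        pred_var.append(s.item())
        # Update with the actual observation.
        K = P_pred @ H.T / s                       # Kalman gain, shape (2, 1)
        mu = mu_pred + (K * (zt - H @ mu_pred)).ravel()
        P = (np.eye(2) - K @ H) @ P_pred
    return np.array(pred_mean), np.array(pred_var)

# Usage: plot z alongside pred_mean +/- 2 * np.sqrt(pred_var) for the uncertainty band.
```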

If you use the data set from part 2b, you could be ambitious and use a particle filter to perform inference, in which case you'll obtain a set of samples that represent the posterior predictive distribution. Or you could try using a model that isn't quite up to the task, e.g., an HMM with a Gaussian observation model (i.e., each state of the HMM outputs a constant value corrupted by Gaussian noise). You'd have to train the HMM, and it would require more than 3 states, since each state represents a constant level, but it would be interesting to see how poorly it models the data.
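If you go the particle-filter route, a bootstrap filter along these lines would do, assuming the generating parameters (the per-mode α's, β, and σ) are known: each particle carries a mode and a hidden state, weights come from the Gaussian observation likelihood, and the particles are resampled at every step.

```python
import numpy as np
from scipy.stats import norm

def particle_filter(z, modes, beta, sigma=0.1, n_particles=1000, rng=None):
    rng = rng or np.random.default_rng()
    m = rng.integers(len(modes), size=n_particles)    # particle modes
    y = np.zeros((n_particles, 2))                    # particle hidden states
    pred_mean = np.zeros(len(z))
    for t, zt in enumerate(z):
        # Propagate: memoryless mode switching, then the per-mode linear dynamics.
        switch = rng.random(n_particles) < beta
        m[switch] = rng.integers(len(modes), size=int(switch.sum()))
        a = np.asarray(modes)[m]                      # (n_particles, 2) of (alpha1, alpha2)
        mu = rng.normal(0.0, sigma, size=(n_particles, 2))
        y = np.column_stack([a[:, 0] * y[:, 0] + a[:, 1] * y[:, 1] + mu[:, 1],
                             y[:, 0] + mu[:, 0]])
        # Predictive mean of z(t) before seeing it (particles are equally weighted here).
        pred_mean[t] = y[:, 0].mean()
        # Weight by the observation likelihood, then resample.
        w = norm.pdf(zt, loc=y[:, 0], scale=sigma)
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        m, y = m[idx], y[idx]
    return pred_mean

# Usage with the part-2b sketch: pred = particle_filter(z, MODES, BETA)
```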
