Wednesday, April 29, 2015

Big data challenges



These days big data is something that every company has to deal with. When it comes to big data, one of the most important challenges is how to split the data into pieces, such that each piece can be processed separately and the results combined.

I think Hadoop is a good choice for tackling big data; in fact it is already used in many companies. Hadoop is open-source software that handles the parallel processing and combining procedures. The question is how we can turn our raw data into data that these kinds of tools can understand.
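To get a feel for the split-and-combine idea, here is a toy sketch in R (not Hadoop itself, and the chunks are made up): each chunk is counted independently, then the partial counts are merged.

 # Toy sketch of the split/combine idea (not Hadoop): count words per chunk,
 # then merge the partial counts, the way a map/reduce job would.
 chunks <- list(c("big", "data"), c("data", "tools"), c("big", "big"))
 partial <- lapply(chunks, table)          # "map": count each chunk on its own
 combined <- Reduce(function(a, b) {       # "combine": merge two partial counts
   keys <- union(names(a), names(b))
   sapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE))
 }, partial)
 combined                                  # big: 3, data: 2, tools: 1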

In fact, turning raw data into something meaningful is currently a big issue for most companies, especially in terms of time, as machine learning systems need the data to be in a rows-and-columns format. Another name for this step is the prep process.
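For instance, a minimal sketch (with made-up raw log lines) of turning raw text into a rows-and-columns format could look like this:

 # Minimal prep-process sketch: parse made-up raw log lines into
 # the rows-and-columns format that learning systems expect.
 raw <- c("2015-04-29 user=ali action=click",
          "2015-04-29 user=sara action=buy")
 fields <- strsplit(raw, " ")
 df <- data.frame(date   = sapply(fields, `[`, 1),
                  user   = sub("user=", "", sapply(fields, `[`, 2)),
                  action = sub("action=", "", sapply(fields, `[`, 3)))
 df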

Sometimes thousands of steps are needed to prepare the data for analysis. Recently ALFA made some breakthroughs in this area, which makes the subject really interesting for machine learners to follow. You may find it quite interesting as well! They reduced the time of prep processing as well as making the process automated.

The most amazing part of this area is that decision making is woven into the process, which means human judgment can be used to help with prediction.

  

Thursday, April 23, 2015

Find your favorite music by data visualization


Liveplasma is one of the best data visualization applications I've ever found.
It is a music and movie visualization app that aims to help you discover other musicians or movies you might enjoy. Type in the name of a band, artist, movie, director or actor and liveplasma will show you related people, bands or movies.


The map tries to show similar artists close together, so different clusters can appear on the map like families of related artists.

Links show the raw data that are computed to place the nodes; they are less relevant than the positions of the nodes, but they help in understanding the relations.
A halo's size reflects the popularity of the artist, so it is easier to tell popular artists from obscure ones.
Colors are fun! Colors help identify artists from map to map, and clustered artists tend to adopt the same color. Basically, a popular artist has their own color, and this color spreads to similar, lesser-known artists.

Enjoy using Liveplasma!

Fencepost errors!




The other day I was surfing the net and came across a couple of interesting problems, one of which is the fencepost error. To be honest I had never heard the term before, so I thought of sharing it here! The name comes from a simple but tricky question that illustrates the error. If you are building a fence with fence posts that are 10 feet apart, and you want the fence to be 100 feet long, how many fence posts do you need? 10 comes out as the first intuitive answer, which is quite WRONG! The fence needs to have 10 intervals between posts, and an extra post is needed to begin the first interval, so we need 11 posts in total. Our first answer was off by one, and that's why the term "off-by-one" is used interchangeably for this error.
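A one-line check in R makes the arithmetic concrete (just a sketch of the counting, nothing more):

 # One post per interval, plus one extra post to begin the first interval.
 fence_posts <- function(length_ft, spacing_ft) length_ft / spacing_ft + 1
 fence_posts(100, 10)   # 11, not 10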


Fencepost errors are among the most common programming mistakes. The technique of extreme cases helps us find and fix the bugs that come from these counting problems.

  • Example: Consider the sum of the first \( n \) odd integers:

             
                                             
\[ S = \underbrace{1 + 3 + 5 + \cdots}_{n\ \text{terms}} \]
Odd numbers are of the form \(2k+1\) or \(2k-1\). Now, how would you answer this question: is the last number \(2n+1\) or \(2n-1\)?

For general \(n\), the answer is not obvious. You can figure it out, but it is easy to make an algebra mistake and be off by one term, which is the difference between \(2n-1\) and \(2n+1\). An extreme case settles the question. Here is how the technique works.

  1. Pick an extreme value of \(n\), one where the last term in the sum is easy to determine.

  2. For that \(n\), determine the last term.

  3. See which prediction, \(2n-1\) or \(2n+1\), is consistent with this last term.

The most extreme value of \(n\) is 0. Since \(n\) is the number of terms, however, \(n=0\) does not make sense here. The next most extreme case is \(n=1\). With only one term, the final term is 1, which is \(2n-1\). So the final term, in general, is \(2n-1\). Therefore, the sum is
\[ S = 1 + 3 + 5 + \cdots + (2n-1). \]
Employing the sigma notation, it is
\[ S = \sum_{k=0}^{n-1} (2k+1). \]
This technique can be helpful in spotting off-by-one bugs in computer programs.
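As a quick sanity check in R (the extreme-case reasoning, run numerically), the last term and the closed form can be verified for a small \(n\):

 # The last odd term is 2n - 1, and the sum of the first n odd integers is n^2.
 n <- 7
 odds <- seq(1, 2 * n - 1, by = 2)
 length(odds)   # 7, i.e. n terms
 sum(odds)      # 49, i.e. n^2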

Wednesday, April 22, 2015

Rejection sampling

Inference
Suppose that you want to evaluate a joint probability density function \( f \) over an arbitrary interval. Conventional statistical techniques tell you to compute the integral of the function \( f \) over the given interval to find the probability of that region. Additionally, suppose the function is complex enough that calculating the integral is not an easy way to compute that probability. This is the case where Monte Carlo inference can be deployed to help us compute the integral implicitly, without integrating the function explicitly. Monte Carlo inference is a handy technique that allows us to calculate complex integrals with a little easy math, though at the cost of expensive computation and a bit of a trick to adjust a few constants and parameters.
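As a tiny illustration in R (a univariate sketch, separate from the example below), the probability of an interval under a standard normal can be estimated by a sample average instead of an integral:

 # Replace an integral with a sum: estimate P(0 < X < 1) for X ~ N(0, 1)
 # by the fraction of random draws that land in the interval.
 x <- rnorm(1e5)
 mean(x > 0 & x < 1)   # close to pnorm(1) - pnorm(0), about 0.341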

The term Monte Carlo inference refers to a general family of techniques that replace an integral with a sum under certain assumptions. Rejection sampling is an instance of Monte Carlo inference that requires a proposal distribution, say \( q(x) \). With rejection sampling, even an unnormalized function \( \tilde{f} \) can be approximated, provided \( M q(x) \geq \tilde{f}(x)\) for all values of \( x \) and some constant \(M\).

Let's start with an example to demonstrate how rejection sampling works. A bivariate joint density function \(f(x_1, x_2) \propto \exp\{ -\frac{1}{2} [ \frac{x_1^2}{100} + (x_2 + 0.05x_1^2 - 5)^2 ]\}\) is given on \( \mathbf{R}^2\) up to an unknown constant, and I'm asked to compute the probability of the rectangle \( \{(x_1, x_2) : (x_1, x_2) \in (-3, -1) \times (5, 6) \} \). To this end, I should compute the integral of the exact function \(f(x_1, x_2)\) over the given rectangle. I have two problems in doing so: 1) I do not have the exact function; rather, I have its unnormalized version \( \tilde{f}(x_1, x_2)\); 2) even if I had the function \(f\), I would be too lazy to review my Calculus II to work out the 2D integral again! So here is where Monte Carlo inference comes in, approximating the integral with a sum. Now, I want to find a proposal \(q(x)\) such that \(Mq(x)\) forms an envelope of the function \(\tilde{f}\), preserving the aforementioned constraint. To find this envelope, I can plot the function \(\tilde{f}\) to get a sense of how it looks. So here you go,




The above graphs show that the function \( \tilde{f}(x_1, x_2)\) is a unimodal function with two tails, so it can be imagined that a bivariate normal distribution can easily cover this function as a proposal distribution with an appropriate \(M\). In the following graphs, the proposed envelope is depicted, where \(M=\max\left(\frac{\tilde{f}(x_1, x_2)}{q(x_1, x_2)}\right)\).





Now, with the proposal function \(q\) and the constant \(M\) fixed, we are able to draw samples from the proposal and approximate the unnormalized function \(\tilde{f}\) by keeping the accepted ones according to this rule:

Iterate this procedure a fairly large number of times:
1) Sample a single random number \(u\) from a uniform distribution on \((0,1)\).
2) Sample a bivariate random number \((x_1, x_2)\) from the proposal \(q(x_1, x_2)\).
3) If \(u>\frac{\tilde{f}(x_1, x_2)}{Mq(x_1, x_2)}\), reject it. Otherwise, accept it.
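In code, the rule amounts to the following univariate sketch (the arguments rq and dq, a sampler and a density for the proposal, are assumed helpers, as is the envelope constant M):

 # Minimal univariate rejection sampler: rq/dq are an assumed sampler and
 # density for the proposal q, f is the unnormalized target, M the envelope constant.
 rejection_sample <- function(n, rq, dq, f, M) {
   x <- rq(n)                    # candidates from the proposal
   u <- runif(n)                 # uniform(0, 1) draws
   x[u <= f(x) / (M * dq(x))]    # keep only the accepted candidates
 }
 # e.g. sample from an unnormalized N(0, 1) using a wide normal proposal:
 # rejection_sample(1e4, function(n) rnorm(n, 0, 3), function(x) dnorm(x, 0, 3),
 #                  function(x) exp(-x^2 / 2), M = 3 * sqrt(2 * pi))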

In the following graph, a fairly large number of samples is drawn from the proposal function \(q(x_1, x_2)\).

Then the accepted ones are depicted in the figures below,



As can be seen, the accepted samples meaningfully capture the dense areas of the function \(\tilde{f}\). It's worth mentioning that the black squares are the accepted samples in the rectangle \( \{(x_1, x_2) : (x_1, x_2) \in (-3, -1) \times (5, 6) \} \); thus the proportion of these black samples to all accepted samples estimates the probability of this region under the density associated with \(\tilde{f}\).

This was a short introduction to the rejection sampling technique by way of an example, showing how one can do approximate inference in the absence of the normalizing constant of a given density function, without any integration.

Appendix
In this example, \(\mu=(0, 10)\) is suggested as the mean vector of the proposal distribution, with \(\begin{bmatrix}1000 & 100 \\ 100 & 250 \end{bmatrix}\) as the covariance matrix. The following code snippets could help you reproduce the above figures; they use the mvtnorm package (dmvnorm, rmvnorm) and the rgl package (open3d, plot3d).
The function definition is implemented here,
 
 library(mvtnorm)  # for dmvnorm and rmvnorm
 library(rgl)      # for open3d and plot3d

 # Unnormalized target density from the text
 f <- function(x1, x2) exp(-0.5 * (x1^2 / 100 + (x2 + 0.05 * x1^2 - 5)^2))
Then \(n\) data points are generated randomly to compute the values of the functions \(\tilde{f}\) and \(q\).
 
 n = 10000
 x1 = runif(n, -20, 20)
 x2 = runif(n, -15, 8)
 mn <- c(0, 10); sigma <- matrix(c(1000, 100, 100, 250), 2, 2)  # proposal parameters
 fv <- f(x1, x2)                           # unnormalized target values
 nrm <- dmvnorm(cbind(x1, x2), mn, sigma)  # proposal density q
 cols <- myColorRamp(c("red", "yellow", "green", "blue"), fv)
 open3d(); plot3d(x = x1, y = x2, z = fv, col = cols)
Finally, we sample some data from \(q\) and accept those that satisfy the criterion; then the proportion of the accepted data falling in the given rectangle to the entire accepted sample identifies the probability of the region.
 
 p = rmvnorm(n, mn, sigma)                 # candidates drawn from the proposal q
 fv <- f(p[, 1], p[, 2]); nrm <- dmvnorm(p, mn, sigma)
 tt = (fv / (max(fv / nrm) * nrm))         # acceptance ratio: f~/(M q), M = max(f~/q)
 r = runif(n, 0, 1)
 index = r < tt                            # accepted samples
 x1 <- p[index, 1]; x2 <- p[index, 2]; fv <- fv[index]
 cols <- myColorRamp(c("red", "yellow", "green", "blue"), fv)
 open3d(); plot3d(x1, x2, fv, col = cols)
 idx = ((x1 >= -3 & x1 <= -1) & (x2 >= 5 & x2 <= 6))  # accepted samples in the rectangle
 x11 = x1[idx]; x22 = x2[idx]; fv1 = fv[idx]
 plot3d(x11, x22, fv1, col = "black", size = 5, add = TRUE)
 sum(idx) / sum(index)                     # estimated probability of the rectangle
In addition, a color ramp is a handy way to show the density values of a function,
 
 myColorRamp <- function(colors, values) {
   v <- (values - min(values)) / diff(range(values))  # rescale values to [0, 1]
   x <- colorRamp(colors)(v)
   rgb(x[, 1], x[, 2], x[, 3], maxColorValue = 255)
 }