Saturday, March 14, 2015

What is Evidence?


In God we trust; all others bring data (W. Edwards Deming).

I was discussing hypothesis testing with Damoon several days ago, and I realized it is not obvious how one can judge a theory using random observations. I came up with a simple example and thought it better to share it with you.

It is counter-intuitive how we can extract evidence from data, if somebody brings it. There are many paradigms; I do not want to go into the details and make long, boring statements. I suggest the classic example: tossing a coin.
Let's ask a simple question: what sort of data is evidence against the fairness (or unfairness) of a coin? By fair I mean a 1/2 chance of getting a Head and a 1/2 chance of getting a Tail.

Suppose somebody tosses a coin 4 times and gets:  Tail Tail Tail Tail
Such a coin looks suspicious, right? It seems we tend to believe the coin produces more Tails than Heads.

It is widely accepted that we vote against a theory (sometimes called an assumption, sometimes a hypothesis) under which the observed results look suspicious; by results I mean data.

To get a better understanding of a suspicious result, let's compute the probability that a fair coin gives 4 Tails in 4 trials:
(1/2)^4 = 0.0625 ~ 0.06
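
The same number can be checked in R. This is only a small sketch: the variable names are mine, and dbinom is base R's binomial probability function, used here as a cross-check on the direct computation.

```r
# probability that a fair coin gives 4 Tails in 4 tosses
n_tosses <- 4
p_tail <- 0.5

# direct computation: (1/2)^4
p_direct <- p_tail^n_tosses

# same value from the binomial distribution: 4 Tails out of 4 tosses
p_binom <- dbinom(n_tosses, size = n_tosses, prob = p_tail)

p_direct
## [1] 0.0625
p_binom
## [1] 0.0625
```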

Usually the threshold between being merely suspicious and being evidence is 0.05 (sometimes 0.01, if a scientist is conservative). This quantity is related to the p-value and to statistical hypothesis testing. Interpreting a p-value is difficult; if you are interested, see this paper

Therefore, a scientist who tosses a coin 4 times and gets 4 Tails lives with the fair coin hypothesis, since 0.06 is above the 0.05 threshold.

Now suppose we toss the coin five times and get Tail Tail Tail Tail Tail. What would the scientist decide? Such data can be produced under the fair coin hypothesis with probability
(1/2)^5 ~ 0.03 < 0.05. So the scientist will believe that the coin is unfair!
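
The decision rule can be sketched in a few lines of R. The helper all_tails_check is a hypothetical name of mine, not a standard function, and the 0.05 threshold is the one used in this post. As a cross-check I also call binom.test from base R, counting Tails as "successes" and using a one-sided alternative, which I assume matches the reasoning here.

```r
# hypothetical helper: probability of n Tails in n tosses of a fair coin,
# and whether that probability falls below the chosen threshold
all_tails_check <- function(n, threshold = 0.05) {
  p <- 0.5^n
  data.frame(tosses = n, probability = p, suspicious = p < threshold)
}

# compare 4 tosses with 5 tosses
result <- all_tails_check(4:5)
result

# cross-check for 5 Tails in 5 tosses with a one-sided binomial test
p_five <- binom.test(5, 5, p = 0.5, alternative = "greater")$p.value
p_five
```

The 4-toss row is not flagged and the 5-toss row is, which is exactly the jump between the two cases.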

I suggest you toss a coin 5 times. I bet you get at least one Head and at least one Tail; try it if you do not believe me.
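
If you have no coin at hand, R can toss one for you. The sketch below reads the bet as "five tosses of a fair coin rarely come out all the same", which is my interpretation; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)             # arbitrary seed, only for reproducibility
n_experiments <- 10000  # arbitrary number of repeated 5-toss experiments

# each column is one experiment of 5 tosses of a fair coin
tosses <- matrix(sample(c("Head", "Tail"), 5 * n_experiments, replace = TRUE),
                 nrow = 5)

# an experiment is "mixed" if it shows both a Head and a Tail
mixed <- apply(tosses, 2, function(x) length(unique(x)) == 2)

# share of experiments where the bet is won
share_mixed <- mean(mixed)
share_mixed

# theoretical value: 1 - 2 * (1/2)^5
1 - 2 * 0.5^5
## [1] 0.9375
```

The simulated share should land close to 0.9375, so the bet is won roughly 15 times out of 16.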

Wednesday, March 11, 2015

First Step in Data Analysis!

I have found that, most of the time, when you receive a new data set you simply do not know how to start.
Visualization (ideally combined with a general data analysis tool) is always the first step. It gives you ideas about possible further steps for knowledge discovery using more advanced computational or statistical techniques.
Most of the time one can arrange data in a matrix, with subjects in rows and variables in columns. I have found that, in practice, it is sometimes hard to decide what should be the subject and what should be the variable. Let's discuss this fundamental issue later and do some analysis now.
Let's start with a simple example, the iris data. It is already arranged in a matrix, hopefully with subjects in rows and variables in columns.
# load data
data(iris)

# see the data size
dim(iris)
## [1] 150   5
The iris data in R contain 150 rows and 5 variables. I am going to use only 4 of these 5 variables (you will learn why at the end of this post).
I find a heat map to be the perfect tool to start data analysis. It visualizes the data with heat colours and draws two clustering trees, one on the rows and one on the columns. Be careful: this tool does not work well for large data (by large I mean matrices with dimensions above 1000 x 1000). That's why I checked the dimension first.
To run the heatmap command on the iris data I have to use the as.matrix function, since the iris data are in data.frame format but heatmap accepts only a matrix. The data.frame and matrix formats in R are quite similar. I suggest using the matrix format when all of your data are quantitative.
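
A quick way to see the difference between the two formats is to inspect the object before and after the conversion. This is only a sketch; iris_matrix is a name I made up for illustration.

```r
data(iris)

# the original object is a data.frame
is.data.frame(iris)
## [1] TRUE

# convert the four measurement columns to a matrix
iris_matrix <- as.matrix(iris[, 1:4])

# the result is a numeric matrix with the same 150 rows
is.matrix(iris_matrix)
## [1] TRUE
is.numeric(iris_matrix)
## [1] TRUE
dim(iris_matrix)
## [1] 150   4
```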
# heatmap data
heatmap(as.matrix(iris[, 1:4]), cexRow=0.5, cexCol=0.8)
[Figure: heat map of the four iris measurements, with clustering trees on rows and columns]
You see three blocks of data on the right. This discovery is astonishing, since the iris data matrix contains measurements of three different iris species.
The truth about the iris data: rows 1 to 50 are measurements from the setosa category, 51 to 100 from versicolor, and 101 to 150 from virginica.
Question: now think about why I excluded the 5th column from the analysis. (See below.)
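
The blocks can also be compared with the categories numerically. The sketch below assumes the heatmap defaults (Euclidean distance via dist and complete linkage via hclust) and cuts the row tree into three groups; the names row_tree and clusters are mine.

```r
data(iris)
x <- as.matrix(iris[, 1:4])

# row clustering with the defaults heatmap uses: dist() and hclust()
row_tree <- hclust(dist(x))

# cut the tree into three groups, one per visible block
clusters <- cutree(row_tree, k = 3)

# cross-tabulate the groups against the known categories in column 5
comparison <- table(cluster = clusters, species = iris[, 5])
comparison
```

Each column of the table adds up to 50, and the rows show how closely the three branches of the tree follow the three categories.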
# 5th column
iris[ ,5]
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
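
One way to read the output above: the 5th column is not a measurement but a label. The following sketch checks this in R; full_matrix is a name I chose for illustration.

```r
data(iris)

# the 5th column is a factor (a categorical label), not a number
class(iris[, 5])
## [1] "factor"

# 50 rows per category
species_counts <- table(iris[, 5])
species_counts

# converting all five columns gives a character matrix, not a numeric one,
# so it is not suitable input for a heat map of quantitative data
full_matrix <- as.matrix(iris)
is.numeric(full_matrix)
## [1] FALSE
```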