How do I determine a sufficient sample size? Is it possible to compute a similarity score between two distributions?
If yes, please elaborate.
Distribution similarity example: Suppose we want to measure the similarity between two documents using their word frequencies. One way to do that is to use the frequency of each word to build a histogram for each document. We then want to find which distribution (histogram) is similar to which. What would be the best solution or tool?
Researchers use power analysis to determine sample size. Power analysis gives the sample size required to detect an effect of a given size with a given probability (the power) at a chosen significance level. Conversely, under a fixed sample size, it gives the probability of detecting an effect of a given size. Depending on that probability, researchers may modify or abandon the experiment. A power analysis can be carried out in the following five steps:
- Determine a hypothesis test
- Determine the significance level of the hypothesis test
- Determine the smallest effect size that is of scientific interest
- Estimate the value of other required parameters, such as the mean and standard deviation (SD). This often requires a pilot study from which the mean and SD are calculated.
- Determine the intended power of the test
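Once these five quantities are fixed, the sample size follows from them. As a rough sketch only, assuming a two-sided comparison of the means of two equally sized groups and using the normal approximation rather than an exact t-test calculation, the computation could look like this. The function name `sample_size_per_group` and its parameters are hypothetical, chosen for illustration:

```python
import math
from statistics import NormalDist


def sample_size_per_group(min_effect, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group (normal approximation).

    min_effect: smallest difference in means of scientific interest
    sd:         estimated standard deviation (e.g. from a pilot study)
    alpha:      significance level of the two-sided test
    power:      intended probability of detecting min_effect
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the intended power
    d = min_effect / sd                          # standardized effect size
    n = 2 * ((z_alpha + z_power) / d) ** 2
    return math.ceil(n)                          # round up to whole subjects


# Detect a difference of 5 units when the SD is about 10
print(sample_size_per_group(min_effect=5, sd=10))  # 63
```

The normal approximation tends to give a slightly smaller number than an exact t-test based calculation, so treat the result as a lower bound and confirm it with a dedicated power analysis tool before running the study.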
Box plots are useful for spotting similarities and differences between two distributions, and are often used on their own for comparing distributions. However, one has to know how to compare two box plots. Each box represents the interquartile range, i.e. the middle half of the values in that group. If the two boxes do not overlap, this suggests a difference between the two groups. Next, compare the median lines, the whiskers (if any) and the outliers lying beyond the whiskers. Together these give a picture of how similar or different the distributions are.
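The box comparison described above can also be done numerically, without drawing the plot. The following is a minimal sketch, assuming each group is a plain list of numbers; the helper names `box_summary` and `boxes_overlap` are hypothetical, and the quartile method ("inclusive") is one of several common conventions, so other tools may report slightly different quartiles:

```python
from statistics import quantiles


def box_summary(values):
    """Return the five numbers a box plot is built from."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def boxes_overlap(a, b):
    """True if the interquartile ranges (the boxes) of a and b overlap."""
    sa, sb = box_summary(a), box_summary(b)
    return sa["q1"] <= sb["q3"] and sb["q1"] <= sa["q3"]


group_a = [1, 2, 3, 4, 5]
group_b = [10, 11, 12, 13, 14]
print(box_summary(group_a))
print(boxes_overlap(group_a, group_b))  # False: the boxes are far apart
```

Non-overlapping boxes are only an informal indication of a difference, not a formal test, so this check is best used as a first look at the data.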