Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap methods are alternative approaches to traditional hypothesis testing and are notable for being easier to understand and valid for more conditions.

In this blog post, I explain bootstrapping basics, compare bootstrapping to conventional statistical methods, and explain when it can be the better method. Additionally, I’ll work through an example using real data to create bootstrapped confidence intervals.

## Bootstrapping and Traditional Hypothesis Testing Are Inferential Statistical Procedures

Both bootstrapping and traditional methods use samples to draw inferences about populations. To accomplish this goal, these procedures treat the single sample that a study obtains as only one of many random samples that the study could have collected.

From a single sample, you can calculate a variety of sample statistics, such as the mean, median, and standard deviation—but we’ll focus on the mean here.

Now, suppose an analyst repeats their study many times. In this situation, the mean will vary from sample to sample and form a distribution of sample means. Statisticians refer to this type of distribution as a sampling distribution. Sampling distributions are crucial because they place the value of your sample statistic into the broader context of many other possible values.

While performing a study many times is infeasible, both methods can estimate sampling distributions. Using the larger context that sampling distributions provide, these procedures can construct confidence intervals and perform hypothesis testing.

**Related posts**: Differences between Descriptive and Inferential Statistics

## Differences between Bootstrapping and Traditional Hypothesis Testing

A primary difference between bootstrapping and traditional statistics is how they estimate sampling distributions.

Traditional hypothesis testing procedures require equations that estimate sampling distributions using the properties of the sample data, the experimental design, and a test statistic. To obtain valid results, you’ll need to use the proper test statistic and satisfy the assumptions. I describe this process in more detail in other posts—links below.

The bootstrap method uses a very different approach to estimate sampling distributions. This method takes the sample data that a study obtains, and then resamples it over and over to create many simulated samples. Each of these simulated samples has its own properties, such as the mean. When you graph the distribution of these means on a histogram, you can observe the sampling distribution of the mean. You don’t need to worry about test statistics, formulas, and assumptions.

The bootstrap procedure uses these sampling distributions as the foundation for confidence intervals and hypothesis testing. Let’s take a look at how this resampling process works.

**Related posts**: How t-Tests Work and How the F-test Works in ANOVA

## How Bootstrapping Resamples Your Data to Create Simulated Datasets

Bootstrapping resamples the original dataset with replacement many thousands of times to create simulated datasets. This process involves drawing random samples from the original dataset. Here’s how it works:

- The bootstrap method has an equal probability of randomly drawing each original data point for inclusion in the resampled datasets.
- The procedure can select a data point more than once for a resampled dataset. This property is the “with replacement” aspect of the process.
- The procedure creates resampled datasets that are the same size as the original dataset.

The process ends with your simulated datasets having many different combinations of the values that exist in the original dataset. Each simulated dataset has its own set of sample statistics, such as the mean, median, and standard deviation. Bootstrapping procedures use the distribution of the sample statistics across the simulated samples as the sampling distribution.

## Example of Bootstrap Samples

Let’s work through an easy case. Suppose a study collects five data points and creates four bootstrap samples, as shown below.

This simple example illustrates the properties of bootstrap samples. The resampled datasets are the same size as the original dataset and only contain values that exist in the original set. Furthermore, these values can appear more or less frequently in the resampled datasets than in the original dataset. Finally, the resampling process is random and could have created a different set of simulated datasets.
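To make these properties concrete, here is a minimal Python sketch that draws four bootstrap samples from five data points. The five values and the seed are made up for illustration; they are not from any study. It uses only the standard library, with `random.choices` doing the sampling with replacement.

```python
import random

# Hypothetical original sample of five data points (illustrative values only).
original = [3, 7, 8, 12, 15]

# A fixed seed makes the run reproducible; the seed value is arbitrary.
rng = random.Random(42)

# Each bootstrap sample has the same size as the original and is drawn
# with replacement, so a value can appear more than once or not at all.
bootstrap_samples = [rng.choices(original, k=len(original)) for _ in range(4)]

for i, resample in enumerate(bootstrap_samples, start=1):
    print(f"Bootstrap sample {i}: {sorted(resample)}")
```

Changing the seed produces a different set of simulated datasets, which is the randomness described above.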

Of course, in a real study, you’d hope to have a larger sample size, and you’d create thousands of resampled datasets. Given the enormous number of resampled datasets, you’ll always use a computer to perform these analyses.

## How Well Does Bootstrapping Work?

Resampling involves reusing your one dataset many times. It almost seems too good to be true! In fact, the term “bootstrapping” comes from the phrase about the impossible feat of pulling yourself up by your own bootstraps! However, using the power of computers to randomly resample your one dataset to create thousands of simulated datasets produces meaningful results.

The bootstrap method has been around since 1979, and its usage has increased. Various studies over the intervening decades have determined that bootstrap sampling distributions approximate the correct sampling distributions.

To understand how it works, keep in mind that bootstrapping does not create new data. Instead, it treats the original sample as a proxy for the real population and then draws random samples from it. Consequently, the central assumption for bootstrapping is that the original sample accurately represents the actual population.

The resampling process creates many possible samples that a study could have drawn. The various combinations of values in the simulated samples collectively provide an estimate of the variability between random samples drawn from the same population. The range of these potential samples allows the procedure to construct confidence intervals and perform hypothesis testing. Importantly, as the sample size increases, bootstrapping converges on the correct sampling distribution under most conditions.
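One rough way to see this for yourself is to compare the bootstrap standard error of the mean with the textbook formula s/√n, which is known to work well for means. The sketch below is an assumption-laden illustration: the sample is synthetic (200 draws from an exponential distribution with mean 10), and the seed and number of resamples are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(7)  # arbitrary seed for reproducibility

# Synthetic right-skewed sample; a stand-in for real study data.
sample = [rng.expovariate(1 / 10) for _ in range(200)]

# Bootstrap: resample with replacement and record the mean of each resample.
boot_means = [
    statistics.fmean(rng.choices(sample, k=len(sample)))
    for _ in range(5_000)
]

# The standard deviation of the bootstrap means estimates the standard error.
bootstrap_se = statistics.stdev(boot_means)

# Traditional formula for the standard error of the mean.
formula_se = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"bootstrap SE: {bootstrap_se:.3f}")
print(f"formula SE:   {formula_se:.3f}")
```

The two estimates should land close to each other, even though the underlying data are skewed.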

Now, let’s see an example of this procedure in action!

## Example of Using Bootstrapping to Create Confidence Intervals

For this example, I’ll use bootstrapping to construct a confidence interval for the mean of a dataset that contains the body fat percentages of 92 adolescent girls. I used this dataset in my post about identifying the distribution of your data. These data do not follow the normal distribution. Because they do not meet the normality assumption of traditional statistics, they’re a good candidate for bootstrapping. However, the large sample size might let us bypass this assumption. The histogram below displays the distribution of the original sample data.

Download the CSV dataset to try it yourself: body_fat.

### Performing the bootstrap procedure

To create the bootstrapped samples, I’m using Statistics101, which is a giftware program. This is a great simulation program that I’ve also used to tackle the Monty Hall Problem!

Using its programming language, I’ve written a script that takes my original dataset and resamples it with replacement 500,000 times. This process produces 500,000 bootstrapped samples with 92 observations in each. The program calculates each sample’s mean and plots the distribution of these 500,000 means in the histogram below. Statisticians refer to this type of distribution as the sampling distribution of means. Bootstrapping methods create these distributions using resampling, while traditional methods use equations for probability distributions. Download this script to run it yourself: BodyFatBootstrapCI.

To create the bootstrapped confidence interval, we simply use percentiles. For a 95% confidence interval, we need to identify the middle 95% of the distribution. To do that, we use the 97.5th percentile and the 2.5th percentile (97.5 – 2.5 = 95). In other words, if we order all sample means from low to high, and then chop off the lowest 2.5% and the highest 2.5% of the means, the middle 95% of the means remain. That range is our bootstrapped confidence interval!

For the body fat data, the program calculates a 95% bootstrapped confidence interval for the mean of [27.16, 30.01]. We can be 95% confident that the population mean falls within this range.

This interval has the same width as the traditional confidence interval for these data, and it is different by only several percentage points. The two methods are very close.
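The percentile method is short enough to sketch in a few lines of Python. This is not the Statistics101 script from this post. It also uses a synthetic, right-skewed stand-in sample of 92 values (the lognormal parameters and the seed are arbitrary assumptions) rather than the body fat data, so its interval will not match the one reported above.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed for reproducibility

# Synthetic stand-in for a skewed sample of 92 observations.
sample = [rng.lognormvariate(3.3, 0.3) for _ in range(92)]

# Resample with replacement and keep the mean of every bootstrap sample.
boot_means = [
    statistics.fmean(rng.choices(sample, k=len(sample)))
    for _ in range(5_000)
]

# quantiles(n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%,
# so the first and last cut points bound the middle 95%.
cuts = statistics.quantiles(boot_means, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"95% bootstrap percentile CI: [{lower:.2f}, {upper:.2f}]")
```

To try it on real data, replace `sample` with the values loaded from your own file.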

Notice how the sampling distribution in the histogram approximates a normal distribution even though the underlying data distribution is skewed. This approximation occurs thanks to the central limit theorem. As the sample size increases, the sampling distribution converges on a normal distribution regardless of the underlying data distribution (with a few exceptions). For more information about this theorem, read my post about the Central Limit Theorem.

Compare this process to how traditional statistical methods create confidence intervals.

## Benefits of Bootstrapping over Traditional Statistics

Readers of my blog know that I love intuitive explanations of complex statistical methods. And, bootstrapping fits right in with this philosophy. This process is much easier to comprehend than the complex equations required for the probability distributions of the traditional methods. However, bootstrapping provides more benefits than just being easy to understand!

Bootstrapping does not make assumptions about the distribution of your data. You merely resample your data and use whatever sampling distribution emerges. Then, you work with that distribution, whatever it might be, as we did in the example.

Conversely, the traditional methods often assume that the data follow the normal distribution or some other distribution. For the normal distribution, the central limit theorem might let you bypass this assumption for sample sizes that are larger than ~30. Consequently, you can use bootstrapping for a wider variety of distributions, unknown distributions, and smaller sample sizes. Sample sizes as small as 10 can be usable.

In this vein, all traditional methods use equations that estimate the sampling distribution for a specific sample statistic when the data follow a particular distribution. Unfortunately, formulas for all combinations of sample statistics and data distributions do not exist! For example, there is no known sampling distribution for medians, which makes bootstrapping the perfect analysis for them. Other analyses have assumptions such as equality of variances. However, none of these issues are problems for bootstrapping.
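Switching to a different statistic only requires changing the function applied to each resample. Here is a hedged sketch of a percentile interval for the median, using a small hypothetical sample of 15 values (made up for illustration) and an arbitrary seed.

```python
import random
import statistics

rng = random.Random(3)  # arbitrary seed for reproducibility

# Hypothetical small sample (n = 15); values are illustrative only.
sample = [4.1, 5.6, 2.9, 7.3, 3.8, 12.4, 4.7, 6.2, 3.3, 5.1,
          9.8, 4.4, 6.9, 3.6, 5.9]

# Same resampling loop as for the mean, but with the median as the statistic.
boot_medians = [
    statistics.median(rng.choices(sample, k=len(sample)))
    for _ in range(10_000)
]

# First and last of the 39 cut points are the 2.5th and 97.5th percentiles.
cuts = statistics.quantiles(boot_medians, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"sample median: {statistics.median(sample)}")
print(f"95% bootstrap percentile CI for the median: [{lower}, {upper}]")
```

Because a bootstrap median can only take values built from the original observations, the bootstrap distribution is lumpy for small samples, which is one reason adjusted interval methods exist.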

## For Which Sample Statistics Can I Use Bootstrapping?

While this blog post focuses on the sample mean, the bootstrap method can analyze a broad range of sample statistics and properties. These statistics include the mean, median, mode, standard deviation, analysis of variance, correlations, regression coefficients, proportions, odds ratios, variance in binary data, and multivariate statistics among others.

There are several, mostly esoteric, conditions when bootstrapping is not appropriate, such as when the population variance is infinite, or when the population values are discontinuous at the median. And, there are various conditions where tweaks to the bootstrapping process are necessary to adjust for bias. However, those cases go beyond the scope of this introductory blog post.

Mark M says

Jim,

Your explanation of bootstrapping was so good that I subscribed to your site. I found probability easier to understand than statistics in college, so I am very glad I found your site. I do have a conceptual question about your (adolescent girl body fat) example. How do you estimate a minimum reasonable size for the sample set you plan to bootstrap from?

It seems like a number of factors would have a significant impact on body fat. If we use age (ten annual buckets – 10-19 years), parental income/education level (at least 3 buckets) and race (at least 5 buckets) those factors would generate (10x3x5) = 150 buckets. Given that, it seemed as if 92 data points is not enough to accurately represent the underlying data.

How does one figure out a minimum reasonable sample size? If you address this in one of your books, I am glad to buy it.

Sorry if this is a silly question.

Mark

Sujay Dutta says

Thanks for a great article, Jim. I have a question. One interpretation of parametric confidence intervals is that values near the center of the interval are more likely for the population parameter concerned than values near the peripheries of the interval. Does this interpretation apply to bootstrap CIs also?

Jeremy says

Thanks for the very good intro to bootstrapping here, Jim! I have a couple of questions. If your sample has a wide range (i.e. a wide standard deviation), or even just a few extreme outliers, would your bootstrap-derived sample distribution end up being wider, too, or would it shift the mean? I guess the premise of bootstrapping is that the variation in your single sample will end up being mirrored in your bootstrap-derived distribution of sample means? My second question is how do you use bootstrapping for regression: do you simply get a bootstrapped distribution of sample means for each value of your regression coefficients to get confidence intervals?

MIke Coulthart says

Thank you Jim, for this wonderfully clear explanation of the principles behind the classical nonparametric bootstrap. Have you considered presenting a similar exposition of Donald Rubin’s (1981) Bayesian bootstrap? I have tried and tried to read this paper, but still cannot intuitively grasp it.

Jim Frost says

Hi Mike,

Perhaps that can be fodder for a future blog post!

saeideh says

Hello, Thank you for the tutorial.

I have a question. I have some subjects with different numbers of trials. Since the number of trials affects the result, I first want to equalize the number of trials by, for example, randomly extracting 60 trials from each subject. How can I do this with bootstrapping?

Thank you in advance

Jim Frost says

Hi Saeideh,

My real strengths are in the parametric methods. In those methods, it’s ok to have different numbers of subjects between groups. It’s not the most efficient in terms of maximizing statistical power, but it’s ok. I know the same is true with at least some of the bootstrapping techniques. For example, I do know that you can compare the means of unequal size groups. There might be some alteration in the bootstrapping method to account for that size difference. I’d look into that before you attempt to equalize group numbers.

Tony says

Hi Jim,

I am not really understanding how to create a confidence interval from a bootstrap. Could you please give another example?

Thanks,

Tony

Jim Frost says

Hi Tony,

It’s really the process that’s important. Apply the following process to any sample.

All this method does is take your sample and resample it many times to create many bootstrap samples. It calculates the mean (or whatever you’re studying) for each bootstrap sample. Then it lines those up in order from low to high. From that lineup, it picks out the middle 95% of values, or whatever confidence level you’re using. If you use 95%, then you know that 95% of your bootstrap sample means fell within that range.

It’s really that simple for bootstrap CIs. Some methods will add correction factors, incorporate PDFs, etc., but I’m just showcasing the simplest version to illustrate the principles behind how they work.

Gemechu Asfaw says

Hi JIM.

I am new to the bootstrapping method. Is bootstrapping parametric or nonparametric? When do we use it?

Jim Frost says

Hi Gemechu,

There are parametric and nonparametric forms of bootstrapping. Read what I write in this comment about that issue.

I’d recommend using bootstrapping methods when you can’t satisfy the assumptions for a parametric or nonparametric non-bootstrapping test. You can read about that in my post about parametric vs. nonparametric tests. In some cases, you can use bootstrapping when a non-bootstrapping form of the test doesn’t exist.

I hope this helps!

Robert Matthews says

This was a really helpful addendum to Jim’s basic introduction; thanks for going to all the trouble of providing links too !

Adrian says

Dear Jim,

I am a big fan of your blog and admire not only your deep knowledge but also your extraordinary ability to teach it in a “digestive” way. That’s a virtue!

Now, let me add my few cents on these methods. They are very useful by relaxing the most problematic parametric assumption, but one should be very careful while using it. Let me list a few points, coming from my everyday work:

1) The proper method should be used for the proper goal. Bootstrap tests don’t evaluate the p-value under the null hypothesis unless additional steps are taken (shifting the data by the mean to simulate the null). Otherwise one may be *very* surprised.

Please find:

* “Computing p-value using bootstrap with R” – https://tinyurl.com/y4d2lu6o

* “Why shift the mean of a bootstrap distribution when conducting a hypothesis test?” – https://tinyurl.com/y4wra7z2

1a) Permutation tests do that. But the permutation test can be run if and only if the samples are IID = same sample shape and same dispersion. It doesn’t have to be normal (it can be any distribution), but should be IID. Why? Because permutation directly assumes exchangeability of the data. If there is a difference, the rule is broken, so the method is broken.

1b) If, instead of an exact permutation test, an approximate test is used (only a subset of all permutations is employed), the p-value won’t be exact either.

2) The bootstrap provides only asymptotic and only average coverage probability (“95%” approaches the requested 95%). In certain industries, like clinical trials, that’s often unacceptable. Here we often use the conservative, exact methods, giving the minimum (not average) probability coverage (usually we also want the shortest CI).

BTW, the FDA advises against using the bootstrap for the primary endpoints. The statistical properties still aren’t fully explored in the case of strong skewness and kurtosis.

https://www.fda.gov/media/102657/download (it’s a draft guidance, but people comply)

3) it requires some data to work, a few tens at least. It must be sampled representatively, or – garbage in = garbage out, regardless of the number of samples.

4) it requires lots of samples. One may end up with tens to hundreds of thousands, or the estimates get unstable. There’s no agreement on that, still.

* “Bootstrap confidence intervals – how many replications to choose?” https://tinyurl.com/y2sqnc54

* “Why on average does each bootstrap sample contain roughly two thirds of observations?” – https://tinyurl.com/y4qwb3yn

* “Rule of thumb for number of bootstrap samples” – https://tinyurl.com/y5jvpklb

* “Choosing the number of bootstrap resamples” – https://tinyurl.com/y33old8s

* “Can we use bootstrap samples that are smaller than original sample?” – https://tinyurl.com/y2pqxhff

5) excessive skewness / fat tails may affect it

6) One should decide on which type of CI to use. BCa is commonly advised, but too often fails to calculate (it happens to me very often). The percentile CIs are often blamed (more than once, our sponsors disallowed them at work when the BCa failed; studentized CIs were requested instead).

* “What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum” – https://tinyurl.com/y55l5tel

* “Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians” – https://tinyurl.com/ycttwfnk

* “[SAS] The bias-corrected and accelerated (BCa) bootstrap interval” – https://tinyurl.com/y3qn5qf2

7) seed of the RNG should be always saved for reproducibility.

I hope you find something useful in it 🙂

David says

Hi Jim,

Thanks for the excellent report. FYI, Scientific American 1983 Vol 248(5):116 may be useful. My question is whether you have commented on the issue of multiple testing (and proper corrections therein, i.e. Bonferroni, Benjamini-Hochberg) versus resampling. Literature/blogs indicate this to be a longstanding and unresolved issue. I would be curious to read your comments.

Kudos, David.

Jim Frost says

Hi David, I’m much less familiar with multiple comparison methods for bootstrapping than for parametric tests. So, I don’t have a great deal of insight to add here. However, I’d imagine it’s a solvable problem. Both bootstrapping and parametric tests use a sampling distribution to calculate probabilities and determine significance for a single comparison. Bootstrapping uses resampling to create its sampling distributions while parametric methods use probability distributions (t, chi-square, F, etc.).

Multiple comparison methods will typically adjust what is considered significant. In simple terms, if you know the Type I error for a single comparison, you can figure out the family-wise error rate. I write about this in my post about ANOVA and multiple comparisons. You can then adjust the individual Type I error rate so that the family rate is a desired value. So, I don’t have first-hand knowledge about these methods for bootstrapping, but they should be solvable.

I did a quick search and there do seem to be multiple comparison methods available for bootstrapping. Here’s an article about using multiple comparison methods with bootstrapping: On Using the Bootstrap for Multiple Comparisons.

Thanks for the great question!

Emily Stern says

Thanks for your intuitive blog post. Very helpful.

Dan Grove says

Hey Jim, thanks for this post! I didn’t know anything about bootstrapping before reading this. Explaining it via statistical concepts that I am familiar with (and the easy to follow example) made it a really easy and informative read. Cheers!

Swathi says

Hi sir

Myself Swathi. Unknowingly I opened your blog. It’s really a lovely explanation about bootstrap. Before reading this article I thought that bootstrap is a complicated topic, but after reading, my thoughts have changed. Before your article I read many articles, but my problem was not solved. Thank you so much for your lovely explanation.

Akhil Mathew says

Hi Jim,

I love your blog posts and they have helped me understand many concepts. Now I came across a scenario where I need to know how would you solve this.

My question is regarding how to compare two or more bootstrap results. I am trying to do a AA test to find best matching control cases for a test group from a big pile of observations. Basically once I check for the differences, there should not be any as I am giving no special treatment to the test group. I do a bootstrapping once I finish finding test and control cases to see sampling distribution of the differences between the test and controls.

The problem is that I am trying different approaches to match the test and controls, but I am not sure how to compare the bootstrap results. Most of the results have a mean around 0 but not exactly 0 (luckily none of my means were above 1), with a wide range of standard deviations. I want to know whether there is any metric to find the best result with the least mean and least standard deviation. I have thought about just multiplying the mean and standard deviation, but I do not think that has a statistical explanation. Differing means and standard deviations are actually not helping me with z-score values.

If you still didn’t understand my problem, this question is almost the same thing. https://www.researchgate.net/post/How-can-I-compare-two-errornormal-distributions-to-find-which-one-is-better-rather-than-simply-finding-difference-in-their-means

Most of the people here suggest to go with the distribution with least standard deviation which means least uncertainty, but what if the mean value was 0.9? I want the least values for both 🙁

Please let me know how would you solve this problem!

Thanks,

Akhil

Rachel says

Hi Jim. You say sample sizes as small as 10 can be used. Is this a rule of thumb? Can you point me to any relevant literature regarding this?

Al says

Hello Jim –

Thank you for this blog post. The explanation was great. You mentioned that bootstrapping can be used to estimate regression coefficients. Can you present some post on how this can be done using software like Excel or Stata or R?

Kalie says

This explanation was so helpful – much better than what I received in class. Thanks so much!

Jim Frost says

Hi Kalie,

Thanks so much for writing! Your comment made my day! 🙂

Mahmud says

That’s a very nice introduction about Bootstrap sampling. I had some ideas of Bootstrap sampling but I was not very clear about all the aspects. Your clear cut explanation makes everything very clear to me.

Thanks again. May Allah bless you.

Steven Zenos says

Hi Jim, if we believe the central assumption for bootstrapping to be true, namely the original sampling results accurately represents the actual population, can it be used to increase precision in DOE, by bootstrapping every ‘y’ dependent variable outcome?

Jim Frost says

Hi Steven,

I’m not exactly sure what you’re proposing and what you’re using as a baseline level of precision for that comparison. However, if you’re talking about using bootstrapping to increase precision as compared to a parametric analysis, that depends. If the parametric analysis adequately fits the data, you probably won’t get better precision with bootstrapping. However, if there is some assumption violation that causes an unresolvable problem with the parametric analysis, then bootstrapping might improve your results. Although, that might do more to reduce bias than necessarily increasing precision–but it would be an improvement.

Lasheen says

Thanks a lot for your help.

Lasheen says

Thanks a lot for your reply. I think that I got your point. It seems to me that when we are talking about the bootstrap or nonparametric test, it is easier to speak in terms of the confidence interval than the hypothesis test. On the other hand, when we describe a t-test, it is more convenient to talk in terms of the hypothesis test than a confidence interval. Actually, they are two heads of the same coin. Is that right?

Jim Frost says

Hi Lasheen,

Again, no, t-tests and bootstrapping aren’t really more convenient to discuss in the way you’re saying. They’re actually fairly similar. The key difference is how they estimate the sampling distribution. After that point, the hypothesis test and CIs for both methods are fairly similar.

And, yes, hypothesis tests and CIs are two sides of the same coin. My post about how confidence intervals work explores this fact.

Lasheen says

Thanks for your reply. Suppose we used the bootstrap distribution as the null distribution, and we set the mean value of that distribution as the null hypothesis. Then we will need to prove that the sample lies within, for example, 95% of the distribution. That is opposite to the t-test, where we need to prove that the sample lies in the tail of the distribution. Does that make sense?

Jim Frost says

Hi Lasheen,

The bootstrap version of the t-test and the actual t-test follow the same basic approach. They are *not* opposite. They just use different methodology to calculate the sampling distribution.

I’d read my post about how hypothesis tests work. That is based on a one-sample t-test. When I show the sampling distribution of the means, which is based on the t-distribution, just mentally replace it with the bootstrap distribution centered on the null hypothesis value (the reference/target value). If the sample mean lies beyond the 2.5th or 97.5th percentile of the bootstrap distribution (but centered on the null value), then it is statistically significant.

Lasheen says

Thanks a lot for your post, very easy to follow. I am wondering what is the null hypothesis (and hence null distribution) in that context.

Jim Frost says

Hi Lasheen,

It would still be the same null hypothesis as in the parametric scenario. In this case, we’re talking about the sample mean for one-sample. We’d need to use a reference or target value for the null hypothesis value just like we’d do for a 1-sample t-test. The test (either parametric or bootstrapping) determines whether the difference between your population parameter estimate and the reference value is statistically significant.

Michael says

Hi Collin. you brought up an interesting point, “if you could first calculate all the possible combinations”. I took a stab at making this calculation with Jim’s data.

Assuming 92 people, order does not matter and we can have a person in the output more than once I think the number of total possible unique samples is 7.2016213874e+53.

I imagine there is no software readily available that could feasibly run this number of simulations!

With Jim running 500,000 simulations, what is the probability that any of the simulations produced identical results!?

Jim Frost says

Hi Michael,

I describe how to calculate the number of unique samples in my reply to Collin. I’m not sure how you calculated yours, but the correct result is different.

famousdavispmp says

Collin, if you’re not too familiar with Monte Carlo simulation, you might find this spreadsheet helpful. I used it when doing a webinar a couple of months ago. This is an Excel spreadsheet, but it should work for Google Sheets users, too, since it uses built-in Excel functions (no plug-ins, nothing to install, it’s just a spreadsheet).

https://www.statisticalpert.com/download/1642/

In the spreadsheet, I simulate the rolling of two six-sided dice. The question I’m trying to answer through simulation is how likely am I to roll a 7? We already know the answer. There are six ways to get a “7” when rolling a pair of six-sided dice, and there are 36 possible combinations, so there is a 6/36 = 16.67% chance of rolling a 7 on any single roll.

In the spreadsheet, you’ll see that the greater the number of simulated trials, the more accurate the simulated results are when compared to what we know is the true probability of getting a “7” (16.67%).

When we simulate with 100 trials, the results are not too accurate. Simulating with 1000 trials improves accuracy, and with 10,000 trials even more accuracy is achieved. Had I included a worksheet with 100,000 trials, the results would have been very, very accurate.

Of course, we don’t need to run a Monte Carlo simulation to solve a simple problem like this. We use Monte Carlo simulation when the problems are much more complex and the answers are anything but obvious and couldn’t be solved simply by using a math formula.

Collin M says

thank you Jim for that insight

Collin M says

Hi Jim, thank you for making statistics so intuitive. I am an undergraduate student of BSc Agriculture, and your posts have made me feel like a PhD student who can interpret the whole research process.

But my question is, why did you choose 500,000 bootstrap samples rather than +/- 100,000 bootstrap samples, for example?

I was thinking you could first calculate all the possible “combinations” or “orders”, let me call them so, which can come out of that sample. In other words, how many different samples you can re-arrange from that sample without making repetitions of the bootstrap samples.

Jim Frost says

Hi Collin,

There are several reasons why I chose such a high number. For one thing, it’s more likely to produce a nice smooth graph that looks nice. Additionally, if anyone tries to replicate the results, their results will tend to be closer to mine with higher numbers. By the way, I’ve added a link to my script in this post so anyone can try it on their own. You can easily change the number of bootstrap samples in the script

However, I probably used far more bootstrap samples than needed. I just reran the analysis with 100,000 bootstrap samples and obtained virtually identical results. Modern computing power makes it easy to go overboard! For the program to create the 500,000 samples and perform all the follow-up calculations, it took only a matter of seconds! And, there’s no harm in going with more samples than necessary. A good rule of thumb is to increase the number of bootstrap samples up to the point where you’re getting consistent results from one run to the next. If the results change much between runs, you have too few bootstrap samples.
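Here is a rough Python sketch of that rule of thumb. It is not the script linked in the post. The data are 92 made-up values standing in for a real dataset, and the function name and seeds are my own assumptions. For each number of bootstrap samples, it builds a 95% percentile interval twice with different random seeds and reports how far apart the two runs land.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_boot, seed):
    """95% percentile bootstrap CI for the mean, using n_boot resamples."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    lower = means[n_boot * 25 // 1000]       # roughly the 2.5th percentile
    upper = means[n_boot * 975 // 1000 - 1]  # roughly the 97.5th percentile
    return lower, upper

# Hypothetical data: 92 simulated values, NOT the dataset from the post.
data_rng = random.Random(42)
data = [data_rng.gauss(28, 7) for _ in range(92)]

for n_boot in (200, 2_000, 20_000):
    run_a = bootstrap_mean_ci(data, n_boot, seed=1)
    run_b = bootstrap_mean_ci(data, n_boot, seed=2)
    gap = max(abs(run_a[0] - run_b[0]), abs(run_a[1] - run_b[1]))
    print(f"{n_boot:>6} resamples: run A = ({run_a[0]:.2f}, {run_a[1]:.2f}), "
          f"run B = ({run_b[0]:.2f}, {run_b[1]:.2f}), largest gap = {gap:.3f}")
```

When the largest gap is small relative to the precision you plan to report, adding more bootstrap samples is unlikely to change your conclusions.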

For calculating possible combinations, it’s just n^n, where n is the sample size. My dataset has 92 values, which means the number of possible combinations is 92^92 = 4.66e+180! That’s a huge number! While 500,000 seems like a lot, it’s a tiny proportion (1.07e-175 to be precise) of all possible combinations. Although, again, 500,000 is more than sufficient for this dataset.
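The arithmetic is easy to check in Python because its integers have arbitrary precision. This small sketch uses variable names of my own choosing. It computes n^n for n = 92, which counts resamples where the order of the draws matters, along with the share of them covered by 500,000 bootstrap samples. As a side note, it also counts resamples that differ in content rather than just order, which is the multiset count C(2n-1, n).

```python
import math

n = 92
ordered_resamples = n ** n                # every ordered way to draw 92 values with replacement
proportion = 500_000 / ordered_resamples  # share covered by 500,000 bootstrap samples

print(f"n^n        = {ordered_resamples:.2e}")  # 4.66e+180
print(f"proportion = {proportion:.2e}")         # 1.07e-175

# Ignoring order, the number of distinct resamples is the multiset count C(2n-1, n).
distinct_resamples = math.comb(2 * n - 1, n)
print(f"distinct (order ignored) = {distinct_resamples:.2e}")
```

Either way you count, enumerating every possible resample is far out of reach, which is why random resampling is used instead.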

famousdavispmp says

“I wasn’t quite clear on your Excel methodology. Did you create 500,000 bootstrapped samples using resampling with replacement where n=92 and then calculate the mean for each one?”

Yes, that’s exactly what I did. Once I set up the model, I just tell @Risk to run a simulation 500K times. Each simulation uses n=92 with replacement. @Risk creates the bell curve thanks to the CLT, which is what I used to compare its CI with yours.

@Risk is sophisticated, so I’m sure there’s a reason why the CI is different, but it’s just a curiosity to me now because using Excel’s built-in functions I can almost identically match your results, so it confirmed I understand the process correctly.

Jim Frost says

Great! I’m not familiar with @Risk. I might need to look it up! There are different bootstrap CI methods. I’m sure they’re just using one of the other methods.

david says

Hi Jim! I wondered if you know why using the bootstrap to compute a confidence interval for the maximum value of a variable is problematic?

Jim Frost says

Hi David, the maximum value possible in the bootstrap simulated samples is the maximum observed value in the actual sample. Bootstrap samples are unable to go beyond that maximum sample value. It’s a sharp cutoff. However, sampling distributions based on a probability distribution can go beyond the observed value in a sample. Confidence intervals based on those probability distributions can therefore use that information that lies beyond the maximum observed value. The same is true for minimum values.
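A quick Python sketch makes the cutoff visible. The data here are hypothetical: 50 draws from a uniform population on 0 to 100, so the true maximum is 100. No bootstrap sample can produce a maximum above the observed sample maximum. A large share of them (about 1 - (1 - 1/n)^n, roughly 63%) land exactly on it.

```python
import random

rng = random.Random(0)
# Hypothetical sample: 50 draws from a uniform(0, 100) population,
# so the true population maximum is 100.
sample = [rng.uniform(0, 100) for _ in range(50)]
sample_max = max(sample)

n_boot = 10_000
boot_maxes = [max(rng.choices(sample, k=len(sample))) for _ in range(n_boot)]

share_at_sample_max = sum(m == sample_max for m in boot_maxes) / n_boot
print(f"sample maximum:            {sample_max:.2f}")
print(f"largest bootstrap maximum: {max(boot_maxes):.2f}")
print(f"share of bootstrap maxima equal to the sample maximum: {share_at_sample_max:.3f}")
```

The bootstrap distribution of the maximum is lumpy and piled up against the observed value, so a percentile interval built from it cannot reach the region where the true maximum actually lies.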

famousdavispmp says

Hi Jim, here’s a follow-up comment. I tried using Excel’s built-in function, PERCENTILE.EXC, against 4000 simulated iterations (of 92 samples each) by creating a big data table in Excel (so I wouldn’t use my @Risk plug-in at all). Using this approach, I nearly exactly matched your results: 2.5% was 27.20 and the 97.5% was 29.99, which of course nearly matches your simulation.

So for some reason, when I use the Palisade @Risk program to run the simulation, I’m getting a slightly different result, but I have no idea why.

Jim Frost says

Hi,

I’m not familiar with Palisade @Risk or the methodology that they use. I do know that there are several methods for creating bootstrap CIs. I used the simplest because it was easiest to illustrate.

I wasn’t quite clear on your Excel methodology. Did you create 500,000 bootstrapped samples using resampling with replacement where n=92 and then calculate the mean for each one? I wasn’t quite clear about the portion where you write, “Then I take the AVERAGE of those 92 bootstrapped values, and run the simulation 500K times.” Maybe you’re saying the same thing a different way? The results in your second comment are almost identical!

famousdavispmp says

Hi Jim. Thank you for your article. I downloaded your dataset of body fat samples and tried bootstrapping in Excel to see if I could match your results. I’m using Excel’s RANDBETWEEN function 92 times for each bootstrapped iteration and finding the mean for each iteration. Then I use Palisade’s @Risk Excel plug-in to run the simulation 500,000 times just like you.

My results are a little different, though, and I’m wondering why? The mean from my simulation is 28.42, and the 95% confidence interval is 27.54 (2.5%) and 29.59 (97.5%). I knew my results wouldn’t necessarily match your results *exactly* but they are different enough to make me wonder why?

The model is pretty simple. I opened your body fat dataset, and in a column over I use =INDEX($A$2:$A$93,RANDBETWEEN(1,92)) and copy that 92 times, one for each of the 92 rows of data. Then I take the AVERAGE of those 92 bootstrapped values, and run the simulation 500K times. Using @Risk, I can see the statistics for the cell with the AVERAGE function which is where I obtained my confidence interval.

Any thoughts why I can’t match exactly your results? I noted that you’re using a different program to run your simulation.
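For readers following along without Excel, the same steps can be sketched in Python. This is only an illustration under stated assumptions. The 92 values are simulated placeholders rather than the body fat data. The use of `statistics.quantiles` with `method="exclusive"` is my assumption about which rule is closest to PERCENTILE.EXC. Different percentile rules and different bootstrap CI methods will give slightly different endpoints.

```python
import random
import statistics

rng = random.Random(2024)
# Placeholder data: 92 made-up values standing in for the column A2:A93.
data = [rng.gauss(28, 7) for _ in range(92)]
n = len(data)

def one_bootstrap_mean():
    # Same idea as copying =INDEX($A$2:$A$93,RANDBETWEEN(1,92)) down 92 rows
    # and taking the AVERAGE of that column.
    resample = [data[rng.randint(0, n - 1)] for _ in range(n)]
    return statistics.fmean(resample)

means = [one_bootstrap_mean() for _ in range(4000)]

# n=40 yields 39 cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
cuts = statistics.quantiles(means, n=40, method="exclusive")
lower, upper = cuts[0], cuts[-1]
print(f"95% percentile bootstrap CI: ({lower:.2f}, {upper:.2f})")
```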

[email protected] says

noted with thanks

Stan Alekman says

I have some very good articles on resampling: permutation, jackknife, bootstrapping. If you send me an email – [email protected] – I will send them to you.

Stan Alekman

Khan says

Can I have an article on types of bootstrapping? Which type of bootstrapping is used in Sem-amos?

Jim Frost says

Hi Khan,

As of now, I have just this one article about bootstrapping. I might write more down the road. I’m not familiar with SEM AMOS other than it’s an SPSS module for structural equation modeling. So, I’m not sure what methodology they use.

Aarav says

Hi Jim,

Thanks for the wonderful post. Can you create a post of using bootstrapping for hypothesis testing?

Thanks

LJ Legaspi says

Sir, is bootstrapping a nonparametric test?

Jim Frost says

Hi LJ,

There are actually nonparametric and parametric forms of bootstrapping. The most common form is the method I show in this post, which is a nonparametric method. This method creates new samples of the same size using sampling with replacement as shown.

However, there is a parametric form of bootstrapping. That form assumes that the population you are studying follows a particular distribution, such as the normal distribution, Poisson, or whatever. The procedure then estimates the parameters for that distribution from your sample. The procedure then uses the estimated distribution to produce the new samples.

So, yes, the type I show, which is the most common, is nonparametric. However, just be aware that there is also a parametric type of bootstrapping.
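To make the distinction concrete, here is a small Python sketch that runs both forms on the same hypothetical sample. The data, the assumed normal distribution, and all the names are my own choices for illustration. The parametric version estimates the mean and standard deviation, then draws new samples from that fitted normal distribution. The nonparametric version resamples the observed values with replacement.

```python
import random
import statistics

rng = random.Random(7)
sample = [rng.gauss(50, 10) for _ in range(40)]  # hypothetical data
n = len(sample)

# Parametric step 1: estimate the parameters of the assumed normal distribution.
mu_hat = statistics.fmean(sample)
sigma_hat = statistics.stdev(sample)

def parametric_mean():
    # Parametric step 2: draw a new sample from the fitted distribution.
    return statistics.fmean([rng.gauss(mu_hat, sigma_hat) for _ in range(n)])

def nonparametric_mean():
    # Nonparametric: resample the observed values with replacement.
    return statistics.fmean(rng.choices(sample, k=n))

def percentile_ci(draw_mean, n_boot=5000):
    means = sorted(draw_mean() for _ in range(n_boot))
    return means[n_boot * 25 // 1000], means[n_boot * 975 // 1000 - 1]

parametric_ci = percentile_ci(parametric_mean)
nonparametric_ci = percentile_ci(nonparametric_mean)
print(f"parametric (normal assumed): ({parametric_ci[0]:.2f}, {parametric_ci[1]:.2f})")
print(f"nonparametric:               ({nonparametric_ci[0]:.2f}, {nonparametric_ci[1]:.2f})")
```

Because these made-up data really are normal, the two intervals should come out close. With data that depart from the assumed distribution, the parametric version inherits that wrong assumption.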

Stan Alekman says

Resampling simply runs a Monte Carlo simulation on existing data to give some idea about the influence of extreme values. It does not ask why extreme values or outliers are present. It does not test for outliers and cull them. It simply tries to average out their effects. But in non-experimental settings, outliers are critical. They are the signals that tell us of the presence of assignable causes. Resampling sidesteps the assumption of independent and identically distributed random variables without having to deal with outliers. The emphasis is completely upon estimation of parameters, not process characterization or improvement. Given this difference in emphasis, it works.

If I see appreciably different results between the usual tests and resampling, I would suspect the data of having come from an unpredictable process. In that case the resampling results would provide estimates with less variation, but the question of whether or not those estimates were estimates of one parameter or many different parameters would remain unanswered. Resampling works with data that are mostly homogeneous with only a few outliers.

Stan Alekman

Dwasch says

Hello.

Thanks for this helpful summary. Am I correct in understanding bootstrapping doesn’t rely on either a normality assumption or (for group comparison) a homogeneity of variances assumption? If so, could you point to a reference without too much hassle? Would be helpful for a revise and resubmit.

Thanks!

Stan Alekman says

The question then remains: is the bootstrap confidence interval more reliable (closer to the truth) than is the confidence interval by traditional means? Without an answer or consensus, decisions based on analysis will not necessarily be the best we can make. We strive to make the best evidence based decisions.

Jim Frost says

Hi Stan,

This being statistics, the answer is a definite, “it depends.” I know that’s not helpful but a blanket answer isn’t possible. There are some cases where your data just don’t fit an existing analysis. It might deviate from the assumptions too much. Or, perhaps the appropriate test does not exist. In those cases, bootstrapping is clearly superior.

However, in other cases where your data completely satisfy the assumptions of a proven test, it’s harder to make the case that either method is superior. I’d say that bootstrapping is more flexible in terms of the conditions and tests that it can handle. I also haven’t thoroughly researched bootstrapping and might be unaware of how it compares to traditional methods (such as t-tests and CIs) when your data do satisfy the assumptions. I wouldn’t be surprised if someone performed a simulation study to look into this question. If this is a question you face for a study, it would probably be wise to research it.

I also don’t know the properties of your data. My sense is that the more closely your data follow the normal distribution the more equivalent the two approaches become. However, as your data diverge from the normal distribution, I’d expect bootstrapping to become the better analysis. However, I’m not familiar enough with that literature to give you practical advice for making that decision.

In statistics, knowing which test is better typically depends on understanding the characteristics of your data and the stringency of the relevant requirements. This holds true for deciding between traditional vs. bootstrapping methods.

Stan Alekman says

I wonder. I collect a sample and estimate a mean and confidence interval by the traditional t-distribution.

Then I re-estimate the mean and confidence interval by bootstrapping and find a somewhat different mean and a narrower interval.

Is it appropriate to report the boot strap estimates? Will bootstrap estimates be acceptable for journal publications? Are boot strap estimates superior?

Jim Frost says

Hi Stan,

Unfortunately, I don’t have concrete answers for your questions. In terms of what journal publications will accept, that will vary by field and journal. Most journal articles I’ve read use the traditional t-distribution, tests, and CIs. I think that’s mainly due to familiarity and tradition rather than it being better. Most people are more familiar with the traditional hypothesis tests. However, that’s not to say that journals won’t accept bootstrap results. I’d look into what the journal has published as well as others in your field. There is a good case to be made for bootstrap methods.

Where I think the bootstrap method really shines is for cases where you don’t satisfy the assumptions for a traditional test. Or, perhaps there isn’t even a traditional method for what you want to accomplish. That’s where I’d say that bootstrapping is superior. If you have data that satisfy the assumptions, my sense is that both methods are similarly good.

Sorry for the vague answer. But, I don’t think a concrete one exists!

Fizza says

Hey. How can I use bootstrapping for multiple regression?

Shashank Garg says

Thank you Jim for such a simple explanation of bootstrapping. I had been trying to grasp the basics of the method for a long time but was not able to figure it out. Now it will be easier for me to understand further details of it. I was also not a supporter of the theory that all phenomena are normally distributed. Although bootstrapping also makes assumptions, we still have something new to ponder.

Jim Frost says

Hi Shashank,

You’re very welcome! As someone who “grew up” on traditional hypothesis testing procedures, learning about bootstrapping was very interesting.

In traditional hypothesis testing, it’s true that not all distributions are normal. However, the central limit theorem is our friend in that regard because, with a large enough sample, the sampling distributions approximate the normal distribution, which satisfies the assumption for those tests.
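Here is a small Python sketch of the central limit theorem at work. The population is an exponential distribution that I chose for illustration because it is strongly right-skewed, and the helper names are hypothetical. The code draws many samples of a given size, records each sample mean, and measures how skewed the resulting sampling distribution is. A value near zero indicates a symmetric, more normal-looking shape.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

def sample_means(sample_size, reps, rng):
    """Means of `reps` samples, each of `sample_size` draws from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(reps)
    ]

rng = random.Random(5)
for sample_size in (2, 10, 50):
    means = sample_means(sample_size, 20_000, rng)
    print(f"n = {sample_size:>2}: skewness of the sample means = {skewness(means):.2f}")
```

The skewness of the sample means should shrink toward zero as the sample size grows, even though the underlying population stays just as skewed.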

saroja says

I love the way you make the concept clear

Debanjan says

So, should my original mean fall within the bootstrap confidence interval or not?

Jim Frost says

For a 95% confidence interval, you can be 95% confident that the interval contains the population mean. The population mean is the unknowable parameter that we’re estimating with a sample. So, yes, the process typically produces intervals that contain the population parameter. However, occasionally it won’t because of an unusual sample. Of course, this assumes that you’re drawing a random, independent sample.

Karan Desai says

Hi Jim,

This is the first time I read about bootstrapping and loved the concept. No wonder the name of the method is bootstrapping. You have explained it really well. Your blog is a gem.

Stanley Alekman says

Thanks for the explanation. I failed to understand that earlier.

Stan Alekman

Jim Frost says

You bet. And, it probably means I didn’t explain it clearly enough!

Stan Alekman says

Not to be argumentative, inference from a single non-representative sample as opposed to a hundred thousand resamples from a single non-representative sample seems like the wrong direction to take.

Jim Frost says

It actually works out to be fairly equivalent. The traditional approach uses the sample to calculate a sampling distribution, such as the t-distribution. That distribution is calculated from your one sample and it is equivalent to the distribution you’d obtain after performing the analysis (e.g., t-test) an infinite number of times. If your sample is not representative, that distribution will not be correct.

I’m not trying to convince you to use bootstrapping by any means. But a non-representative sample will affect the sampling distribution for both approaches because both use a single sample to estimate a sampling distribution. The methodology to produce that sampling distribution is different (resampling vs. formulas), but the end results are similar.
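As a rough illustration of “resampling vs. formulas,” this Python sketch estimates the standard error of the mean both ways from one sample. The data are simulated placeholders, and the variable names are my own. The formula approach uses s / sqrt(n). The bootstrap approach uses the standard deviation of the resampled means.

```python
import math
import random
import statistics

rng = random.Random(11)
sample = [rng.gauss(28, 7) for _ in range(92)]  # hypothetical data
n = len(sample)

# Traditional approach: standard error of the mean from a formula.
formula_se = statistics.stdev(sample) / math.sqrt(n)

# Bootstrap approach: spread of the means across many resamples.
boot_means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(5000)]
bootstrap_se = statistics.stdev(boot_means)

print(f"formula standard error:   {formula_se:.3f}")
print(f"bootstrap standard error: {bootstrap_se:.3f}")
```

Both numbers are computed from the same single sample, so if that sample is unrepresentative, both inherit the problem.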

I haven’t used bootstrapping methods extensively myself. My training and experience have been with the traditional methods. However, the research that supports the validity of bootstrap methodology is very strong.

Stan Alekman says

Thanks for the info re bootstrapping regression coefficients, etc. Frankly, I am distrustful of bootstrap estimates. The underlying assumption is that the original sample mimics the population. It is very difficult to collect truly random samples in industrial settings.

Jim Frost says

I think in general it’s harder than commonly recognized to get a truly random, representative sample. The only thing that I’d add is that the traditional statistical methods also assume representative samples. So, if that’s a problem, it’ll affect both bootstrap and traditional methods.

Aijaz Ahmad Dar says

I am interested in bootstrapping and I am using it. But I have a question that I have asked many people without getting an answer. How do I find the confidence interval (C.I.) for a support parameter (I mean the situation where the MLE is the first-order or nth-order statistic)? Examples include the Pareto distribution and the power distribution.

Stan Alekman says

Thank you. Look forward to it.

Stan Alekman says

Can you reply to my specific question regarding tolerance intervals by bootstrapped mean and bootstrapped standard deviation? This would be an excellent procedure, if valid, to generate precise tolerance intervals.

Thank you.

Stan Alekman

Jim Frost says

Hi Stan,

You can create bootstrapped tolerance intervals. I don’t know enough about it right now to give you an intelligent response about it. That’s forthcoming after I learn more!

Stan Alekman says

Can you prepare an article describing how to bootstrap regression coefficients, and regression coefficient confidence intervals?

Can bootstrap estimates of means and standard deviations (as in your example) be used to estimate tolerance intervals using the bootstrapped mean +-k*bootstrapped sigma where k is the smallest value in the table since hundreds of thousands of bootstrap sampling steps are used to estimate the bootstrapped sigma?

Regards,

Stan Alekman

Jim Frost says

Hi Stan,

I was wondering what the reaction would be to bootstrapping. I had hopes there would be interest in it. I think it’s safe to say that there will be more articles about it!

Matt says

Jim, great article, generating lots of discussion among my peers. Thanks.

“An Introduction to Statistical Learning with Applications in R” by Gareth James et al. has a short section (5.2, pages 187-190) on bootstrapping, with an example on regression coefficients. Essentially, each bootstrapped sample draws the X and Y data together from the original, then you figure the regression coefficient for each bootstrapped sample. Across all bootstrapped samples, figure your statistic of the coefficient.
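The idea described above can be sketched in a few lines of Python. To be clear, this is not the book’s code. The data are simulated from a made-up line (y = 2 + 0.5x plus noise), and the function and variable names are my own. Each bootstrap sample draws whole (x, y) pairs with replacement, refits the slope, and the percentiles of those slopes give a confidence interval.

```python
import random
import statistics

rng = random.Random(3)
# Hypothetical data generated from y = 2 + 0.5x + noise.
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]
pairs = list(zip(xs, ys))

def slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    mean_x = statistics.fmean([x for x, _ in points])
    mean_y = statistics.fmean([y for _, y in points])
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

n_boot = 2000
# Resample whole (x, y) pairs so each y stays attached to its own x.
boot_slopes = sorted(
    slope(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot)
)
lower = boot_slopes[n_boot * 25 // 1000]
upper = boot_slopes[n_boot * 975 // 1000 - 1]

print(f"slope from the original sample: {slope(pairs):.3f}")
print(f"95% percentile bootstrap CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Resampling pairs is only one option. Other schemes, such as resampling residuals, make different assumptions about the model.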

Sampath says

It’s really interesting post. Thank you Jim.

Jim Frost says

Thank you, Sampath! I’m glad you enjoyed it.

Mcpheson says

Nicely intuitive.

Jim Frost says

Thank you!

محمد عبدالله محمد احمد says

Thanks a lot Mr. Frost

Jim Frost says

You’re very welcome!

ihsanullah says

please examples

Jim Frost says

Hi, I include a great example right in this post! 🙂