Days 6 & 7 of Data / Round 2: Weekend Edition

Saturday-Sunday, June 3-4, 2021


Weekends are lighter and less intense

This round, I think I will fill my weekends with light reading, watching videos, listening to some podcasts (especially the Real Python Podcast), and doing some cleanup. Maybe I'll also look for interesting news, like the photonic computer I discovered this weekend: a commercially available photonic computer is slated to hit the market at the end of the year. It will mostly impact the higher end of the ML market, but it will still be widely available. I don't think I will be using one anytime soon, but it applies directly to the AI and ML fields. This is a good video that explores the topic with the CEO of Lightmatter.


Stats Corner
Aside from cleaning up the README of my diamond prices project from last week, I also got a nice review of the basics of statistics with StatQuest's playlist for 66daysofdata. He has put the videos in a convenient, easy-to-follow sequence where each video builds on the last. You can check that out here. It is very approachable and worth the time to take notes.

Right now, I am reviewing the histogram, the distribution curve, and the reasons for these visualizations. The curve describes the whole population, but collecting data on an entire population is usually impractical. Since we are usually working with smaller sample data, we need to estimate the curve's mean and standard deviation. The key word here is estimate, not calculate. Because the mean and standard deviation are derived from sample data, we don't know the population parameters with certainty. We therefore also need to report how much confidence we should have in these estimates. By providing the estimates along with a measure of that confidence (such as a confidence interval or a p-value), we generate results that are reproducible for other samples.
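To make that concrete, here is a minimal sketch of the idea in Python (my own toy example with made-up numbers, not from the StatQuest videos): draw a sample, plot its histogram, and overlay the normal curve whose parameters are estimated from that sample.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(42)
sample = rng.normal(loc=170, scale=10, size=50)  # pretend these are 50 measurements

# Estimate (not calculate) the population parameters from the sample.
mean_est = sample.mean()
std_est = sample.std(ddof=1)  # ddof=1 divides by n-1 (more on this below)

x = np.linspace(sample.min() - 10, sample.max() + 10, 200)
plt.hist(sample, bins=10, density=True, alpha=0.5, label="sample histogram")
plt.plot(x, norm.pdf(x, mean_est, std_est), label="estimated normal curve")
plt.legend()
plt.show()
```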

Remember: if we have the population data, we calculate the population parameters. But if we only have sample data, then we estimate the population parameters. The sample mean is still the plain average, but when we estimate the population variance (and from it the standard deviation), we divide the sum of squared differences from the sample mean by n-1 instead of n, i.e. s² = Σ(xᵢ − x̄)² / (n − 1). The reason is that the differences are measured from the sample mean rather than the true population mean, which makes their squared sum almost always smaller than it should be; dividing by n-1 instead of n increases the result just enough to correct that underestimate.
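Here is a quick simulation (again my own illustration, not from the videos) that shows the effect: dividing by n systematically underestimates the true variance, while dividing by n-1 lands on target on average.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 25.0  # population: normal with standard deviation 5
n = 10
trials = 100_000

# Draw many samples of size n and compute squared deviations
# from each sample's own mean.
samples = rng.normal(0, 5, size=(trials, n))
dev_sq = (samples - samples.mean(axis=1, keepdims=True)) ** 2

var_n = dev_sq.sum(axis=1) / n         # divide by n
var_n1 = dev_sq.sum(axis=1) / (n - 1)  # divide by n-1

print(f"true variance:            {true_var}")
print(f"average of /n estimates:  {var_n.mean():.2f}")   # around 22.5, too low
print(f"average of /n-1 estimates: {var_n1.mean():.2f}")  # around 25.0, on target
```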

In simple terms, we should give reasonable preference to x-bar (the sample mean) and the n-1 formulas, since in practice we are almost always estimating population parameters from a sample.

The unanswered question I have is why we don't use absolute values instead of squaring the differences from the mean. That would eliminate the need for a two-step formula for standard deviation: as it stands, we always have to calculate the variance first (which squares each difference between a data point and the mean) and then take the square root of the variance to get the standard deviation. I will get to that answer at another point.
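Just to see what the question looks like in code, here is a quick side-by-side (my own example, using the n-1 convention from above): the two-step standard deviation versus the one-step mean absolute deviation that the question is pointing at. Both are valid measures of spread; they just answer it differently.

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])
mean = data.mean()

# Two-step: variance first (squared differences), then the square root.
variance = ((data - mean) ** 2).sum() / (len(data) - 1)
std_dev = np.sqrt(variance)

# One-step alternative: the average of the absolute differences.
mean_abs_dev = np.abs(data - mean).mean()

print(f"std dev (two steps):     {std_dev:.3f}")
print(f"mean abs dev (one step): {mean_abs_dev:.3f}")
```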


Goals for Week 2

I might try to refactor my fuel efficiency ML project; it was left wanting. I also have my PDF extraction side job that needs some attention. I will also continue reviewing statistics. Maybe I should try to build some visualizations with my fuel efficiency data.
