4 Days of Data / Round 2:

Thurs July 1, 2021


Watched StatQuest videos

They were very basic things about histograms and different distributions, such as the normal distribution. The normal distribution I know well by now. I understand that its prevalence is based on the Central Limit Theorem, which says that, given enough samples, the distribution of the sample means tends towards a normal distribution no matter what the underlying distribution looks like. In a normal distribution the mean is the highest point of the curve, and the mean, median, and mode are all the same value. About 95% of the measurements fall within 2 standard deviations above and below the center. It benefits me greatly to spell everything out because I do not have a background in statistics; writing these definitions out precisely helps me internalize them quickly.
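If I ever want to convince myself of the Central Limit Theorem again, a tiny numpy simulation would do it. This is just a sketch for my own reference; the exponential distribution, the sample size of 50, and the 5,000 repeats are numbers I picked arbitrarily for illustration:

    import numpy as np

    rng = np.random.default_rng(42)

    # Start from a clearly non-normal (right-skewed) distribution
    population = rng.exponential(scale=2.0, size=100_000)

    # Take many samples and keep each sample's mean
    sample_means = np.array([rng.choice(population, size=50).mean()
                             for _ in range(5_000)])

    # The sample means pile up symmetrically around the population mean,
    # which is the bell shape the Central Limit Theorem promises
    print(f"population mean:        {population.mean():.2f}")
    print(f"population median:      {np.median(population):.2f}  (far from the mean -> skewed)")
    print(f"mean of sample means:   {sample_means.mean():.2f}")
    print(f"median of sample means: {np.median(sample_means):.2f}  (close to the mean -> symmetric)")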

The importance of the bell curve is that:

  • it gives us the probability of measurements that we haven’t yet observed
  • it can save time and money in collecting measurements
  • it can give us estimates for measurements that fall in between, or outside of, the bins of a histogram

To draw the curve you need just two things (there is a quick sketch below):

  1. the average measurement gives you the center of the curve
  2. the standard deviation gives you the width of the curve
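To make that concrete, here is roughly how the curve comes out of those two numbers. This is just a sketch with made-up measurements (the 170 and 10 are placeholders, not real data), and it also sanity-checks the 95%-within-2-SD rule from above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical measurements; in practice these come from your data
    rng = np.random.default_rng(0)
    measurements = rng.normal(loc=170, scale=10, size=500)

    mu = measurements.mean()     # 1. the average gives the center of the curve
    sigma = measurements.std()   # 2. the standard deviation gives the width

    # Normal (Gaussian) probability density function drawn from mu and sigma
    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    plt.hist(measurements, bins=30, density=True, alpha=0.4, label="histogram")
    plt.plot(x, pdf, label="normal curve")
    plt.legend()
    plt.show()

    # Sanity check: roughly 95% of measurements fall within 2 standard deviations
    within_2sd = np.mean(np.abs(measurements - mu) <= 2 * sigma)
    print(f"fraction within 2 SD: {within_2sd:.2%}")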

Attended a Cloud Computing Seminar

This is definitely not in my wheelhouse at the moment, but it will be at some point. The biggest takeaways were just seeing more of the bigger picture: serverless computing allows for less on-premise hardware and the use of third-party services (Infrastructure-as-a-Service, Software-as-a-Service, Function-as-a-Service). For example, a VPC (Virtual Private Cloud) service allows for rapid growth and expansion because it is easy to scale up or down. You do not need to invest in memory or computing hardware up front; it can be “leased” as needed.

I got a clearer picture of Kubernetes. I understand that with tools like Docker you can package apps as “containers”. These containers are independent of one another and are small and lightweight. They can run on your machine or on other machines without ever being installed on them in the traditional sense. Kubernetes is the orchestrator that manages the containers and the machines they run on.

This is the first time I have heard of some of these services and my framework is still a little shaky, but I am sure I will learn more about this in the future.


Trained some ML models with Sklearn

I got some better results this round, probably because of applying some of the techniques I have learned about during the preprocessing stage. For the moment, preprocessing is where a lot of the important decisions are made that will influence the effectiveness of the models. Training the models, in comparison, is fairly easy and a blast! I’m sure I will find that the devil in the details is also hiding in the hyper-parameters, but I am enjoying trying different models. Once preprocessing is over, it is basically ‘plug-n-play’.
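I won’t reproduce the whole preprocessing notebook, but the part that matters for the results below is the log transformation of the target and the train/test split. This is only a sketch; the dataframe, column names, and sizes are placeholders since I am not posting the real data:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)

    # Placeholder dataframe -- stands in for the real dataset
    df = pd.DataFrame({
        "feature_1": rng.normal(size=200),
        "feature_2": rng.uniform(0, 100, size=200),
        "price":     rng.lognormal(mean=10, sigma=0.5, size=200),
    })

    X = df.drop(columns="price")
    # log1p compresses the long right tail of the target so the models train better
    y = np.log1p(df["price"])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )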

Below I will just post the coefficients, MSE, R2 (coefficient of determination), and RMSE, along with the scatterplot and distribution of both y_pred and y_test. Because we put the data through a log transformation, we need to undo the log before computing these metrics.
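For reference, the comparison below came out of a loop roughly like this one. Again just a sketch: it reuses the X_train/X_test/y_train/y_test from the preprocessing sketch above, and every model is left on its default hyper-parameters:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    # X_train, X_test, y_train, y_test come from the split in the earlier sketch
    models = {
        "Linear Regression": LinearRegression(),
        "KNN":               KNeighborsRegressor(),
        "SVR":               SVR(),
        "Random Forest":     RandomForestRegressor(random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)

        # Predictions are on the log scale, so undo the log before scoring
        y_pred = np.expm1(model.predict(X_test))
        y_true = np.expm1(y_test)

        mse = mean_squared_error(y_true, y_pred)
        print(f"{name}: MSE={mse:,.2f}  RMSE={np.sqrt(mse):,.2f}  "
              f"R2={r2_score(y_true, y_pred):.3f}")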

Linear Regression

[coefficients, metrics, and prediction plots for Linear Regression]

KNN

[metrics and prediction plots for KNN]

SVR

[metrics and prediction plots for SVR]

Random Forest

[metrics and prediction plots for Random Forest]

This is the result without even adjusting hyper-parameters. Random Forest is pretty much on point and faster than SVR. Now, this is still the training data; I haven’t opened the real test data yet, of course. But in my first round through this data set I got these results from LinearRegression:

So, I must have learned something since then…

(Edited: mistakes were made… I forgot that I had applied a log transformation to help the ML models train better, but when comparing with my previous attempt I needed to undo the transformation. In general the log must be undone before measuring RMSE, R2, and MSE, because the log compresses large discrepancies and makes those metrics look smaller than they really are. Images have been updated to reflect this. But still, I got better results this time around and that is what counts.)
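To see why the log has to be undone first, here is a tiny made-up example: predictions that are off by tens of thousands on the original scale look almost perfect once everything is squashed onto the log scale.

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # Made-up actual and predicted prices (original scale)
    y_true = np.array([100_000.0, 300_000.0, 500_000.0])
    y_pred = np.array([150_000.0, 250_000.0, 400_000.0])

    rmse_original = np.sqrt(mean_squared_error(y_true, y_pred))
    rmse_logscale = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

    print(f"RMSE on the original scale: {rmse_original:,.0f}")  # tens of thousands off
    print(f"RMSE on the log scale:      {rmse_logscale:.3f}")   # looks tiny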

