2 Days of Data / Round 2:

Tuesday June 29, 2021: Worked on setting up this blog yesterday. I still need to work on a few details, such as the readability, maybe a different font, and adding some images.

Applied to some jobs. Trying to find some startups and connect with real people.

I am trying to find the appropriate way to handle categorical data in the preprocessing stage for an ML model. Some of my data is categorical and also ordinal. I understand a DecisionTree will work with categorical labels.

But otherwise, I need to turn it into numerical data. I could use dummies, which would turn each category into its own 0/1 column. But I have three categorical features and each has about 5-7 labels. That would add close to 20 features, which, as I understand it, would make a dataframe with 40000 rows a lot bigger.
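To make the column count concrete, here is a minimal sketch with made-up data. The column names (`size`, `grade`, `region`, `price`) and their labels are hypothetical stand-ins, not my real dataset; the only point is to show how `get_dummies` widens the frame.

```python
import pandas as pd

# Hypothetical toy data: three categorical features with 5, 6 and 7 labels.
df = pd.DataFrame({
    "size": ["XS", "S", "M", "L", "XL", "M", "S"],
    "grade": ["A", "B", "C", "D", "E", "F", "A"],
    "region": ["N", "S", "E", "W", "NE", "NW", "SE"],
    "price": [10.0, 12.5, 9.0, 14.0, 11.0, 13.5, 8.0],
})

categorical = ["size", "grade", "region"]

# Each distinct label becomes its own indicator column;
# the original categorical columns are replaced.
dummies = pd.get_dummies(df, columns=categorical)

# Total number of distinct labels across the categorical features.
n_labels = sum(df[col].nunique() for col in categorical)

print("shape before:", df.shape)
print("shape after: ", dummies.shape)
print("indicator columns added:", n_labels)
```

The indicator columns are small (roughly one byte per cell in the pandas versions I know of), so the extra width probably matters more for the model than for memory.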

I understand that it needs to be smaller to be more effective, though I am not sure what that size is. I could use OrdinalEncoder from sklearn, which would map each category to a unique integer and help keep the number of features to a minimum. Or I could use OneHotEncoder from sklearn or get_dummies from Pandas, which would add one column per category.

The other question is whether to use drop_first=True (with get_dummies) or drop='first' (with OneHotEncoder), which would drop the first column of each encoded feature and prevent multicollinearity. However, Data School points out that this is not the best choice, especially if you are not going to put the data through a neural network or an unregularized regression, which are the cases sklearn’s documentation names for dropping a column.
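A small side-by-side sketch of the two options, again on a made-up `size` column (hypothetical, only for illustration). It calls `.toarray()` on the encoder output because OneHotEncoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature with three labels.
df = pd.DataFrame({"size": ["S", "M", "L", "M", "S"]})

# pandas: drop_first=True removes the indicator column of the first level.
full = pd.get_dummies(df, columns=["size"])
reduced = pd.get_dummies(df, columns=["size"], drop_first=True)

# scikit-learn: drop="first" does the equivalent inside the encoder.
X_full = OneHotEncoder().fit_transform(df[["size"]]).toarray()
X_reduced = OneHotEncoder(drop="first").fit_transform(df[["size"]]).toarray()

print(list(full.columns), "->", list(reduced.columns))
print(X_full.shape, "->", X_reduced.shape)
```

In both cases the dropped level is still recoverable: a row with zeros in all remaining columns belongs to the first category.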

I will create one version of each variation to try out and see what differences are apparent. I might then also try different models on each to see what results they give me.
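Roughly what I have in mind, as a sketch rather than the real experiment. The `quality` feature, its three levels, and the target are all invented here, and the target is derived directly from the feature, so the scores only show that the loop runs, not which encoding is better.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: one ordinal feature and a toy binary target.
rng = np.random.default_rng(0)
levels = ["low", "medium", "high"]  # assumed order, lowest to highest
X = pd.DataFrame({"quality": rng.choice(levels, size=200)})
y = (X["quality"] == "high").astype(int)

# One encoder per variation I want to compare.
encoders = {
    "ordinal": OrdinalEncoder(categories=[levels]),
    "one-hot": OneHotEncoder(),
    "one-hot, drop first": OneHotEncoder(drop="first"),
}

results = {}
for name, encoder in encoders.items():
    encoded = encoder.fit_transform(X)
    # OneHotEncoder returns a sparse matrix by default; densify it.
    if hasattr(encoded, "toarray"):
        encoded = encoded.toarray()
    model = DecisionTreeClassifier(random_state=0)
    scores = cross_val_score(model, encoded, y, cv=5)
    results[name] = (encoded.shape[1], scores.mean())
    print(f"{name}: {encoded.shape[1]} column(s), mean accuracy {scores.mean():.3f}")
```

Swapping `DecisionTreeClassifier` for another estimator in the loop would cover the "different models on each" part. For a real comparison, putting the encoder and model into a sklearn Pipeline would probably be cleaner.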

Got into visualization with Seaborn, which feels rusty after not having worked with it for a while. Some of my Axes come out empty while others show data, and I am not sure what is happening there. It happens specifically with pairplot; when I switched to scatterplot, it worked fine. But if I place pairplot in a row of Axes, the first one is empty. Couldn’t figure it out. I would like to go through all the tutorials once again and refresh my grasp on it.

I also watched Ken Jee’s talk with StatQuest’s Josh Starmer. StatQuest has also compiled a playlist for watching one video a day during the 66 day challenge.

 
