I know this post is a long time coming, but surviving the summer became the priority. It’s hot, and everybody is on vacation, which I can’t indulge in this year because I am trying to maintain my momentum in the pursuit of a job. Now that I have completed this challenge for the second time, I want to give a complete account of it on my blog. In the time since my last update, I have kept up my posts on the #66daysofdata Discord. Here I will give a quick rundown of each post.
– Refactored my web scraper for the jobs website Tecnoempleo. Now it is clean and its parts are reusable.
– Watched some StatQuests about The Chain Rule and Gradient Descent
– Started a walkthrough of Hands-On Data Analysis with Pandas, Chapter 2. I really like this book. It doesn’t skip over the unseemly details of data analysis the way many courses do. You know, like preprocessed datasets that don’t pose any real threat. Real-world datasets can be very frightening things to wrangle.
– applied to many jobs
– Walked through over half of Chapter 2 of Hands-On Data Analysis with Pandas.
– Watched some StatQuest videos about statistical power and power analysis. Power is the probability of correctly rejecting the null hypothesis. A power analysis calculates the sample size necessary for a high probability of correctly rejecting the null hypothesis.
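Both ideas can be illustrated by simulation with only the standard library: repeat a hypothetical experiment many times and count how often the null is correctly rejected. Everything here (effect size, group sizes, the normal-approximation test with a 1.96 cutoff) is an assumption for the sketch, not something from the videos.

```python
import random
import statistics

def estimated_power(effect_size, n_per_group, alpha_z=1.96, sims=2000, seed=42):
    """Estimate power by simulation: the fraction of simulated experiments
    where a two-sample z-test (normal approximation) rejects the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        # Group a: no effect; group b: true mean shifted by effect_size
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        se = (statistics.variance(a) / n_per_group
              + statistics.variance(b) / n_per_group) ** 0.5
        z = (statistics.mean(b) - statistics.mean(a)) / se
        if abs(z) > alpha_z:
            rejections += 1
    return rejections / sims

# With a medium effect (d = 0.5) and 64 per group, theory puts power near 0.8
print(estimated_power(0.5, 64))
```

Increasing the sample size or the effect size pushes the estimate toward 1; that trade-off is exactly what a power analysis formalizes when it solves for the sample size.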
– Connected to a new database and populated it for a course I am going to start this week.
– Watched more StatQuest, about standard deviation and standard error. The standard error is the standard deviation of the means of multiple sets of data samples.
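That definition can be checked numerically: draw many samples from the same population, record each sample’s mean, and take the standard deviation of those means. The population parameters below are made up for the demo.

```python
import random
import statistics

rng = random.Random(0)
n = 30            # size of each sample
trials = 2000     # number of repeated samples

# Draw many samples from the same population and record each sample's mean
means = [statistics.mean(rng.gauss(10.0, 2.0) for _ in range(n))
         for _ in range(trials)]

# The standard deviation of those means is the standard error
print(statistics.stdev(means))
```

Theory says the result should land near σ/√n = 2/√30 ≈ 0.365, which is also why the standard error shrinks as the samples get bigger.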
– StatQuest on logs and how a log scale affects distances on the number line/axis: it decreases the distance between points with values larger than 1 and increases the distance between values between 0 and 1.
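A quick way to see this with concrete numbers (chosen only for illustration):

```python
import math

def log_distance(a, b):
    """Distance between two points after a log10 transform."""
    return abs(math.log10(b) - math.log10(a))

# Above 1, the log scale compresses: 2 and 4 are 2.0 apart linearly,
# but only log10(2) ≈ 0.301 apart on the log axis.
print(log_distance(2, 4))

# Between 0 and 1, it expands: 0.2 and 0.4 are only 0.2 apart linearly,
# but the same ≈ 0.301 apart on the log axis.
print(log_distance(0.2, 0.4))
```

Any pair of points with the same ratio ends up the same distance apart, which is what makes log axes natural for multiplicative data.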
– StatQuest on linear regression models. This was a long one. It walks through using the residuals to find the least-squares fit, starting with the line that represents the mean with a slope of 0 and then progressively rotating the line. If we plot the sum of squared residuals for each rotation, the minimum gives us the slope of the best fit line.
But how can we know how good the best fit line is? We have to calculate R^2, which is the variance of the data around the mean minus the variance of the data around the fit line, divided by the variance of the data around the mean: R^2 = (Var(mean) − Var(fit)) / Var(mean). However, we need one more calculation to represent the confidence we have in this result: the F statistic, which gives us a p-value that helps quantify the confidence we can have in the R^2.
So, given data you think are related, linear regression will:
1) quantify the relationship in the data with R^2. (This needs to be large.)
2) determine how reliable that relationship is with the p-value we calculate from the F statistic. (This needs to be small.)
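The R^2 and F calculations above can be sketched with the standard library. The toy data are made up for the example, and the 1/n factors in Var(mean) and Var(fit) cancel in the ratio, so sums of squares suffice:

```python
import statistics

# Toy data, assumed roughly linear (made up for this sketch)
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]

# Least-squares slope and intercept
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Sums of squared residuals around the mean and around the fit line
ss_mean = sum((yi - y_bar) ** 2 for yi in y)
ss_fit = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# R^2 = (Var(mean) - Var(fit)) / Var(mean); the 1/n factors cancel
r_squared = (ss_mean - ss_fit) / ss_mean

# F compares the variation explained by the fit to the residual variation
# (1 extra parameter in the fit, n - 2 residual degrees of freedom)
n = len(x)
f_stat = (ss_mean - ss_fit) / (ss_fit / (n - 2))

print(f"R^2 = {r_squared:.3f}, F = {f_stat:.1f}")
```

Turning F into a p-value requires the F distribution’s tail probability, which isn’t in the standard library (e.g. `scipy.stats.f.sf(f_stat, 1, n - 2)`).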
These are my default stats lessons, and they are really great. Although I’ve worked with many of these concepts before, they are so clearly connected here that when I finish watching, all I can think about is statistics.
Aug 23 & 24
– I have finished two sections of the Advanced SQL course on Udemy. I am doing well with the challenges, which I am glad to see.
– StatQuest on General Linear Models Part 2: t-tests and ANOVA
– Web scraped a list of dictionaries containing info about Spanish startups where I can send my CV for spontaneous applications.
– Worked on the second half of Chapter 2 of Hands-On Data Analysis with Pandas
– Bad day….but I still applied to some jobs and did some Hands-On Data Analysis with Pandas
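The startup scrape above produced a list of dictionaries; the pattern can be sketched with just the standard library’s `html.parser`. The HTML snippet, the `startup` class name, and the `name`/`url` fields are all hypothetical stand-ins for the real site, not what I actually scraped.

```python
from html.parser import HTMLParser

class StartupParser(HTMLParser):
    """Collect one dict per company link from a (hypothetical) directory page."""

    def __init__(self):
        super().__init__()
        self.startups = []       # one dict per company
        self._in_name = False    # True while inside a company <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "startup":
            self.startups.append({"url": attrs.get("href")})
            self._in_name = True

    def handle_data(self, data):
        # The first non-empty text inside the <a> tag is the company name
        if self._in_name and data.strip():
            self.startups[-1]["name"] = data.strip()
            self._in_name = False

html = """
<ul>
  <li><a class="startup" href="https://example.com/a">Acme Labs</a></li>
  <li><a class="startup" href="https://example.com/b">Beta Analytics</a></li>
</ul>
"""
parser = StartupParser()
parser.feed(html)
print(parser.startups)
```

On a real site you would feed the parser the response body from an HTTP request instead of an inline string, but the list-of-dicts shape stays the same.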
Aug 26 & 27
– worked on some MySQL problems from the Advanced SQL queries course
– prepared for an interview that I will be having on Monday
– Started reviewing some Python concepts that I haven’t practiced much, such as error handling. I had a lab assignment to practice with, which was good.
– Sorted out my GitHub Personal Access Token
– Had an interview….it went well….waiting for the second interview tomorrow.
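The error-handling review mentioned above was along these lines; here is a tiny, self-contained example of the try/except pattern. The function and the decimal-comma detail are my own illustration, not the actual lab assignment.

```python
def parse_price(text):
    """Convert a scraped price string to a float, or None if malformed.

    Handles Spanish-style decimal commas ("19,99") and returns None
    instead of crashing on bad or missing input.
    """
    try:
        return float(text.replace(",", "."))
    except ValueError:
        # String that isn't a number, e.g. "n/a"
        return None
    except AttributeError:
        # Not a string at all, e.g. None
        return None

print(parse_price("19,99"))  # 19.99
print(parse_price("n/a"))    # None
```

Catching specific exception types (rather than a bare `except:`) keeps genuine bugs visible while still handling the failure modes you expect.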
– Had my second successful interview, for my first job in data….it is a temporary job and rather “soft,” but I will be working with a team of data scientists, developers, and non-technical people in order to act as a “translator” of the technical aspects of the Python code….hahaha….hey….if they need it….I’m pretty good at that. They work with DevOps, so I’ll be able to add that to my CV, plus work experience!
– Found a Python/Spark course to beef up my big data skills….I missed out on an opportunity because I had never worked with any Apache tools.
– did some of my Advanced SQL course…it is good to have something difficult to wrap my head around. This one is going to take me some time.
– attended a meeting about networking.
Aug 31 – Sept 2
I have spent the weekend reading a little statistics for ML and reviewing some terms and concepts:
– Did some more of the advanced SQL course
– Statistical power, power analysis, effect size, false discovery rate (Benjamini–Hochberg method), p-hacking, pooled estimated std. dev., the multiple testing problem….
– A book I started reading: Machine Learning: A Probabilistic Perspective by Kevin Murphy. It’s a pretty hefty book; I expect I will have to pick at it over time.
– I also picked up Information Theory, Inference, and Learning Algorithms by David MacKay….this book is adjacent to what a data scientist needs to be proficient at. It simply gives me more to dive into when I want to fill in the gaps. Both books are a little hefty, but….little by little.
– I had a meeting with my career coach, who is helping me modify my strategies for applying to positions and look for any openings that might be suited to me as a latecomer to data science. Changing professions is never easy, but when you have a humanities background, people don’t take your CV seriously for a data science role until you have experience. Stuck in recursion: can’t get a job till I have experience; can’t get experience till I have a job. (Not necessarily true….but that is the basic source of a lot of my difficulties. It would be much easier if I had studied a quantitative field like engineering.)
– I am going to continue my Round 2 until I really break my stride. I might try to tack on a 30ML challenge now that we have people joining from there.
Observations of my last 15 days:
– I was on a heavy rotation of web scrapers, Pandas, statistics, and advanced MySQL queries.
This was not the idea I had in mind for my blog. It should contain much more of a log of what I did and learned. I would like to get back to blogging because it really solidified what I accomplished each day. It is a great tool, both for me and for others, to get insight into my process. I will streamline this and start another round.