Monday June 5, 2021:
Ames Housing Dataset
I have decided that it would be interesting to try a real challenge: the Ames housing dataset, with about 3,000 records and 80 features. It was designed to be a more realistic update to the Boston housing dataset.
The great thing here is I really have no idea what to do. That means I can explore it any way I like. I read up on it and the professor who created the dataset provided a lot of details about how he presented the assignment to his students.
I think I will take some notes from his write-up and follow the 3 weekly objectives he sets up.
- Week 1: (the simplistic model) Only limited manipulation of variables is necessary. Additionally, you may eliminate any data points you deem unfit. Aim for a model with an R-squared of at least 73% that contains at least 6 variables.
- Week 2: (the complex model) This will be my best effort to get a predictive model.
- Week 3: (final complete analysis) This will consolidate all the analysis, interpretation, and information from the two previous weeks.
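To make the Week 1 target concrete, here is a minimal sketch of how R-squared can be computed from a model's predictions and checked against the two thresholds. The function names and the sample numbers are my own, invented purely for illustration; they are not from the professor's write-up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def meets_week1_goal(r2, n_variables, min_r2=0.73, min_variables=6):
    """Check the Week 1 objective: R-squared >= 73% with at least 6 variables."""
    return r2 >= min_r2 and n_variables >= min_variables


# Invented prices and predictions, not taken from the Ames data.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

r2 = r_squared(actual, predicted)
print(round(r2, 3), meets_week1_goal(r2, n_variables=6))
```

A model that clears the R-squared bar with fewer than 6 variables would still miss the objective, which is why both conditions are checked together.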
All the information about this dataset can be read here.
I set up my environment, read about the dataset, and played around with the categories. I can’t conceive of working with all 80 categories at once. I don’t know if there is a standard practice for this quantity of information.
I decided that it would be reasonable to divide up the features into topical groups.
- surrounds: These categories relate to anything about the neighbourhood, street, or lot configuration.
- necessities: These are the utilities and A/C or heater installations.
- ext_generalities: These are descriptions of building type, style, overall quality, age, and other exterior materials or features.
- int_specifics: These are descriptions of the standard rooms contained inside the house.
- extras: These are amenities that are not standard in every house: perks or features that either have been exploited (like a pool or a finished basement) or could be exploited (like an unfinished basement or garage).
- sale_situ: These give us indications of the situation in which the sale took place, like the type of sale (a contract or warranty deed) and condition of sale (a foreclosure or trade).
Organizing the features this way gives me 6 separate, more manageable datasets. Each dataset will contain the Price column, which is our target. This lets me focus on each aspect of the house in turn and start getting an idea of what preprocessing I need to do, for example how to handle certain missing values.
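The split described above could be sketched roughly like this. The column names, the target name "SalePrice", which columns go in which group, and the sample rows are all assumptions for illustration; they should be checked against the actual file header. The rows are plain dicts, as `csv.DictReader` would produce.

```python
# Hypothetical assignment of a few columns to the six topical groups.
# Column names are assumptions; verify them against the real file header.
GROUPS = {
    "surrounds": ["Neighborhood", "Street", "LotConfig"],
    "necessities": ["Utilities", "Heating", "CentralAir"],
    "ext_generalities": ["BldgType", "HouseStyle", "OverallQual", "YearBuilt"],
    "int_specifics": ["BedroomAbvGr", "KitchenAbvGr", "FullBath"],
    "extras": ["PoolArea", "BsmtFinSF1", "GarageArea"],
    "sale_situ": ["SaleType", "SaleCondition"],
}
TARGET = "SalePrice"  # assumed name of the price column


def split_by_group(rows, groups, target):
    """Return {group: rows restricted to that group's columns plus the target}."""
    subsets = {}
    for name, columns in groups.items():
        keep = columns + [target]
        # row.get() yields None for columns absent from a row.
        subsets[name] = [{col: row.get(col) for col in keep} for row in rows]
    return subsets


def count_missing(rows):
    """Count values that are None or an empty string, per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            is_missing = value is None or value == ""
            counts[col] = counts.get(col, 0) + (1 if is_missing else 0)
    return counts


# Two invented rows, only partially filled in, to exercise the functions.
sample_rows = [
    {"Neighborhood": "NAmes", "Street": "Pave", "LotConfig": "Inside",
     "SaleType": "WD", "SaleCondition": "Normal", "SalePrice": "200000"},
    {"Neighborhood": "Gilbert", "Street": "Pave", "LotConfig": "",
     "SaleType": "WD", "SaleCondition": "Normal", "SalePrice": "185000"},
]

subsets = split_by_group(sample_rows, GROUPS, TARGET)
print(count_missing(subsets["surrounds"]))
```

With pandas, the same idea reduces to selecting `df[columns + [target]]` for each group. Counting missing values per subset is one way to start deciding what preprocessing each group needs.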
This week and in the weeks to come I will do a little work on this. I might alternate between this and another project or topic I would like to study. It’s a big one and I don’t want to get burnt out if I run into a wall. I have also noticed that some surprising solutions come from adjacent topics that are not the main focus.