Feb 27 Post Update
Project Update: Scraping Progress and Movement Toward Algorithmic Implementation
Over the past week, I have been working on web scraping to gather the data I need, trying a few different techniques. However, I quickly realized that this was taking much longer than I had anticipated given the amount of data required, so I started looking for an alternative source that would let me keep moving forward on other parts of the project instead of being held up by data collection.
After searching around the Internet for a while, I found an existing database of CSV files containing pitching statistics from the past five seasons. It had almost all of the data I needed (for the baseline project, though not for the reach aspect), and it was organized in a way that was easy for me to work with. I was thrilled to find this resource, since it allowed me to move forward with the project without further delay.
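As a rough illustration, loading and inspecting those season CSV files might look something like the sketch below. The file paths and layout here are hypothetical placeholders, not the actual structure of the database I found:

```python
import glob

import pandas as pd

# Hypothetical sketch: load each season's pitching CSV into one DataFrame.
# The real filenames and columns in the database will differ.
frames = []
for path in glob.glob("data/pitching_*.csv"):
    frames.append(pd.read_csv(path))

pitching = pd.concat(frames, ignore_index=True)

# Quick sanity checks on what was loaded.
print(pitching.shape)
print(pitching.columns.tolist())
print(pitching.head())
```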
With all the necessary data at my disposal, I want to start planning out my algorithmic approach to building the model. I will begin with pseudocode to identify any potential issues or bottlenecks in the algorithm before writing actual code.
One of the challenges I faced was figuring out how to deal with missing data. In any dataset there are bound to be some missing values, and it's important to have a plan in place for how to handle them. The main piece I still need to work out is how to get the umpire names into the dataset, which shouldn't be hard using a hashmap keyed by game.
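To make that concrete, here is a minimal sketch of the hashmap idea plus a simple missing-value policy. The file names and column names ("game_id", "umpire") are assumptions for illustration, not the real schema:

```python
import pandas as pd

# Hypothetical lookup file mapping each game to its home-plate umpire.
umpire_lookup = pd.read_csv("data/umpires.csv")
game_to_umpire = dict(zip(umpire_lookup["game_id"], umpire_lookup["umpire"]))

pitching = pd.read_csv("data/pitching_2022.csv")

# Hashmap (dict) lookup: attach the umpire name to each pitching row by game ID.
pitching["umpire"] = pitching["game_id"].map(game_to_umpire)

# One possible plan for missing values: drop rows with no umpire match,
# and fill gaps in numeric columns with each column's median.
pitching = pitching.dropna(subset=["umpire"])
numeric_cols = pitching.select_dtypes("number").columns
pitching[numeric_cols] = pitching[numeric_cols].fillna(pitching[numeric_cols].median())
```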
Overall, I am pleased with the progress I have made on this project so far. By finding a pre-existing database with the data I needed, I was able to overcome a significant hurdle in the project and move forward with building the model.
- Note: the CSV file of the data is too big for GitHub.