Project Overview

Our project uses the Online News Popularity dataset to complete a separate analysis for each of six data channels:

  1. Business
  2. Entertainment
  3. Lifestyle
  4. Social Media
  5. Tech
  6. World

For each analysis, the data is subset to the relevant channel and new variables are created. A few basic statistical summaries and a couple of contingency tables are computed. Following the summaries and tables, some graphs are constructed:

  • A bar plot of articles published per weekday
  • A boxplot of shares grouped by weekday
  • A couple of histograms of the shares variable
  • Three scatter plots to investigate the relationships between different variables

Following the graphs, four models are fit: two linear models, a random forest, and a boosted tree. Then, a comparison function takes in the RMSE values for all four models and chooses the best one, meaning the model with the lowest RMSE.
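To make the comparison step concrete, here is a minimal sketch in Python of how such a function could look. This is not the project's actual code: the language, the function names `rmse` and `best_model`, the model labels, and the RMSE numbers are all assumptions made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_model(rmse_by_model):
    """Return the (name, rmse) pair with the smallest RMSE."""
    if not rmse_by_model:
        raise ValueError("no models to compare")
    name = min(rmse_by_model, key=rmse_by_model.get)
    return name, rmse_by_model[name]


# Illustrative values only (hypothetical), not results from the project.
example_rmses = {
    "linear_model_1": 1.50,
    "linear_model_2": 1.40,
    "random_forest": 1.20,
    "boosted_tree": 1.25,
}

winner, winner_rmse = best_model(example_rmses)
print(f"{winner} (RMSE = {winner_rmse:.2f})")
```

Keeping the RMSE values in a dictionary keyed by model name means another model can be added to the comparison without changing the function itself.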

Prompt Questions

  • What would you do differently?
  • What was the most difficult part for you?
  • What are your big take-aways from this project?

To be completely honest, my partner and I handled this project pretty well. I’m not sure if there’s much I’d do differently in the future besides maybe just trying to start earlier. The most difficult part for me (and my partner) was the automation! We struggled with that a lot, though the rest of the project was pretty doable. My big takeaways from this project are:

  • Learning and seeing how to put together multiple analyses, which will be helpful in a corporate job.
  • Learning how to do the automation and how it works.
  • Learning how to work with data by doing a small EDA and fitting multiple models.
  • Learning how to write descriptive code that changes depending on the analysis that’s being rendered.

The links to the GitHub repository as well as the GitHub pages are below:

