Side Project · NLP · Web Scraping
What Makes a 5-Star Product?
Sephora Review Analysis
Motivation
I spend way too much time reading Sephora reviews before buying anything. Like, embarrassingly so. At some point I started noticing patterns — the language in 5-star reviews felt qualitatively different from 1-star reviews, and not just in the obvious “I love it” vs. “I hate it” way.
Five-star reviewers talked about specific results and timelines. One-star reviewers fixated on texture and scent. I wanted to see if the data actually backed up my intuition, or if I was just pattern-matching on a biased sample of reviews I happened to remember.
Plus, I was looking for an excuse to practice web scraping and NLP. This felt like a natural intersection of “things I know about” and “things I want to learn.”
Approach
I scraped product pages from Sephora’s skincare category using BeautifulSoup, collecting ~15,200 reviews across 180 products (moisturizers, serums, cleansers, and sunscreens). Each review included the star rating, review text, skin type, and whether the reviewer received the product for free.
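For flavor, here's the rough shape of the scraper. Treat the selectors as placeholders: Sephora's actual class names differ and, as the limitations section explains, change without warning.

```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (personal review-research project)"}

def scrape_reviews(product_url: str) -> list[dict]:
    """Pull all reviews from one product page.

    The CSS selectors below are illustrative stand-ins, not
    Sephora's real class names.
    """
    resp = requests.get(product_url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    reviews = []
    for node in soup.select("div.review"):  # placeholder selector
        reviews.append({
            "rating": int(node.select_one(".star-rating")["data-rating"]),
            "text": node.select_one(".review-text").get_text(strip=True),
            "skin_type": node.select_one(".skin-type").get_text(strip=True),
            "incentivized": node.select_one(".free-product-badge") is not None,
        })
    time.sleep(2)  # be polite between requests
    return reviews
```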
The analysis pipeline was straightforward (the preprocessing and TF-IDF steps are condensed into a code sketch right after this list):
- Text preprocessing: lowercasing, stopword removal, lemmatization
- TF-IDF vectorization to find terms most distinctive to 5-star vs. 1-star reviews
- NLTK VADER sentiment scoring on review subsets grouped by product attribute
- Manual tagging of ~500 reviews for attribute extraction (texture, scent, packaging, results, price)
- Correlation analysis between attribute mentions and star ratings
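Condensed, the preprocessing and TF-IDF comparison look roughly like this. The file and column names are placeholders; the mechanics are the point.

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg)

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    # lowercase, keep alphabetic non-stopword tokens, lemmatize
    tokens = nltk.word_tokenize(text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalpha() and t not in stop)

df = pd.read_csv("sephora_reviews.csv")  # hypothetical filename
five = df.loc[df["rating"] == 5, "text"].map(preprocess)
one = df.loc[df["rating"] == 1, "text"].map(preprocess)

# Fit one shared vocabulary, then compare mean TF-IDF weight per term
vec = TfidfVectorizer(ngram_range=(1, 3), min_df=5)
vec.fit(pd.concat([five, one]))

scores = pd.DataFrame({
    "five_star": vec.transform(five).mean(axis=0).A1,
    "one_star": vec.transform(one).mean(axis=0).A1,
}, index=vec.get_feature_names_out())
scores["gap"] = scores["five_star"] - scores["one_star"]
print(scores.nlargest(10, "gap"))   # most distinctively 5-star
print(scores.nsmallest(10, "gap"))  # most distinctively 1-star
```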
I filtered out reviews from users who received free products (marked as “incentivized”), since they skew significantly more positive and would muddy the signal.
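In code, that filter plus the VADER pass is short. The `incentivized` and `attribute` column names are mine, from the scraper and the manual tagging respectively:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

# Drop free-product reviews before any scoring
df = df[~df["incentivized"]].copy()

sia = SentimentIntensityAnalyzer()
df["compound"] = df["text"].map(lambda t: sia.polarity_scores(t)["compound"])

# Mean sentiment per hand-tagged attribute group
print(df.groupby("attribute")["compound"].agg(["mean", "count"]))
```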
Key Findings
The TF-IDF analysis surfaced some clear patterns about what language separates top-rated from bottom-rated products.
Top Distinguishing Terms (TF-IDF Scores)
| Term / Phrase | 5-Star TF-IDF | 1-Star TF-IDF | Signals |
|---|---|---|---|
| “results within” | 0.42 | 0.05 | 5-star |
| “broke me out” | 0.03 | 0.51 | 1-star |
| “sticky / greasy” | 0.04 | 0.47 | 1-star |
| “holy grail” | 0.38 | 0.01 | 5-star |
| “absorbed quickly” | 0.35 | 0.06 | 5-star |
| “strong smell” | 0.07 | 0.39 | 1-star |
| “repurchase” | 0.44 | 0.02 | 5-star |
| “not worth the price” | 0.01 | 0.36 | 1-star |
The attribute correlation analysis confirmed my hunch: texture is everything in skincare reviews.
Product Attribute Correlations with Star Rating
| Attribute | Correlation | Interpretation |
|---|---|---|
| Texture (lightweight, non-greasy) | +0.61 | Strongest positive predictor of 5-star |
| Results timeline mentioned | +0.48 | Reviews citing specific timelines rate higher |
| Scent complaints | -0.52 | Second strongest negative signal |
| Packaging aesthetics | +0.31 | Modest but consistent positive effect |
| Price-to-size ratio | -0.44 | Value perception drives many 1-star reviews |
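For the record, these numbers come from correlating a 0/1 "mentions this attribute" flag against the star rating, which amounts to a point-biserial correlation. A minimal version (flag column names hypothetical):

```python
from scipy.stats import pearsonr

# 0/1 flags from the manual tagging pass (names illustrative)
flags = ["mentions_texture", "mentions_timeline", "mentions_scent",
         "mentions_packaging", "mentions_value"]
for col in flags:
    r, p = pearsonr(df[col], df["rating"])
    print(f"{col}: r = {r:+.2f} (p = {p:.3g})")
```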
Headline Finding
Products that mention “results within 2 weeks” in their marketing (and whose reviews echo that timeline) receive 40% more 5-star reviews than comparable products without time-to-results framing. Texture complaints — “sticky,” “greasy,” “pilling” — are the single strongest predictor of 1-star ratings, even more than whether the product actually worked.
Limitations
This was a fun exploration, but it has real methodological gaps.
- Selection bias. Sephora’s review system heavily skews positive — the average rating across all products was 4.2 stars. People who hated a product are less likely to leave a review at all, so my 1-star sample was only ~1,800 reviews out of 15K.
- Scraping is brittle. Sephora changed their page structure twice while I was collecting data. The scraper broke both times and I had to patch it. A proper project would use their API (if they had a public one) or a more robust scraping framework like Scrapy.
- TF-IDF is a blunt instrument. It captures word frequency differences but misses context and sarcasm entirely. A transformer-based approach (BERT fine-tuned on review data) would catch more nuance, but that felt like overkill for a side project.
- Correlation, not causation. Products with good texture get better reviews — but that could just mean better-formulated products are both nicer to use AND more effective. I can’t disentangle the two from review text alone.
- Manual attribute tagging doesn’t scale. I tagged 500 reviews by hand to build the attribute extraction categories. With more time, I’d train a multi-label classifier to do this automatically.
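If you're curious what that classifier would look like, here's a minimal scikit-learn sketch (one binary head per attribute over TF-IDF features; variable names assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

ATTRS = ["texture", "scent", "packaging", "results", "price"]

# tagged_texts: the ~500 hand-tagged reviews (list of strings)
# tagged_labels: a (500, 5) 0/1 indicator matrix aligned with ATTRS
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(tagged_texts, tagged_labels)

# Then tag the other ~14,700 reviews automatically
auto_labels = clf.predict(untagged_texts)
```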
What I Learned
Web scraping is 20% writing the scraper and 80% dealing with edge cases, rate limiting, and sites changing their HTML. I have a much better appreciation now for why companies pay for data providers instead of scraping everything themselves.
On the NLP side, I learned that simple methods go surprisingly far when you have domain knowledge. TF-IDF plus knowing which attributes matter in skincare reviews got me 80% of the insights. The remaining 20% would require much heavier modeling — classic Pareto principle.
The most interesting takeaway was about consumer psychology: people forgive a product that takes time to work (as long as the brand sets expectations), but they won’t forgive a product that feels bad on their skin. Texture is table stakes — without it, efficacy doesn’t matter. That’s a useful insight if you’re building a skincare brand, and it came entirely from review text analysis.