2026-04-09

Introduction

  • This analysis uses simple linear regression to quantify the relationship between trip distance and price for ride-share trips.
  • Focus: relationship between distance and price
  • Based on: distance traveled in kilometers
  • Goal: create a clear, simple, interpretable model
  • Method: simple linear regression

Simple Linear Regression Mathematical Model

\[ Price = \beta_0 + \beta_1\times distance + \epsilon \]

In this equation:

  • \(Price\) is the dependent variable (the ride price being modeled)
  • \(distance\) is the independent variable
  • \(\beta_0\) is the intercept, the point where the regression line crosses the y-axis
  • \(\beta_1\) is the slope of the regression line
  • \(\epsilon\) is the error term, the gap between the observed price and the regression line; it is estimated by the residual \(Price - \hat{Price}\)
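
As a minimal sketch of how the two coefficients are estimated, the least-squares slope is \(\hat\beta_1 = \mathrm{cov}(distance, Price) / \mathrm{var}(distance)\) and the intercept is \(\hat\beta_0 = \overline{Price} - \hat\beta_1\,\overline{distance}\). The code below applies these formulas to the first five rows of the cleaned dataset table, purely for illustration; the variable names `beta0_hat`, `beta1_hat`, and `residuals_manual` are assumptions, and the numbers will not match the full 100-row fit.

```r
# Closed-form least-squares estimates for simple linear regression.
# The five (distance, price) pairs are the first five rows of the cleaned
# dataset table; they illustrate the formulas only.
distance <- c(2.97, 8.43, 5.46, 6.61, 10.50)
price    <- c(10.71, 22.41, 12.91, 15.70, 19.15)

# Slope: sample covariance of x and y divided by the sample variance of x
beta1_hat <- cov(distance, price) / var(distance)

# Intercept: forces the fitted line through the point of means
beta0_hat <- mean(price) - beta1_hat * mean(distance)

# Residuals: observed price minus fitted price
price_hat <- beta0_hat + beta1_hat * distance
residuals_manual <- price - price_hat

# Cross-check against R's built-in least-squares fit
fit <- lm(price ~ distance)
print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit))
```

The manual estimates and `coef(fit)` should agree, and the residuals of a least-squares fit with an intercept sum to zero up to rounding.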

Data Loading and Cleaning in R

The dataset was sourced from Kaggle, downloaded locally, and then imported into R for analysis.

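# Load the raw trip data (CSV downloaded from Kaggle)
Data <- read.csv("uber_trips_dataset_50k.csv",
                 header = TRUE)
# Keep the three columns of interest and give them shorter names
clean_data <- Data[, c("trip_id", "distance_km", "fare_amount")]
colnames(clean_data) <- c("ID", "distance", "price")
# Remove rows with missing values
clean_data <- na.omit(clean_data)
# Keep only rows with non-negative ID, distance, and price
# (all conditions must be combined in the single `subset` argument)
clean_data <- subset(clean_data, ID >= 0 & distance >= 0 & price >= 0)
# Use the first 100 rows for the analysis
clean_data <- clean_data[1:100, ]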
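
The cleaning steps can be tried on a small toy data frame before running them on the full file. This is only a sketch: the `raw` data frame and all of its values are made up for illustration, and only the column names follow the ones used in the code here.

```r
# Toy stand-in for the raw file; every value here is made up for illustration.
raw <- data.frame(
  trip_id     = c(1, 2, 3, 4, 5),
  distance_km = c(2.5, NA, -1.0, 4.0, 7.5),
  fare_amount = c(9.0, 12.0, 8.0, -3.0, 18.5),
  other_col   = c("a", "b", "c", "d", "e")
)

# Keep and rename the three columns of interest
toy <- raw[, c("trip_id", "distance_km", "fare_amount")]
colnames(toy) <- c("ID", "distance", "price")

# Drop rows with missing values (removes trip 2)
toy <- na.omit(toy)

# Row filter: all conditions go in ONE logical expression (second argument).
# A third positional argument to subset() is `select`, which chooses columns.
toy <- subset(toy, ID >= 0 & distance >= 0 & price >= 0)

print(toy)
```

Only trips 1 and 5 survive: trip 2 has a missing distance, trip 3 a negative distance, and trip 4 a negative price.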

Clean Dataset Table

Data preparation

 ID   distance   price
  1       2.97   10.71
  2       8.43   22.41
  3       5.46   12.91
  4       6.61   15.70
  5      10.50   19.15
  6       9.94   19.95
  7      12.22   25.89
  8      10.14   25.55
  9       1.88    7.97
 10       3.70   12.21
 11       7.50   13.99
 12       5.64   11.38
 13       1.60    4.93
 14       7.13   15.22
 15       1.74    4.88
 16       5.94   14.07
 17       9.40   24.10
 18       7.19   19.78
 19       4.49   13.35
 20       5.22   13.43
 21      13.13   30.11
 22       6.43   16.21
 23       8.10   15.18
 24       6.27   13.78
 25       5.30   12.22
 26       9.34   17.28
 27       6.77   16.78
 28       5.45   15.63
 29       9.32   20.11
 30       7.76   14.05
 31       1.50    8.16
 32       6.23   13.16
 33       4.94   12.42
 34       7.47   17.27
 35       5.90    9.32
 36      15.66   28.08
 37      11.02   25.88
 38       3.69    8.58
 39       9.22   22.43
 40      11.60   25.12
 41       8.09   21.92
 42       9.07   21.72
 43      10.21   16.31
 44      12.02   27.82
 45      12.40   24.29
 46       1.99    5.48
 47       4.49   12.80
 48       6.86   19.92
 49       4.62   10.52
 50      12.27   24.77
 51       9.45   15.87
 52       7.43   20.51
 53       8.35   23.61
 54       3.98   10.12
 55      10.94   28.99
 56       8.24   19.03
 57       6.74   15.99
 58      10.95   16.44
 59       7.75   17.34
 60       9.34   27.04
 61       0.37    5.22
 62       4.40    8.95
 63       3.50   10.02
 64       3.52   10.76
 65       7.03   18.65
 66      12.71   31.17
 67       4.17   12.04
 68       9.15   14.33
 69      12.07   18.71
 70      13.15   20.62
 71       5.18   11.08
 72      11.33   28.91
 73      10.69   29.77
 74      10.31   14.94
 75       1.99    5.31
 76       6.04    9.51
 77       8.19   23.71
 78      10.20   15.66
 79       7.66   17.76
 80       8.41   16.51
 81       8.77   22.24
 82       5.18   10.83
 83       4.60    8.52
 84       9.18   21.84
 85       4.37   10.57
 86       1.82    7.01
 87       7.29   18.83
 88       6.97   19.52
 89       2.96    4.97
 90      10.48   25.39
 91       4.27   13.48
 92       4.29   12.96
 93       1.76    6.65
 94       6.13   11.34
 95       7.32   19.38
 96       7.91   18.68
 97       6.80   15.13
 98       2.15    9.22
 99       4.98   13.58
100       5.59   14.51

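Code for Scatter Plot of Distance vs. Price

library(ggplot2)

# Scatter plot of price against distance with a fitted least-squares line
p1 <- ggplot(clean_data, aes(x = distance, y = price)) +
  geom_point(color = "blue") +
  # se = TRUE draws the confidence band around the fitted line
  geom_smooth(method = "lm", color = "black", se = TRUE) +
  ggtitle("Uber Price vs Distance") +
  labs(
    x = "Distance in kilometers",
    y = "Price in Dollars ($)")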

Scatter Plot

## `geom_smooth()` using formula = 'y ~ x'

Linear Regression Model Fit

model <- lm(price ~ distance, data = clean_data)
summary(model)
Call:
lm(formula = price ~ distance, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1904 -1.8427  0.2556  2.1418  6.9572 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.61420    0.76050   4.752 6.89e-06 ***
distance     1.79594    0.09754  18.412  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.124 on 98 degrees of freedom
Multiple R-squared:  0.7757,    Adjusted R-squared:  0.7735 
F-statistic:   339 on 1 and 98 DF,  p-value: < 2.2e-16
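
  • Simple linear regression model fitted: both coefficients are statistically significant (p < 0.001), and distance explains about 78% of the variance in price (R-squared = 0.7757)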
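
Several figures in the summary output are arithmetically linked, which gives a quick way to sanity-check them. The sketch below recomputes the t values, the F-statistic, and R-squared from the reported estimates and standard errors only; the variable names are assumptions, and small differences in the last digit are expected because the inputs are already rounded.

```r
# Values copied from the summary(model) output
est_slope <- 1.79594; se_slope <- 0.09754
est_int   <- 3.61420; se_int   <- 0.76050
df_resid  <- 98

# t value = estimate / standard error
t_slope <- est_slope / se_slope
t_int   <- est_int / se_int

# With a single predictor, the F-statistic equals the squared slope t value
f_stat <- t_slope^2

# R-squared can be recovered from F and the residual degrees of freedom
r_squared <- f_stat / (f_stat + df_resid)

# 95% confidence interval for the slope, using the t distribution
t_crit   <- qt(0.975, df_resid)
ci_slope <- est_slope + c(-1, 1) * t_crit * se_slope

print(c(t_slope = t_slope, t_int = t_int))
print(c(F = f_stat, R2 = r_squared))
print(ci_slope)
```

If the recomputed values land close to the printed t values, F-statistic, and Multiple R-squared, the table has been read correctly. The confidence interval is not part of the original output; it is an extra derived from the same numbers.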

Regression Equation

This regression equation shows the relationship between distance and price. The intercept represents the base fare, and the slope shows how much the ride price increases per kilometer.

\[ \hat{Price} = 3.61 + 1.80\times distance \]

  • Base fare (intercept) = 3.61 dollars
  • Cost per kilometer (slope) = 1.80 dollars
  • Positive linear relationship between distance and price
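
To make the equation concrete, here is a small sketch that plugs distances into the fitted line, using the unrounded coefficient estimates from the `summary(model)` output (3.61420 and 1.79594). The helper name `predict_price` is hypothetical and is not part of the original analysis.

```r
# Coefficients copied from the summary(model) output
b0 <- 3.61420   # intercept (base fare, in dollars)
b1 <- 1.79594   # slope (dollars per kilometer)

# Hypothetical helper: predicted price for one or more distances in km
predict_price <- function(distance_km) {
  b0 + b1 * distance_km
}

# Trip 1 in the cleaned data: distance 2.97 km, observed price 10.71
fitted_1   <- predict_price(2.97)
residual_1 <- 10.71 - fitted_1

print(fitted_1)
print(residual_1)

# Each extra kilometer adds b1 dollars to the predicted price
print(predict_price(6) - predict_price(5))
```

The fitted value for trip 1 should agree with the first entry of the prediction output (8.948149) to within rounding of the coefficients.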

Code for Residual Plot

library(ggplot2)
# Store each trip's residual (observed price minus fitted price)
clean_data$Residuals <- residuals(model)
# Plot residuals against distance; the dashed line marks a residual of zero
p <- ggplot(clean_data, aes(x = distance, y = Residuals)) +
  geom_point(color = "#56B4E9", size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggtitle("Residual Plot") +
  xlab("Distance in Kilometers") +
  ylab("Residuals in Dollars")

Residual Plot

Plotly Interactive Code for the Linear Regression

library(plotly)

# Interactive scatter plot; hovering over a point shows its distance and price
p2 <- plot_ly(clean_data,
              x = ~distance,
              y = ~price,
              type = "scatter",
              mode = "markers",
              text = ~paste("Distance:", distance, "<br>Price:", price),
              marker = list(color = "darkblue", size = 10)) %>%
  layout(title = "Interactive Uber Price Plot",
         xaxis = list(title = "Distance in Kilometers"),
         yaxis = list(title = "Price in dollars"))

Plotly Interactive Plot for the Linear Regression

Data Visualization in 3D

library(plotly)

# 3D scatter plot: distance (x), price (y), and trip ID (z), colored by price
p2 <- plot_ly(clean_data,
              x = ~distance,
              y = ~price,
              z = ~ID,
              type = "scatter3d",
              mode = "markers",
              marker = list(color = ~price, colorscale = "Viridis", size = 6),
              text = ~paste("Distance:", distance, "<br>Price:", price)) %>%
  layout(title = "3D visualization of price vs Distance")

Plotly Visualization in 3D

Interpretation

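  • The positive slope shows that price increases with distance: each additional kilometer adds about 1.80 dollars to the predicted price
  • The R-squared of 0.7757 indicates a strong relationship: distance explains about 78% of the variation in price
  • The model can help predict future ride-share costs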

Prediction

Predicted (fitted) prices in dollars for the 100 trips in the cleaned dataset:

        1         2         3         4         5         6         7         8 
 8.948149 18.754001 13.420048 15.485383 22.471604 21.465875 25.560627 21.825064 
        9        10        11        12        13        14        15        16 
 6.990570 10.259188 17.083773 13.743318  6.487706 16.419274  6.739138 14.282101 
       17        18        19        20        21        22        23        24 
20.496066 16.527031 11.677983 12.989022 27.194935 15.162113 18.161339 14.874763 
       25        26        27        28        29        30        31        32 
13.132697 20.388309 15.772734 13.402089 20.352390 17.550718  6.308112 14.802925 
       33        34        35        36        37        38        39        40 
12.486158 17.029895 14.210263 31.738673 23.405495 10.241228 20.172796 24.447142 
       41        42        43        44        45        46        47        48 
18.143380 19.903405 21.950780 25.201438 25.883897  7.188124 11.677983 15.934369 
       49        50        51        52        53        54        55        56 
11.911456 25.650424 20.585863 16.958057 18.610325 10.762052 23.261819 18.412771 
       57        58        59        60        61        62        63        64 
15.718856 23.279778 17.532759 20.388309  4.278695 11.516348  9.899999  9.935918 
       65        66        67        68        69        70        71        72 
16.239680 26.440639 11.103281 20.047080 25.291235 27.230854 12.917184 23.962237 
       73        74        75        76        77        78        79        80 
22.812833 22.130375  7.188124 14.461696 18.322974 21.932821 17.371124 18.718082 
       81        82        83        84        85        86        87        88 
19.364621 12.917184 11.875537 20.100958 11.462470  6.882814 16.706625 16.131923 
       89        90        91        92        93        94        95        96 
 8.930189 22.435685 11.282875 11.318794  6.775057 14.623330 16.760503 17.820110 
       97        98        99       100 
15.826613  7.475475 12.557995 13.653521 
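
Output like the listing above is what `predict()` returns when it is called on a fitted model without new data. The sketch below shows that call and also how predictions for unseen distances might be obtained. It is self-contained, so it refits a model on only the first ten rows of the cleaned dataset table; `sample_data`, `sample_model`, and the distances in `new_trips` are assumptions for illustration, and the resulting numbers will differ from the 100-row fit.

```r
# First ten rows of the cleaned dataset table, used as a small stand-in
# for clean_data; coefficients will differ from the 100-row fit.
sample_data <- data.frame(
  distance = c(2.97, 8.43, 5.46, 6.61, 10.50, 9.94, 12.22, 10.14, 1.88, 3.70),
  price    = c(10.71, 22.41, 12.91, 15.70, 19.15, 19.95, 25.89, 25.55, 7.97, 12.21)
)
sample_model <- lm(price ~ distance, data = sample_data)

# Without newdata, predict() returns the fitted values for the training rows
in_sample <- predict(sample_model)

# Hypothetical new trips (distances in km) that are not in the dataset
new_trips <- data.frame(distance = c(3, 7.5, 12))

# interval = "prediction" adds lower and upper bounds for a single new ride
out_of_sample <- predict(sample_model, newdata = new_trips,
                         interval = "prediction", level = 0.95)

print(head(in_sample))
print(cbind(new_trips, out_of_sample))
```

The `fit` column is the point prediction, while `lwr` and `upr` bound the range in which a single new ride's price is expected to fall at the chosen level, assuming the linear model holds.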

Conclusion:

To the End

Thank You